Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find a way to not require {{ workingDir }} within a DuckDB query when input files are provided #267

Open
anna-geller opened this issue Apr 7, 2024 · 1 comment
Assignees
Labels
area/plugin Plugin-related issue or feature request enhancement New feature or request

Comments

@anna-geller
Copy link
Member

Feature description

See an example:

id: api_python_sql
namespace: blueprint

tasks:
  - id: extract
    type: io.kestra.plugin.fs.http.Download
    uri: https://dummyjson.com/products

  - id: transform
    type: io.kestra.plugin.scripts.python.Script
    docker:
      image: python:3.11-alpine # 57 MB
    inputFiles:
      data.json: "{{ outputs.extract.uri }}"
    outputFiles:
      - "*.json"
    script: |
      import json

      with open("data.json", 'r') as file:
        data = json.load(file)

      filtered_data = [{"brand": product["brand"], "price": product["price"]} for product in data["products"]]
      with open("products.json", 'w') as file:
        json.dump(filtered_data, file, indent=4)

  - id: sqlQuery
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      products.json: "{{ outputs.transform.outputFiles['products.json'] }}"
    sql: |
      INSTALL json;
      LOAD json;
      SELECT brand, round(avg(price), 2) as avg_price
      FROM read_json_auto('{{ workingDir }}/products.json') # ❌ this is error-prone and not intuitive
      GROUP BY brand
      ORDER BY avg_price DESC;
    store: true

Ideally, the same should work by using:

...
  - id: sqlQuery
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles:
      products.json: "{{ outputs.transform.outputFiles['products.json'] }}"
    sql: |
      INSTALL json;
      LOAD json;
      SELECT brand, round(avg(price), 2) as avg_price
      FROM read_json_auto('products.json')
      GROUP BY brand
      ORDER BY avg_price DESC;
    store: true
@anna-geller anna-geller added the enhancement New feature or request label Apr 7, 2024
@anna-geller anna-geller added this to the v0.17.0 milestone Apr 7, 2024
@anna-geller anna-geller modified the milestones: v0.17.0, v0.18.0 Apr 17, 2024
@github-project-automation github-project-automation bot moved this to Backlog in Issues Jun 10, 2024
@anna-geller anna-geller removed this from the v0.18.0 milestone Jul 4, 2024
@anna-geller
Copy link
Member Author

adding another example raised by @Ben8t -- this should work out of the box without having to use any extra tasks or {{workingDir}} expression:

id: train_model
namespace: kestra.weather.dev.ml

tasks:
  - id: get_data
    type: io.kestra.plugin.gcp.gcs.Downloads
    from: "gs://weather-kestra/data/"
    action: NONE
    listingType: RECURSIVE
    serviceAccount: "{{ secret('GCP_SA') }}"

  - id: query
    type: io.kestra.plugin.jdbc.duckdb.Query
    inputFiles: "{{ outputs.get_data.outputFiles }}"
    sql: |
      SELECT * FROM read_csv_auto("*.csv")
    store: true

@Ben8t Ben8t added the area/plugin Plugin-related issue or feature request label Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/plugin Plugin-related issue or feature request enhancement New feature or request
Projects
Status: Backlog
Development

No branches or pull requests

3 participants