
Time Partitioned Incremental loading #1094

Closed
timhiebenthal opened this issue Mar 14, 2024 · 4 comments
Labels: community (This issue came from slack community workspace), enhancement (New feature or request)

@timhiebenthal
Feature description

Currently you need to specify a primary key to load data incrementally through the merge write disposition.

It's not uncommon, especially with reporting APIs, that you don't have any specific unique key and also don't need one.
Currently you would need to build a surrogate/primary key to do incremental loads.

How about replacing the data not through the primary_key but through a time period?

Example:

  1. you pull data since 2024-03-01
  2. you delete data in the destination since 2024-03-01
  3. you upload the data from step 1 into the destination

Adding an end_date as well would make sense, but I wanted to keep the example simple; the sketch below illustrates the steps.
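In destination-side SQL terms, this is a delete-insert over the loaded time window. A minimal sketch, using sqlite3 purely for illustration; the reports table and its columns are hypothetical names, not taken from the issue:

```python
import sqlite3

# Minimal sketch of the requested delete-insert by time window; the "reports"
# table and its columns are hypothetical, sqlite3 is used only for illustration.
def replace_time_window(conn: sqlite3.Connection, rows: list[dict], since: str) -> None:
    # step 2: drop the window that is about to be reloaded in the destination
    conn.execute("DELETE FROM reports WHERE report_date >= ?", (since,))
    # step 3: insert the freshly extracted rows (step 1, the extraction, happened upstream)
    conn.executemany(
        "INSERT INTO reports (report_date, clicks) VALUES (:report_date, :clicks)",
        rows,
    )
    conn.commit()
```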

Are you a dlt user?

Yes, I run dlt in production.

Use case

Sometimes the primary key needed for incremental loading is cumbersome to build (e.g. because the relevant fields sit in nested dictionaries), so you need to do more transformations before loading than would otherwise be necessary; see the example below.
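For illustration, a reporting API might return rows like the following (a hypothetical payload): no single field is unique, so a surrogate key would have to be hashed together from several nested values before loading.

```python
# Hypothetical reporting API row: there is no natural unique key, so a
# surrogate key would have to be derived from the date plus nested dimension fields.
row = {
    "date": "2024-03-01",
    "dimensions": {"campaign": "spring_sale", "country": "DE"},
    "metrics": {"clicks": 120, "impressions": 4300},
}
```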

Proposed solution

Specify

  • a partition_type (e.g. "key" or "time") and
  • a partition_column (the name of the key or time column)

"Key" would work as it does currently.
"Time" would (see the sketch after this list)

  1. identify the min() and max() values of the new increment to be uploaded
  2. delete everything between min and max in the destination
  3. upload the new increment to the destination
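A possible shape for that configuration, written as a plain dict because neither argument exists in dlt today; this is purely a hypothetical sketch of the proposal:

```python
# Hypothetical sketch of the proposed resource configuration;
# partition_type and partition_column are NOT existing dlt arguments.
proposed_hints = {
    "write_disposition": "merge",
    "partition_type": "time",           # "key" would keep today's primary-key behavior
    "partition_column": "report_date",  # min()/max() of this column bound the replaced window
}
```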

Related issues

No response

@rudolfix
Collaborator

@timhiebenthal I think merge_key is doing more or less what you want. https://dlthub.com/docs/general-usage/incremental-loading#merge-incremental-loading

You can use it instead of, or together with, the primary key to replace partitions of data (i.e. days).
For completely custom partitions you can generate a merge column by adding add_map on the resource. You can also approximate more granular time ranges, e.g. truncating updated_at to hourly resolution lets you replace data with hourly granularity.
https://dlthub.com/docs/general-usage/resource#filter-transform-and-pivot-data
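Based on the linked docs, a minimal sketch of that approach could look like the following; the resource name, column names, and the sample row are assumptions, not taken from the issue:

```python
import dlt

# Sketch based on the merge_key docs; all names below are assumptions.
@dlt.resource(write_disposition="merge", merge_key="day")
def reports(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-03-01T00:00:00"),
):
    # each row carries a "day" partition column; on a merge load, destination rows
    # whose "day" value appears in the new increment are replaced by the increment
    yield {"day": "2024-03-01", "updated_at": "2024-03-01T10:00:00", "clicks": 120}

# For finer granularity, derive an hourly bucket with add_map and use it as the merge key:
def add_hour_bucket(row):
    row["hour_bucket"] = row["updated_at"][:13]  # e.g. "2024-03-01T10"
    return row

hourly_reports = reports.add_map(add_hour_bucket)
hourly_reports.apply_hints(merge_key="hour_bucket")
```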

I think our merge_key documentation is lacking. We'll try to improve it.

@rudolfix rudolfix added the enhancement (New feature or request) and community (This issue came from slack community workspace) labels Mar 15, 2024
@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Mar 15, 2024
@rudolfix rudolfix self-assigned this Mar 18, 2024
@karakanb
Contributor

For the sake of cross-referencing, this seems to be the same use case as the report here: #971 (comment)

@rudolfix
Collaborator

@karakanb my learning from the linked issue is that we should disable deduplication if merge key is present and primary key is set... which is IMO expected behavior, as now the "deduplication" should happen via the merge key upstream. That should fix the issue you describe.

@rudolfix
Collaborator

rudolfix commented Mar 22, 2024

moved to #1131

@github-project-automation github-project-automation bot moved this from In Progress to Done in dlt core library Mar 22, 2024