Time Partitioned Incremental loading #1094

timhiebenthal · 2024-03-14T23:23:38Z

Feature description

Currently you need to specify a primary key to load data incrementally through the merge write disposition.

It's not uncommon, especially with "reporting APIs", that you don't have a specific unique key and also don't need one.
Currently you would need to build a surrogate/primary key to do incremental loads.

How about replacing the data not through the primary_key but through a time period?

Example:

1. you pull data since 2024-03-01
2. you delete data in the destination since 2024-03-01
3. you upload the data from step 1 into the destination

Adding an end_date would also make sense, but I wanted to keep the example simple.

Are you a dlt user?

Yes, I run dlt in production.

Use case

Sometimes building a primary key for incremental loading is cumbersome (e.g. because the relevant fields sit in nested dictionaries),
so you end up doing more transformations before loading than you would otherwise need.

Proposed solution

Specify

a partition_type (e.g. "key" or "time") and
a partition_column (the name of the key- or time-column)

"Key" would work as it does currently.
"Time" would:

1. Identify the min() and max() value of the partition_column in the new increment to be uploaded
2. Delete everything between min and max in the destination
3. Upload the new increment to the destination
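
The three "Time" steps above can be sketched outside of dlt as a plain delete-then-insert. This is only an illustration of the proposed behaviour, not how dlt implements anything: the helper name `replace_time_partition`, the `report` table and its columns are made up for the example, and sqlite3 stands in for a real destination. It assumes the partition column holds values that sort correctly (here ISO date strings).

```python
import sqlite3


def replace_time_partition(conn, table, partition_column, rows):
    """Replace the time range covered by `rows` in `table` (hypothetical helper).

    `table` and `partition_column` are interpolated into SQL, so they must
    come from trusted code, never from user input.
    """
    if not rows:
        return 0
    # Step 1: find the min and max of the partition column in the new increment.
    values = [row[partition_column] for row in rows]
    lo, hi = min(values), max(values)
    columns = list(rows[0].keys())
    column_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # delete and insert commit together or not at all
        # Step 2: delete everything between min and max (inclusive).
        conn.execute(
            f'DELETE FROM "{table}" WHERE "{partition_column}" BETWEEN ? AND ?',
            (lo, hi),
        )
        # Step 3: upload the new increment.
        conn.executemany(
            f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})',
            [tuple(row[c] for c in columns) for row in rows],
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE report (day TEXT, clicks INTEGER)")
conn.executemany(
    "INSERT INTO report VALUES (?, ?)",
    [("2024-02-28", 5), ("2024-03-01", 1), ("2024-03-02", 2)],
)
conn.commit()

increment = [
    {"day": "2024-03-01", "clicks": 10},
    {"day": "2024-03-02", "clicks": 20},
    {"day": "2024-03-03", "clicks": 30},
]
replace_time_partition(conn, "report", "day", increment)
result = conn.execute("SELECT day, clicks FROM report ORDER BY day").fetchall()
```

Rows before 2024-03-01 are left untouched, while everything inside the increment's range is replaced, so no unique key is needed.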

Related issues

No response

rudolfix · 2024-03-15T09:51:07Z

@timhiebenthal I think merge_key is doing more or less what you want. https://dlthub.com/docs/general-usage/incremental-loading#merge-incremental-loading

You can use it instead of, or together with, a primary key to replace partitions of data (i.e. days).
For completely custom partitions you can generate a merge column by calling add_map on the resource. You can approximate more granular time ranges this way: e.g. updated_at truncated to hourly resolution lets you replace data with hourly granularity.
https://dlthub.com/docs/general-usage/resource#filter-transform-and-pivot-data
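
A rough sketch of that suggestion, under some assumptions: the column name `updated_hour`, the resource name `report` and the helper names are invented for the example, and `updated_at` is assumed to be an ISO 8601 string without a timezone suffix. The mapping function is plain Python; the dlt wiring uses `dlt.resource` with `write_disposition="merge"` and `merge_key`, plus `add_map`, as described in the linked docs, and is kept in a separate function so the mapping can be tried without dlt installed.

```python
from datetime import datetime


def add_hour_bucket(item):
    """Add a hypothetical `updated_hour` column: `updated_at` truncated to the hour."""
    ts = datetime.fromisoformat(item["updated_at"])
    item["updated_hour"] = ts.replace(minute=0, second=0, microsecond=0).isoformat()
    return item


def build_resource(rows):
    """Wrap `rows` in a merge resource keyed on the generated hour bucket."""
    import dlt  # imported lazily so add_hour_bucket works without dlt

    @dlt.resource(name="report", write_disposition="merge", merge_key="updated_hour")
    def report():
        yield from rows

    # add_map applies the function to every item before it is loaded
    return report().add_map(add_hour_bucket)


example = add_hour_bucket({"updated_at": "2024-03-01T13:45:10", "clicks": 3})
```

With a setup like this, every hour bucket present in a new load should replace the rows with the same `updated_hour` in the destination, which is the hourly granularity mentioned above.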

I think our merge_key documentation is lacking. We'll try to improve it

karakanb · 2024-03-20T16:37:45Z

For the sake of cross-referencing, this seems to be the same use case as the one reported here: #971 (comment)

rudolfix · 2024-03-22T09:24:05Z

@karakanb my learning from the linked issue is that we should disable deduplication if a merge key is present and a primary key is set... which is IMO the expected behavior, as the "deduplication" should then happen via the merge key upstream. That should fix the issue you describe.

rudolfix · 2024-03-22T10:07:19Z

moved to #1131

github-project-automation bot added this to dlt core library Mar 14, 2024

github-project-automation bot moved this to Todo in dlt core library Mar 14, 2024

rudolfix added the enhancement (New feature or request) and community (This issue came from slack community workspace) labels Mar 15, 2024

rudolfix moved this from Todo to In Progress in dlt core library Mar 15, 2024

rudolfix self-assigned this Mar 18, 2024

rudolfix closed this as completed Mar 22, 2024

github-project-automation bot moved this from In Progress to Done in dlt core library Mar 22, 2024
