[Feature] dbt microbatch, max event_time alternative to now()-lookback #11129

Open
3 tasks done
szalaj opened this issue Dec 11, 2024 · 0 comments
Labels
enhancement (New feature or request), microbatch (Issues related to the microbatch incremental strategy), triage

szalaj commented Dec 11, 2024

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

description

Currently, for the microbatch incremental strategy, the only way to handle latency between now() and the time the data is actually loaded is to use the lookback attribute in the config.

Assume a model with the microbatch strategy, batch_size=day and lookback=3. We run dbt run --full-refresh once on 2024-10-07 (loading all batches up to now()), and on subsequent days we load only the batches for now()-lookback. That does not cover cases where:
I. Latency for some batch is unexpectedly greater than usual (ref. D in the attached picture).
II. There is a gap between dbt runs (ref. G, H in the attached picture).

[image: the picture referenced above, attached in the original issue]

A problem also arises when:
III. A model gets data from several refs, each of which could have a different latency.
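Case II can be reproduced with a small simulation. This is only a sketch, not dbt's actual batch logic: it assumes each daily run covers the run day plus the `lookback` previous days, and the run dates after the full refresh are made up for illustration.

```python
from datetime import date, timedelta

def lookback_window(run_day: date, lookback: int) -> list[date]:
    """Daily batches one run covers: the run day plus `lookback` prior days (assumption)."""
    return [run_day - timedelta(days=n) for n in range(lookback, -1, -1)]

def uncovered_days(full_refresh_day: date, run_days: list[date], lookback: int) -> list[date]:
    """Days after the full refresh that no later run ever touches."""
    covered = set()
    for run_day in run_days:
        covered.update(lookback_window(run_day, lookback))
    last_day = max(run_days)
    day = full_refresh_day + timedelta(days=1)
    missed = []
    while day <= last_day:
        if day not in covered:
            missed.append(day)
        day += timedelta(days=1)
    return missed

# Full refresh on 2024-10-07, then runs on 10-08 and 10-09, then nothing until 10-15.
runs = [date(2024, 10, 8), date(2024, 10, 9), date(2024, 10, 15)]
missed = uncovered_days(date(2024, 10, 7), runs, lookback=3)
print(missed)  # the days that fall in the gap between runs
```

With lookback=3 the run on 10-15 reaches back only to 10-12, so the batches for 10-10 and 10-11 are never processed unless someone backfills them by hand.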

proposed solution

Introduce a new config attribute, max_event_time: True/False, with default value False.

Before running batches, dbt would get the max event time from each source table and take the min of them, producing a min_of_max_event_time parameter.

It would then run batches between min_of_max_event_time-lookback and now().

So the lookback attribute could be used in both cases, whether max_event_time is set to True or False.
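A minimal sketch of how the start of the batch range could be derived under this proposal. Nothing here is existing dbt behavior: the function name, the upstream names and their timestamps are hypothetical, and truncating to the start of the day is my assumption for batch_size=day.

```python
from datetime import datetime, timedelta

def proposed_batch_start(max_event_times, lookback, batch_size=timedelta(days=1)):
    """Start of the batch range: min of the per-source max event times, minus lookback batches."""
    # Take the least advanced source so no source's late data is skipped.
    min_of_max_event_time = min(max_event_times.values())
    # Align to the start of the daily batch (assumption for batch_size=day).
    day_start = min_of_max_event_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return day_start - lookback * batch_size

# Hypothetical max event_time values, one per source.
upstream_max = {
    "orders": datetime(2024, 10, 9, 18, 0),
    "payments": datetime(2024, 10, 6, 7, 30),
}
start = proposed_batch_start(upstream_max, lookback=3)
print(start)  # batches would run from here up to now()
```

Taking the min across sources is what addresses case III: the source with the highest latency decides how far back the run starts.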

Note: #10702 is different because it concerns only the first run (begin).

Describe alternatives you've considered

One workaround is to increase lookback, but most of the time that would be a waste of time and resources, and it can't always be 100% accurate.

A second is to create a custom test and, if a missed event_time is detected, rerun it using the --event-time-start and --event-time-end flags. This, though, introduces troublesome additional maintenance time.
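To show what the second workaround involves, here is a rough sketch that turns a list of missed days into backfill commands. The model name `my_model` is hypothetical, the commands are only built and not executed, and I am assuming --event-time-end is exclusive, which should be checked against the dbt docs.

```python
from datetime import date, timedelta

def contiguous_ranges(days):
    """Group days into (first, last) runs of consecutive dates."""
    ranges = []
    for day in sorted(set(days)):
        if ranges and day - ranges[-1][1] == timedelta(days=1):
            ranges[-1][1] = day  # extend the current run
        else:
            ranges.append([day, day])  # start a new run
    return [(first, last) for first, last in ranges]

def backfill_commands(days, model):
    """One dbt invocation (as an argument list) per contiguous range of missed days."""
    return [
        [
            "dbt", "run", "--select", model,
            "--event-time-start", first.isoformat(),
            # Assumption: the end bound is exclusive, so add one day.
            "--event-time-end", (last + timedelta(days=1)).isoformat(),
        ]
        for first, last in contiguous_ranges(days)
    ]

missed_days = [date(2024, 10, 10), date(2024, 10, 11), date(2024, 10, 13)]
commands = backfill_commands(missed_days, "my_model")
for cmd in commands:
    print(" ".join(cmd))
```

Someone still has to detect the missed days, run this and keep it working, which is the maintenance cost mentioned above.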

Who will this benefit?

Teams that use big tables with random data-loading latency and want to minimize maintenance time.

Are you interested in contributing this feature?

yes

Anything else?

No response

@szalaj szalaj added enhancement New feature or request triage labels Dec 11, 2024
@szalaj szalaj changed the title [Feature] dbt microbatch, min_event_time alternative to lookback [Feature] dbt microbatch, min_event_time alternative to now()-lookback Dec 11, 2024
@szalaj szalaj changed the title [Feature] dbt microbatch, min_event_time alternative to now()-lookback [Feature] dbt microbatch, max_event_time alternative to now()-lookback Dec 11, 2024
@szalaj szalaj changed the title [Feature] dbt microbatch, max_event_time alternative to now()-lookback [Feature] dbt microbatch, max event_time alternative to now()-lookback Dec 11, 2024
@ernestoongaro ernestoongaro added the microbatch Issues related to the microbatch incremental strategy label Dec 11, 2024