[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query #8797
Comments
From @sp-tkerlavage
In conversation with @graciegoheen, we decided that we aren't currently aware of enough distinct use cases for custom SQL-defined freshness beyond supporting freshness of external tables in dbt-snowflake. For now, I'm closing this in favor of dbt-labs/dbt-snowflake#1061. If we come across more use cases in the future, we can address them as they come up. Perhaps this framing will be useful then.
I'm going to re-open as I've heard from a couple folks that they use side control tables to track updates and status of their main large tables. For example:
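A hypothetical sketch of that kind of control-table setup, assuming a proposed `loaded_at_query` property on the source table (the property name and placement, and the `etl_control.load_audit` table and its columns, are illustrative rather than anything dbt supports today):

```yaml
# Hypothetical syntax only: `loaded_at_query` is the property being proposed,
# not something dbt supports today, and the control table/columns are made up.
version: 2

sources:
  - name: warehouse
    tables:
      - name: orders
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        # freshness would come from this query instead of max(loaded_at_field)
        loaded_at_query: >
          select max(load_completed_at)
          from etl_control.load_audit
          where table_name = 'orders'
```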
+1 @graciegoheen, that's exactly what our org needs to be able to pull source freshness for some tables that are populated by an upstream team. For context, the jobs that team runs to populate those tables execute once a day, but they load the table in "micro loads" of 1-60 records every couple of seconds until the full load has completed. That can cause queries to return a partial dataset if you hit the table during the load job, so source freshness against the table itself will report that the data has been updated before the load has actually completed. An audit table (similar to what you posted) is available with metadata; if we could use that to determine the source freshness for the table with a query just like your example, we could provide our consumers with much more accurate information and use the freshness selectors to help protect people from these partial loads.
This could also be useful for surfacing partial errors in your data. In our case we get a lot of data from our IoT nodes. If there is an outage affecting a single customer, only that customer's data would fall behind; other customers' data would be fine, so the issue wouldn't trigger on source freshness. Downstream transformations would then build on the stale data from that one lagging customer and include invalid data. A custom query would allow us to check that everyone's data is fresh, not just that at least someone's data is fresh.
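A minimal sketch of the kind of custom freshness query described above (the source, table, and column names are made up): reporting the *least* fresh customer means a single lagging customer shows up as stale instead of being hidden behind everyone else's fresh data.

```sql
-- Illustrative only: assumes an IoT readings source with
-- `customer_id` and `loaded_at` columns.
select min(customer_max_loaded_at) as max_loaded_at
from (
    select
        customer_id,
        max(loaded_at) as customer_max_loaded_at
    from {{ source('iot', 'readings') }}
    group by customer_id
) as per_customer
```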
Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#6695
Is this your first time submitting a feature request?
Describe the feature
We currently support 2 ways to generate source freshness (via warehouse metadata tables & via a `loaded_at_field`). We should support a 3rd way to generate source freshness - via a `loaded_at_query`.

- Option 1: freshness config added (get freshness from warehouse metadata tables)
- Option 2: freshness config added with `loaded_at_field` (get freshness from `select max(loaded_at_field) … from this …`)
- Option 3: freshness config added with "how does dbt consider a source to be fresh"? (get the equivalent of `max(loaded_at_field)` from executing a custom query)

The built-in macro is `collect_freshness`: https://github.com/dbt-labs/dbt-adapters/blob/6c41bedf27063eda64375845db6ce5f7535ef6aa/dbt/include/global_project/macros/adapters/freshness.sql#L4-L16
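For reference, a rough present-day approximation (not part of this proposal) might be overriding `collect_freshness` from the linked file in a project's `macros/` directory. This is a sketch only: the audit-table branch and all names in it are made up, and whether a root-project override of this internal macro is picked up may depend on the dbt version in use.

```sql
-- macros/collect_freshness.sql
-- Sketch modeled on the built-in macro linked above; illustrative names only.
{% macro collect_freshness(source, loaded_at_field, filter) %}
  {% call statement('collect_freshness', fetch_result=True, auto_begin=False) -%}
    {% if source.identifier == 'orders' %}
      -- hypothetical: read freshness for this one source from a side audit table
      select
        max(load_completed_at) as max_loaded_at,
        {{ current_timestamp() }} as snapshotted_at
      from etl_control.load_audit
      where table_name = 'orders'
    {% else %}
      -- default behaviour: max(loaded_at_field) against the source itself
      select
        max({{ loaded_at_field }}) as max_loaded_at,
        {{ current_timestamp() }} as snapshotted_at
      from {{ source }}
      {% if filter %}
      where {{ filter }}
      {% endif %}
    {% endif %}
  {% endcall %}
  {{ return(load_result('collect_freshness')) }}
{% endmacro %}
```

A first-class `loaded_at_query` on the source definition would avoid this kind of global, name-matching override.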
Describe alternatives you've considered
No response
Who will this benefit?
Folks using external tables as sources
Are you interested in contributing this feature?
No response
Anything else?
No response