[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query #8797
Comments
From @sp-tkerlavage
In conversation with @graciegoheen, we decided that we aren't currently aware of enough distinct use cases for custom SQL-defined freshness beyond supporting freshness of external tables in dbt-snowflake. For now, I'm closing this in favor of dbt-labs/dbt-snowflake#1061. If we come across more use cases in the future, we can address them as they come up. Perhaps this framing will be useful then.
I'm going to re-open as I've heard from a couple folks that they use side control tables to track updates and status of their main large tables. For example:
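A hypothetical sketch of that kind of control-table setup, assuming a proposed `loaded_at_query` property on the source table (the property name and placement, and the `etl_control.load_audit` table and its columns, are illustrative rather than anything dbt supports today):

```yaml
# Hypothetical syntax only: `loaded_at_query` is the property being proposed,
# not something dbt supports today, and the control table/columns are made up.
version: 2

sources:
  - name: warehouse
    tables:
      - name: orders
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        # freshness would come from this query instead of max(loaded_at_field)
        loaded_at_query: >
          select max(load_completed_at)
          from etl_control.load_audit
          where table_name = 'orders'
```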
+1 @graciegoheen, that's exactly what our org needs to be able to pull source freshness for some tables that are populated by an upstream team. For context, the jobs that team runs to populate those tables execute once a day, but they load the table in "micro loads" of 1-60 records every couple of seconds until the full load has completed. That can cause queries to return a partial dataset if you hit the table during the load job, so source freshness against the table itself will report that the data has been updated before the load has actually completed. An audit table (similar to what you posted) is available with metadata; if we could use that to determine the source freshness for the table with a query just like your example, we could provide our consumers with much more accurate information and use the freshness selectors to help protect people from these partial loads.
This could also be useful for surfacing partial errors in your data. In our case we get a lot of data from our IoT nodes. If there is an outage affecting a single customer, only that customer's data would fall behind; other customers' data would be fine, so the issue wouldn't trigger on source freshness. Downstream transformations would then build on the stale data from that one lagging customer and include invalid data. A custom query would allow us to check that everyone's data is fresh, not just that at least someone's data is fresh.
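A minimal sketch of the kind of custom freshness query described above (the source, table, and column names are made up): reporting the *least* fresh customer means a single lagging customer shows up as stale instead of being hidden behind everyone else's fresh data.

```sql
-- Illustrative only: assumes an IoT readings source with
-- `customer_id` and `loaded_at` columns.
select min(customer_max_loaded_at) as max_loaded_at
from (
    select
        customer_id,
        max(loaded_at) as customer_max_loaded_at
    from {{ source('iot', 'readings') }}
    group by customer_id
) as per_customer
```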
Opened a new issue in dbt-labs/docs.getdbt.com: dbt-labs/docs.getdbt.com#6695
Is this your first time submitting a feature request?
Describe the feature
We currently support 2 ways to generate source freshness (via warehouse metadata tables & via a `loaded_at_field`). We should support a 3rd way to generate source freshness - via a `loaded_at_query`.

- Option 1: freshness config added (get freshness from warehouse metadata tables)
- Option 2: freshness config added with `loaded_at_field` (get freshness from `select max(loaded_at_field) … from this …`)
- Option 3: freshness config added with "how does dbt consider a source to be fresh"? (get the equivalent of `max(loaded_at_field)` from executing a custom query)

The built-in macro is `collect_freshness`: https://github.com/dbt-labs/dbt-adapters/blob/6c41bedf27063eda64375845db6ce5f7535ef6aa/dbt/include/global_project/macros/adapters/freshness.sql#L4-L16
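For reference, a rough present-day approximation (not part of this proposal) might be overriding `collect_freshness` from the linked file in a project's `macros/` directory. This is a sketch only: the audit-table branch and all names in it are made up, and whether a root-project override of this internal macro is picked up may depend on the dbt version in use.

```sql
-- macros/collect_freshness.sql
-- Sketch modeled on the built-in macro linked above; illustrative names only.
{% macro collect_freshness(source, loaded_at_field, filter) %}
  {% call statement('collect_freshness', fetch_result=True, auto_begin=False) -%}
    {% if source.identifier == 'orders' %}
      -- hypothetical: read freshness for this one source from a side audit table
      select
        max(load_completed_at) as max_loaded_at,
        {{ current_timestamp() }} as snapshotted_at
      from etl_control.load_audit
      where table_name = 'orders'
    {% else %}
      -- default behaviour: max(loaded_at_field) against the source itself
      select
        max({{ loaded_at_field }}) as max_loaded_at,
        {{ current_timestamp() }} as snapshotted_at
      from {{ source }}
      {% if filter %}
      where {{ filter }}
      {% endif %}
    {% endif %}
  {% endcall %}
  {{ return(load_result('collect_freshness')) }}
{% endmacro %}
```

A first-class `loaded_at_query` on the source definition would avoid this kind of global, name-matching override.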
Describe alternatives you've considered
No response
Who will this benefit?
Folks using external tables as sources
Are you interested in contributing this feature?
No response
Anything else?
No response