[ADAP-910] Support Metadata Freshness #785

peterallenwebb · 2023-09-22T22:15:56Z

Describe the feature

Support metadata-based freshness by implementing the new macro and feature flag described in #8704.

Who will this benefit?

Everyone who wants faster freshness results from Snowflake!

sp-tkerlavage · 2023-09-27T13:07:24Z

Are external tables being tracked as part of this implementation?

I think for external tables the information_schema.tables.last_altered column is the time of the last refresh (or last ALTER), not the last time a file came in.

So for example if an external table has stale data, an ALTER TABLE ... REFRESH is performed but no new files are added (because the pipeline that is dropping fiels into the S3 bucket is broken for example), I think the last_altered column will have the timestamp of when the ALTER TABLE ... REFRESH that was performed. If this will cover external tables and my understanding of the metadata is correct, then this might lead to the dbt metadata reflecting that the external table is fresher than it actually is thus obfuscating an issue (a pipeline being broken)

The directory (if enabled) of the external stage would have the list of files and when they were added to the stage. So pretty much just wondering if this feature will be a behind the scenes query of the information_schema, or if freshness source can be a user defined query like originally outlined here dbt-labs/dbt-core#7012

peterallenwebb · 2023-10-08T23:48:42Z

@sp-tkerlavage Thanks for your insight on this. Our current intention with Snowflake is to implement the metadata option using LAST_ALTERED, but I'll investigate what you are saying about external tables before we complete the feature. In general, the metadata maintained by warehouses/DBMSes comes with a wide array of warnings and caveats, and we're going to have to figure out the right balance between allowing users the convenience of relying on it and warning them about the potential drawbacks.

graciegoheen · 2023-10-09T20:22:22Z

Hi @sp-tkerlavage thanks for raising this! This won't be in-scope for 1.7, but I've opened a new feature request to track this and added it to our Applied State epic.

Would you be able to add an example query you might use to determine source freshness, so we make sure we solve your use case?

dbt-labs/dbt-core#8797

sp-tkerlavage · 2023-10-10T13:51:37Z

In Snowflake an external stage can have a directory which essentially stores a list of files in the external storage (S3 Bucket, etc) along with other useful information like relative_path, url, size, MD5, and last_modified.
https://docs.snowflake.com/en/user-guide/data-load-dirtables

This directory table can be automatically refreshed when files are added via SQS
https://docs.snowflake.com/en/user-guide/data-load-dirtables-auto

The directory is not a standalone object, its just a layer on top of a stage. It can be queried like this:

SELECT relative_path, 
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@database.schema.stage_name' )

However, you would have to parse the relative path to extract the name of the source table. So if for example you have a bucket that is partitioned like this:
<bucket>/source/source_table/yyyy/mm/dd/hh
You would have to do something like this

SELECT relative_path, 
    SPLIT_PART(relative_path, '/', 1) source_database,
    SPLIT_PART(relative_path, '/', 2) source_table,
    SPLIT_PART(relative_path, '/', 3) year
    ...
    last_modified, 
    CONVERT_TIMEZONE('UTC', last_modified) last_modified_utc
FROM DIRECTORY( '@prod_cdc.meta.dms' )

You could then just group by the source_database and source_table to get the max last_modified for each and do the freshness comparison.

Obviously different S3 directory structures would require different parsing.

I actually have an incremental model for this directory parsing logic in my own project, so that may be a cleaner approach. However, I'm not sure if ref'ing a model in a freshness test is even possible. And even if it is, the model would have to be ran first. But again, I'm not entirely clear on the parsing order so it may not even be possible to ref a model in a freshness test on the face of it.

tjirab · 2024-01-19T12:49:16Z

Minor concern: LAST_ALTERED gets updated upon DML & DDL changes, and maintenance ops. It does not necessarily mean fresh data has become available.

mikealfare · 2024-01-19T18:11:16Z

Minor concern: LAST_ALTERED gets updated upon DML & DDL changes, and maintenance ops. It does not necessarily mean fresh data has become available.

Ah, good to know. @Fleid, in the context of dbt-managed objects, how large of an issue do you see this as? Is this something worth addressing with some level of urgency?

mikealfare · 2024-02-06T21:35:48Z

Minor concern: LAST_ALTERED gets updated upon DML & DDL changes, and maintenance ops. It does not necessarily mean fresh data has become available.

This bug has been opened as a separate issue: #899. This issue will remain closed and conversation on it may not necessarily be seen due to status. Please move any new conversation over to the new issue.

peterallenwebb added enhancement New feature or request triage labels Sep 22, 2023

github-actions bot changed the title ~~Support Metadata Freshness~~ [ADAP-910] Support Metadata Freshness Sep 22, 2023

peterallenwebb mentioned this issue Sep 22, 2023

[CT-2144] [Feature] Allow source freshness to be evaluated from table metadata when adapter supports it dbt-labs/dbt-core#7012

Closed

3 tasks

dbeatty10 removed the triage label Sep 24, 2023

graciegoheen assigned mikealfare Sep 26, 2023

peterallenwebb mentioned this issue Oct 9, 2023

Implementation of metadata-based freshness #796

Merged

4 tasks

graciegoheen mentioned this issue Oct 9, 2023

[CT-3194] [Feature] add another way to calculate source freshness - via a SQL query dbt-labs/dbt-core#8797

Closed

3 tasks

peterallenwebb closed this as completed in #796 Oct 12, 2023

dbeatty10 mentioned this issue Dec 8, 2023

loaded_at_field not required for source freshness checks as-of 1.7 dbt-labs/docs.getdbt.com#4615

Closed

1 task

mikealfare mentioned this issue Feb 6, 2024

[Bug] Source metadata freshness is using the wrong field to calculate last updated #899

Closed

2 tasks

graciegoheen mentioned this issue Mar 27, 2024

Improvements to applied state metadata dbt-labs/docs.getdbt.com#5089

Merged

3 tasks

dataders mentioned this issue May 23, 2024

[Feature] Allow snowflake__get_relation_last_modified to references models or sources #1054

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ADAP-910] Support Metadata Freshness #785

[ADAP-910] Support Metadata Freshness #785

peterallenwebb commented Sep 22, 2023 •

edited

Loading

sp-tkerlavage commented Sep 27, 2023 •

edited

Loading

peterallenwebb commented Oct 8, 2023

graciegoheen commented Oct 9, 2023 •

edited

Loading

sp-tkerlavage commented Oct 10, 2023 •

edited

Loading

tjirab commented Jan 19, 2024 •

edited

Loading

mikealfare commented Jan 19, 2024

mikealfare commented Feb 6, 2024

[ADAP-910] Support Metadata Freshness #785

[ADAP-910] Support Metadata Freshness #785

Comments

peterallenwebb commented Sep 22, 2023 • edited Loading

Describe the feature

Who will this benefit?

sp-tkerlavage commented Sep 27, 2023 • edited Loading

peterallenwebb commented Oct 8, 2023

graciegoheen commented Oct 9, 2023 • edited Loading

sp-tkerlavage commented Oct 10, 2023 • edited Loading

tjirab commented Jan 19, 2024 • edited Loading

mikealfare commented Jan 19, 2024

mikealfare commented Feb 6, 2024

peterallenwebb commented Sep 22, 2023 •

edited

Loading

sp-tkerlavage commented Sep 27, 2023 •

edited

Loading

graciegoheen commented Oct 9, 2023 •

edited

Loading

sp-tkerlavage commented Oct 10, 2023 •

edited

Loading

tjirab commented Jan 19, 2024 •

edited

Loading