-
I have a potentially controversial question. I know the "easier" approach is a lift and shift: define everything in a sources yml file. But why not a new node type that could be SQL or Python? As much as we don't like to say it in the ELT mindset, transformations do happen at the ingestion point (and this is ingestion). If we want to open up to support things like managed sources (in a way that makes sense and flags the difference between a dbt transformation model vs. a managed source, as opposed to how Databricks and Materialize had to do it), I think this needs to be a new node type. That's not to say we can't bring over the disciplines of a yml file (i.e. clear configurations), but a new node type gives the user more flexibility to, say, parse out a JSON blob or declare the sink. It would make the development experience a lot simpler (we all know debugging yml is not much fun). Another reason I want this to be a new node type is to denote the difference between a source that is managed by dbt and one that is not. As a project grows, easy understanding of dependencies is paramount.
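Purely as an illustrative sketch of what such a node could look like; none of this syntax exists today, and the materialization name, config keys, and the `external_location` placeholder below are all invented:

```sql
-- models/managed_sources/raw_orders.sql
-- hypothetical "managed source" node: declares the sink AND does light
-- parsing at the ingestion point
{{ config(
    materialized = 'external_table',          -- invented materialization name
    location     = 's3://my-bucket/orders/',  -- invented config key
    file_format  = 'json'                     -- invented config key
) }}

select
    value:order_id::int         as order_id,    -- parse fields out of the raw JSON blob
    value:ordered_at::timestamp as ordered_at
from external_location  -- invented placeholder for the files at `location`
```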
-
I think it's a good idea to bring the stable package into dbt Core.

Who benefits? Not all teams using dbt use external tables.

Usage patterns. I can think of two usage patterns for your idea: (1) running dbt jobs with a machine user on a schedule, and (2) a human user developing dbt locally or in the Cloud IDE.

(1) running dbt jobs with a machine user on a schedule. This is the most straightforward, I think. In production, in particular, the external tables should be linking to the production source data, unmasked, with the necessary read-permissions. It should Just Work and be simple to run.

(2) human user developing dbt locally or in Cloud IDE. This is where it begins to be less clear to me what should happen with external tables. Do developers have the ability to choose whether their external table is the production source or some dummy/development source? Do developers have the option to set, for example, a seed within the project in lieu of the production source? Perhaps power-user teams who understand and utilize dbt-external-tables today have solved this for themselves, but bringing this into dbt-core will reach a wider audience and perhaps need more semantics to handle their needs, especially at larger enterprises where data access is more controlled. Let me know if you agree here; my suggestion would be to address this case in the deliverable.
-
Posting here per a sidebar discussion with @dataders over on dbt Slack: I love the idea of making external tables part of dbt Core, since they can be such a powerful part of a data pipeline. However, I can actually imagine two different use cases on very different ends of the spectrum.
-
That's very appealing to me. I have no stake in an external tables discussion, as it's doubtful I'd use it, but I do have a lot of uses for non-transformed objects. Additional possible "What's Next" items that came up in dbt Slack today:
Most importantly, it could be of positive significance to Dave's need summarization: all nodes are equally important.
If it's a new command:
Side note: This concept could eventually grow to include finding SQL errors, at least regarding object existence, at times other than run time.
Yes. I think you made that case decisively, and it makes sense to my gut.

@amychen1776, you did stir up controversy in me. :) To your question of "why not a new node type?", my reactionary, highly critical gut is saying: could we please stop with new node types already? Why does everything have to be a new node type instead of a first-class node? Every time there's a new dbt node type, there's a new thing that isn't available in other node types that "should be" and is very reasonably expected to be.

That said @amychen1776, my brain is translating an overall summary of what you said into: "take the time to fully and thoughtfully integrate the concept of a dbt-managed vs. not-dbt-managed source object with that mental framing, instead of taking the external table framing and expanding it to more external object types and to non-transformed objects". I agree with that philosophically, but I'm not qualified to know the ROI/TCO on that or whether a new node type is the best way to accomplish it.

Yet, that said, I'm not sure if "dbt-managed" is the right frame either. I think you may be dancing around a concept of "dbt-acknowledged" or "dbt-usable" or "dbt-used", which is significant but different.

Overall, this proposal makes sense to me. If I were the one deciding whether to prioritize it for development, my questions would be:

TLDR: LGTM
-
I understand @amychen1776's question as: why should we prefer (1), the current state with dbt-external-tables, where external tables are defined in yaml as sources, over (2), where each external table is its own model-like node?
As long as all the compute is running in the DWH, the actually-running queries are basically the same. So really this is getting at a more philosophical question (which @alison985 picks up below): What's the essential difference between a source and a model? Is it the fact of being "dbt-managed" (materialized by dbt, that is, created/updated/replaced)? Or is it the fact of being "raw" (untransformed) versus bearing some transformation? Are we still as committed as ever to the idea that ingestion and transformation must be separate?
One of these things has got to give. This is the discussion from the very first issue of dbt-external-tables. If they're models, is there another, realer source behind them? I am very open to the idea that I got this wrong back then.

@boxysean and @emilyriederer are both getting at an important question here: how does one develop with external tables? If this thing is a model, any developer should be able to edit and run it in development. That's (intentionally) more difficult for the resources that are meant to be consistent across environments (dev/qa/prod).

I don't really buy the similarity between seeds and external tables. Seeds are defined fully within dbt, in version control, alongside the dbt project. The CSV format is really just an ergonomic improvement over writing insert statements.

So what's wrong with external tables as models? The thing I didn't like then, and still don't like now, is the idea that you could have a model with no associated transformation. I think I would rather have a dbt-materialized source than a transformation-less model. Either way, we're changing an assumption about how these have existed to date. Here's a syllogism:
None of the above feels outlandish to me, but all of it would require a heavier lift to implement.
-
I like the idea, but would love to see some clarification on the supported table formats, e.g. Iceberg, Hudi, Delta.
-
@alison985 @jtcohen6 I agree with what you have said here. Alison, what caught my eye was your point that sources are to this day still missing features. I'm of the opinion that by fully supporting this as a materialization/new node type, we will be able to solve that in the long term, giving it the flexibility we need to support all types of dbt-managed sources, like those incrementally loaded and streaming (where we never want to run again), with our standard model configurations. I want these managed sources to have the ability to take hooks and model governance configurations. This will require a much bigger lift, but I think it will keep us away from some of the tech debt we already have from the way things were implemented in the package. I should probably call out that in doing a lot of my best-practices whitepaper writing for Snowflake, Redshift, BQ, etc., I have gotten feedback that the external tables package is unpleasant to use because of yml. So my bias here is avoiding it so I can just write some good ole SQL, making it easier to debug than run-operation macros in the logs.
-
A lot of great ideas in this thread. An idea that @amychen1776 mentioned above that I'd like to flesh out a bit more, since it seems to have a lot of promise to me: the notion of "managed" versus "unmanaged" sources.

A "managed" source would be a source that dbt has some hand in preparing, ingesting, monitoring, or administering, e.g. an external table. An "unmanaged" source would be a source that just shows up, one that dbt has no hand in managing or monitoring. So, all other sources right now.

I can see other kinds of managed sources in the future. E.g., say dbt somehow has a way to communicate upstream with a streaming service about landing a table in Snowflake, BigQuery, et al., and it can turn the streaming on and off. Or it can periodically purge incoming streaming data from an append-only table. This does start to get into questions about idempotency and the boundaries of dbt, but, let's face it: as regards sources, a lot of people are already getting into these questions or we wouldn't be having these discussions.

The nice thing about doing this is that it lets us avoid having to add another node type and set of commands around it. We can just use the sources we already have (see the sketch below).

Hopefully, as part of all of this, we could make external table refreshes inherently parallel, which was in another ticket that got closed recently...
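A sketch of how that distinction could look in yaml, reusing the `external:` block that dbt-external-tables already understands; the table and stage names here are hypothetical:

```yaml
sources:
  - name: raw
    tables:
      # "managed": dbt has a hand in creating/refreshing this external table
      - name: orders
        external:
          location: "@raw.orders_stage"
          file_format: "( type = parquet )"
          auto_refresh: true
      # "unmanaged": just shows up; dbt only reads it (and checks freshness)
      - name: customers
```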
-
Reviving this thread after a great conversation over in the Slack community. One functionality I'd love to see is the ability to refresh the metadata on an external table without completely rebuilding it. So, essentially, some kind of flag that will send an `ALTER ... REFRESH` statement instead of recreating the table.

In my case, I'm using external tables as an alternative to managing my own ingestion pipeline.

That being said, I'm not opposed to a full refresh, especially if passing the `--full-refresh` flag is supported.
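On Snowflake, such a flag would presumably boil down to emitting something like this (table name hypothetical) instead of a `create or replace`:

```sql
-- re-scan the stage and update file-level metadata, without rebuilding the table
ALTER EXTERNAL TABLE raw.events REFRESH;
```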
-
Throwing in my support here. I hadn't used the `dbt-external-tables` package before.
-
Adding "managed sources", including external tables and their ancillary objects, to Core would be a very useful feature addition. Most of the time these tables are run once and then only again if the configuration changes, making them essentially IaC. Accordingly, the place we've run into the most friction with the external tables package is environment management. Many of the properties differ between development and stage/prod, which leads to a ton of Jinja in the yaml file (see the sketch below). It's especially difficult to separate active development of the external table and its dependencies from later usage. More than once we've ended up with federated copies of a massive Snowpipe when the developers were only working on downstream models. Ideally it would be possible to define the target schema and database similarly to snapshots.
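For example, the kind of Jinja-in-yaml this leads to today; a sketch with hypothetical names, using the package's current `external:` block:

```yaml
sources:
  - name: raw
    # per-environment database, templated inline
    database: "{{ 'ANALYTICS' if target.name == 'prod' else 'ANALYTICS_DEV' }}"
    tables:
      - name: orders
        external:
          # per-environment stage, templated inline
          location: "@raw.{{ target.name }}_orders_stage"
          file_format: "( type = parquet )"
```

A snapshot-style `target_database` / `target_schema` config would collapse most of this templating.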
-
This would be a very useful addition to Core. I would also recommend supporting Redshift federated queries. While the underlying mechanism of a federated query differs from your typical external table, the behavior of dbt not recognizing the relation is the same.
-
Any news on this topic? 🙏
-
sharing since @azdoherty is too humble :) i read Adam's post (thanks for that!), and we've now replaced the package with a pre-hook on all staging models that basically looks like the sketch below.
this assumes...
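A minimal sketch of what such a pre-hook can look like, assuming Snowflake external tables; the source and model names here are hypothetical:

```sql
-- models/staging/stg_orders.sql
-- refresh the external table's file metadata before reading from it
{{ config(
    pre_hook = "alter external table if exists {{ source('raw', 'orders') }} refresh"
) }}

select * from {{ source('raw', 'orders') }}
```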
-
what
users of dbt should be able to define external tables and have dbt handle their management/creation/updates.
for context, dbt-labs/dbt-core#5099 is a great place to start to understand the use case for dbt managing non-transformation-bearing objects such as external tables.
this feature goes back 4.5 years to #1318, which paved the way for this to be done with dbt-external-tables.
the purpose of this doc is to try and gain consensus on the end-state user interface, and to start considering how this will be implemented.
why
get_external_build_plan()
what will this look like?
answering the following user-interface questions should help guide and prioritize what refactoring we might like to do. I have some intuition on what I think the UI should look like, but I'll try to withhold it for the time being.
0) how are external tables like seeds? how are they different?
the existing feature of dbt Core with which external tables are most closely aligned is seeds, in terms of use case as well as ergonomics. I bring this up to acknowledge that:
1) with what dbt command should external tables be created/updated/deleted?
If a user with an external table defined in `sources.yaml` calls a bare `dbt run`, is the external table the first node that is executed? Or should it be assumed that the external table already exists?

If the external table is not created/refreshed/replaced during `dbt run`, then another command should be used. This could either remain `dbt run_operation stage_external_sources`, or perhaps merit a new command: `dbt stage`? `dbt provision`? `dbt setup`?

This is how seeds work today, and it's worth calling out that while seeds have been in dbt for a long time, they really aren't "transformation-holding" objects.
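For reference, the package's current entry points, per the dbt-external-tables README:

```shell
# create or refresh all external sources
dbt run-operation stage_external_sources

# drop and recreate them from scratch
dbt run-operation stage_external_sources --vars "ext_full_refresh: true"
```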
Regardless of how the external tables are created, I believe that this operation should certainly be:

- more ergonomic than a `dbt run-operation blah blah`
- included in `dbt build`.

2) refresh and idempotency
what happens when dbt is asked to "run" an external table node that already exists?
The dbt-external-tables package already implements a "full refresh" that aligns with that of `dbt seed`. However, the ergonomics offered via `run-operation` are not ideal (i.e. `--vars "ext_full_refresh: true"` instead of `--full-refresh`).

why `refresh`? why not drop and recreate?

From my research so far, there are two kinds of refreshes possible:
- `on_configuration_change` (i.e. I changed the `location` config of an external table in my `sources.yml` from what it was when I created the table initially)
- new files have landed at the `location` I've configured

Provided the underlying data platform supports it, I think both should be possible. Below is a summary of the kinds of alterations possible for commonly used data platforms.
external table alterations by data platform
Redshift
In Redshift you can alter an external table to update both the `location` and the partition configuration. Redshift allows specifying partitions in external tables for performance reasons. Refreshing an external table in Redshift applies to the scenario where new partitions have been added to `sources.yml` after the table was created.

I'm unclear as to the advantage that the `ALTER ... ADD PARTITIONS` clause (sketched below) has over just `DROP` & `REPLACE`.
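For concreteness, the statement in question looks something like this (schema, table, partition column, and path all hypothetical):

```sql
-- register a newly-landed partition with an existing external table
ALTER TABLE spectrum.sales
ADD IF NOT EXISTS PARTITION (saledate = '2008-01-01')
LOCATION 's3://my-bucket/sales/saledate=2008-01-01/';
```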
Snowflake
`REFRESH` in the context of a Snowflake external table means exhaustively enumerating and caching all the files that exist at the `LOCATION` path parameter provided originally to the `CREATE EXTERNAL TABLE` command. So unless there's magic behind the scenes so that the external table can be auto-updated to see new files, the external table must be "refreshed".

There are two related knobs to `REFRESH` that can be set when defining the external table, which both default to `TRUE`:

- `REFRESH_ON_CREATE`: if `FALSE`, the external table's definition will correspond to zero files and have no data
- `AUTO_REFRESH`: will refresh whenever the storage location tells Snowflake that there are new/modified files at `LOCATION` (but this doesn't work for an external STAGE)
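A sketch of where those knobs live in the DDL (stage and table names hypothetical):

```sql
CREATE OR REPLACE EXTERNAL TABLE raw.events
  LOCATION = @raw.events_stage/landing/
  FILE_FORMAT = (TYPE = PARQUET)
  REFRESH_ON_CREATE = TRUE  -- scan and register existing files at creation
  AUTO_REFRESH = TRUE;      -- pick up new files via cloud event notifications
```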
Spark

`REFRESH TABLE` is similar to Snowflake's. You're telling the table: "forget everything you remember about what files you're supposed to point to. Go look again, and have that be the truth from now on."
3) what should happen during a `dbt run` when a model `source()`'s an external table that does not (yet) exist in the DWH?

should `dbt run` fail in the same way as when a seed is referenced that has yet to be created? i.e. fail because the database tells dbt that it does not exist?

or, should dbt foresee this error and fail at compile time, telling the user to first create the external table?
Relatedly, dbt has no awareness of/context for external tables as objects. If you `source()` an external table that has not yet been created, you will get a database error rather than a dbt compilation error. The same goes for `dbt run` not having context on a source's freshness and whether it meets a required SLA.

4) should dbt manage objects ancillary to an external table's definition, like file formats and external stages?
currently, `dbt-external-tables` expects the following to already be created: the external stage and the file format (see the sketch below). Normally these things are infrequently created and modified, so it isn't a heavy lift. However, there's an opportunity to have attributes of the sources themselves be used to create their corresponding external stage. This would obviate the need for YAML anchoring to repeat parameters across multiple external tables.
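On Snowflake, for example, those prerequisites are created separately today (names, bucket, and integration hypothetical):

```sql
-- the external stage the table reads from
CREATE STAGE raw.events_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration;

-- a named file format that external tables can reference
CREATE FILE FORMAT raw.parquet_format TYPE = PARQUET;
```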
implementation
the "low-hanging fruit" option is to "lift & shift" dbt-external-tables' macros into dbt Core and the corresponding adapter repos. The end user experience, however, would remain the same. Porting the existing integration tests into dbt Core's pytest framework would also be a requirement of this work.
however, given the below, I think engineering investment is warranted
what next
broaden the scope to include more database objects that are used by models but aren't models themselves, i.e. external nodes (see Feature Request: External Nodes, dbt-core#5073).