-
I have a potentially controversial question. I know the "easier" approach is a lift and shift: define everything in a sources yml file. But why not a new node type that could be SQL or Python? As much as we don't like to say it in the ELT mindset, transformations do happen at the ingestion point (and this is ingestion). If we want to open up to support things like managed sources (in a way that makes sense and flags the difference between a dbt transformation model vs. a managed source, as opposed to how Databricks and Materialize had to do it), I think this needs to be a new node type. That's not to say we can't bring over the disciplines of a yml file (i.e. clear configurations), but a new node type gives the user more flexibility to, say, parse out a JSON blob or declare the sink. It would make the development experience a lot simpler (we all know debugging yml is not much fun). Another reason I want this to be a new node type is to denote the difference between a source that is managed by dbt and one that is not. As a project grows, easy understanding of dependencies is paramount.
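Purely as an illustrative sketch of what such a node could look like; none of this syntax exists today, and the materialization name, config keys, and the `external_location` placeholder below are all invented:

```sql
-- models/managed_sources/raw_orders.sql
-- hypothetical "managed source" node: declares the sink AND does light
-- parsing at the ingestion point
{{ config(
    materialized = 'external_table',          -- invented materialization name
    location     = 's3://my-bucket/orders/',  -- invented config key
    file_format  = 'json'                     -- invented config key
) }}

select
    value:order_id::int         as order_id,    -- parse fields out of the raw JSON blob
    value:ordered_at::timestamp as ordered_at
from external_location  -- invented placeholder for the files at `location`
```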
-
I think it's a good idea to bring the stable package into dbt Core.

Who benefits? Not all teams using dbt use external tables.

Usage patterns. I can think of two usage patterns for your idea: (1) running dbt jobs with a machine user on a schedule, and (2) a human user developing dbt locally or in the Cloud IDE.

(1) running dbt jobs with a machine user on a schedule. This is the most straightforward, I think. In production, in particular, the external tables should be linking to the production source data, unmasked, with the necessary read-permissions. It should Just Work and be simple to run.

(2) human user developing dbt locally or in Cloud IDE. This is where it begins to be less clear to me what should happen with external tables. Do developers have the ability to choose whether their external table is the production source or some dummy/development source? Do developers have the option to set, for example, a seed within the project in lieu of the production source? Perhaps power-user teams who understand and utilize dbt-external-tables today have solved this for themselves, but bringing this into dbt-core will reach a wider audience and perhaps need more semantics to handle their needs, especially at larger enterprises where data access is more controlled. Let me know if you agree here; my suggestion would be to address this case in the deliverable.
-
Posting here per a sidebar discussion with @dataders over on dbt Slack: I love the idea of making external tables part of dbt Core, since they can be such a powerful part of a data pipeline. However, I can actually imagine two different use cases on very different ends of the spectrum.
-
That's very appealing to me. I have no stake in an external tables discussion, as it's doubtful I'd use it, but I do have a lot of uses for non-transformed objects. Additional possible "What's Next" items that came up in dbt Slack today:
Most importantly, it could be of positive significance to Dave's need summarization: all nodes are equally important.
If it's a new command:
Side note: This concept could eventually grow to include finding SQL errors, at least regarding object existence, at times other than run time.
Yes. I think you made that case decisively, and it makes sense to my gut.

@amychen1776, you did stir up controversy in me. :) To your question of "why not a new node type?", my reactionary, highly critical gut is saying: could we please stop with new node types already? Why does everything have to be a new node type instead of a first-class node? Every time there's a new dbt node type, there's a new thing that isn't available in other node types that "should be" and is very reasonably expected to be.

That said @amychen1776, my brain is translating an overall summary of what you said into: "take the time to fully and thoughtfully integrate the concept of a dbt-managed vs. not-dbt-managed source object with that mental framing, instead of taking the external table framing and expanding it to more external object types and to non-transformed objects". I agree with that philosophically, but I'm not qualified to know the ROI/TCO on that or whether a new node type is the best way to accomplish it.

Yet, that said, I'm not sure if "dbt-managed" is the right frame either. I think you may be dancing around a concept of "dbt-acknowledged" or "dbt-usable" or "dbt-used", which is significant but different.

Overall, this proposal makes sense to me. If I were the one deciding whether to prioritize it for development, my questions would be:

TLDR: LGTM
-
I understand @amychen1776's question as: why should we prefer (1), the current state with dbt-external-tables, where external tables are defined in yaml as sources, over (2), where each external table is its own model-like node?
As long as all the compute is running in the DWH, the actually-running queries are basically the same. So really this is getting at a more philosophical question (which @alison985 picks up below): What's the essential difference between a source and a model? Is it the fact of being "dbt-managed" (materialized by dbt, that is, created/updated/replaced)? Or is it the fact of being "raw" (untransformed) versus bearing some transformation? Are we still as committed as ever to the idea that ingestion and transformation must be separate?
One of these things has got to give. This is the discussion from the very first issue of dbt-external-tables. If they're models, is there another, realer source behind them? I am very open to the idea that I got this wrong back then.

@boxysean and @emilyriederer are both getting at an important question here: how does one develop with external tables? If this thing is a model, any developer should be able to edit and run it in development. That's (intentionally) more difficult for the resources that are meant to be consistent across environments (dev/qa/prod).

I don't really buy the similarity between seeds and external tables. Seeds are defined fully within dbt, in version control, alongside the dbt project. The CSV format is really just an ergonomic improvement over writing insert statements.

So what's wrong with external tables as models? The thing I didn't like then, and still don't like now, is the idea that you could have a model with no associated transformation. I think I would rather have a dbt-materialized source than a transformation-less model. Either way, we're changing an assumption about how these have existed to date. Here's a syllogism:
None of the above feels outlandish to me, but all of it would require a heavier lift to implement.
-
I like the idea, but would love to see some clarification on the supported table formats, e.g. Iceberg, Hudi, Delta.
-
@alison985 @jtcohen6 I agree with what you have said here. Alison, what caught my eye was your point that sources are to this day still missing features. I'm of the opinion that by fully supporting this as a materialization/new node type, we will be able to solve that in the long term, giving it the flexibility we need to support all types of dbt-managed sources, like those incrementally loaded and streaming (where we never want to run again), with our standard model configurations. I want these managed sources to have the ability to take hooks and model governance configurations. This will require a much bigger lift, but I think it will keep us away from some of the tech debt we already have from the way things were implemented in the package. I should probably call out that in doing a lot of my best-practices whitepaper writing for Snowflake, Redshift, BQ, etc., I have gotten feedback that the external tables package is unpleasant to use because of yml. So my bias here is avoiding it so I can just write some good ole SQL, making it easier to debug than run-operation macros in the logs.
-
A lot of great ideas in this thread. An idea that @amychen1776 mentioned above that I'd like to flesh out a bit more, since it seems to have a lot of promise to me: the notion of "managed" versus "unmanaged" sources.

A "managed" source would be a source that dbt has some hand in preparing, ingesting, monitoring, or administering, e.g. an external table. An "unmanaged" source would be a source that just shows up, one that dbt has no hand in managing or monitoring. So, all other sources right now.

I can see other kinds of managed sources in the future. E.g., say dbt somehow has a way to communicate upstream with a streaming service about landing a table in Snowflake, BigQuery, et al., and it can turn the streaming on and off. Or it can periodically purge incoming streaming data from an append-only table. This does start to get into questions about idempotency and the boundaries of dbt, but, let's face it: as regards sources, a lot of people are already getting into these questions or we wouldn't be having these discussions.

The nice thing about doing this is that it lets us avoid having to add another node type and set of commands around it. We can just use the sources we already have (see the sketch below).

Hopefully, as part of all of this, we could make external table refreshes inherently parallel, which was in another ticket that got closed recently...
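A sketch of how that distinction could look in yaml, reusing the `external:` block that dbt-external-tables already understands; the table and stage names here are hypothetical:

```yaml
sources:
  - name: raw
    tables:
      # "managed": dbt has a hand in creating/refreshing this external table
      - name: orders
        external:
          location: "@raw.orders_stage"
          file_format: "( type = parquet )"
          auto_refresh: true
      # "unmanaged": just shows up; dbt only reads it (and checks freshness)
      - name: customers
```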
-
Reviving this thread after a great conversation over in the Slack community. One functionality I'd love to see is the ability to refresh the metadata on an external table without completely rebuilding it. So, essentially, some kind of flag that will send an `ALTER ... REFRESH` statement instead of recreating the table.

In my case, I'm using external tables as an alternative to managing my own ingestion pipeline.

That being said, I'm not opposed to a full refresh, especially if passing the `--full-refresh` flag is supported.
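On Snowflake, such a flag would presumably boil down to emitting something like this (table name hypothetical) instead of a `create or replace`:

```sql
-- re-scan the stage and update file-level metadata, without rebuilding the table
ALTER EXTERNAL TABLE raw.events REFRESH;
```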
-
Throwing in my support here. I hadn't used the `dbt-external-tables` package before.
-
Adding "managed sources", including external tables and their ancillary objects, to Core would be a very useful feature addition. Most of the time these tables are run once and then only again if the configuration changes, making them essentially IaC. Accordingly, the place we've run into the most friction with the external tables package is environment management. Many of the properties differ between development and stage/prod, which leads to a ton of Jinja in the yaml file (see the sketch below). It's especially difficult to separate active development of the external table and its dependencies from later usage. More than once we've ended up with federated copies of a massive Snowpipe when the developers were only working on downstream models. Ideally it would be possible to define the target schema and database similarly to snapshots.
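For example, the kind of Jinja-in-yaml this leads to today; a sketch with hypothetical names, using the package's current `external:` block:

```yaml
sources:
  - name: raw
    # per-environment database, templated inline
    database: "{{ 'ANALYTICS' if target.name == 'prod' else 'ANALYTICS_DEV' }}"
    tables:
      - name: orders
        external:
          # per-environment stage, templated inline
          location: "@raw.{{ target.name }}_orders_stage"
          file_format: "( type = parquet )"
```

A snapshot-style `target_database` / `target_schema` config would collapse most of this templating.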
-
This would be a very useful addition to Core. I would also recommend supporting Redshift federated queries. While the underlying mechanism of a federated query differs from your typical external table, the behavior of dbt not recognizing the relation is the same.
-
Any news on this topic? 🙏
-
sharing since @azdoherty is too humble :) i read Adam's post (thanks for that!), and we've now replaced the package with a pre-hook on all staging models that basically looks like the sketch below.
this assumes...
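A minimal sketch of what such a pre-hook can look like, assuming Snowflake external tables; the source and model names here are hypothetical:

```sql
-- models/staging/stg_orders.sql
-- refresh the external table's file metadata before reading from it
{{ config(
    pre_hook = "alter external table if exists {{ source('raw', 'orders') }} refresh"
) }}

select * from {{ source('raw', 'orders') }}
```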
-
what
users of dbt should be able to define external tables and have dbt handle their management/creation/updates.
for context, dbt-labs/dbt-core#5099 is a great place to start to understand the use case for dbt managing non-transformation-bearing objects such as external tables.
this feature goes back 4.5 years to #1318, which paved the way for this to be done with dbt-external-tables.
the purpose of this doc is to try and gain consensus on the end-state user interface, and to start considering how this will be implemented.
why
get_external_build_plan()
what will this look like?
answering the following user-interface questions should help guide and prioritize what refactoring we might like to do. I have some intuition on what I think the UI should look like, but I'll try to withhold it for the time being.
0) how are external tables like seeds? how are they different?
the existing feature of dbt Core with which external tables are most closely aligned is seeds, in terms of use case as well as ergonomics. I bring this up to acknowledge that:
1) with what dbt command should external tables be created/updated/deleted?
If a user with an external table defined in `sources.yaml` calls a bare `dbt run`, is the external table the first node that is executed? Or should it be assumed that the external table already exists?

If the external table is not created/refreshed/replaced during `dbt run`, then another command should be used. This could either remain `dbt run_operation stage_external_sources`, or perhaps merit a new command: `dbt stage`? `dbt provision`? `dbt setup`?

This is how seeds work today, and it's worth calling out that while seeds have been in dbt for a long time, they really aren't "transformation-holding" objects.
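For reference, the package's current entry points, per the dbt-external-tables README:

```shell
# create or refresh all external sources
dbt run-operation stage_external_sources

# drop and recreate them from scratch
dbt run-operation stage_external_sources --vars "ext_full_refresh: true"
```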
Regardless of how the external tables are created, I believe that this operation should certainly be:

- more ergonomic than a `dbt run-operation blah blah`
- included in `dbt build`.

2) refresh and idempotency
what happens when dbt is asked to "run" an external table node that already exists?
The dbt-external-tables package already implements a "full refresh" that aligns with that of `dbt seed`. However, the ergonomics offered via `run-operation` are not ideal (i.e. `--vars "ext_full_refresh: true"` instead of `--full-refresh`).

why `refresh`? why not drop and recreate?

From my research so far, there are two kinds of refreshes possible:
- `on_configuration_change` (i.e. I changed the `location` config of an external table in my `sources.yml` from what it was when I created the table initially)
- new files have landed at the `location` I've configured

Provided the underlying data platform supports it, I think both should be possible. Below is a summary of the kinds of alterations possible for commonly used data platforms.
external table alterations by data platform
Redshift
In Redshift you can alter an external table to update both the `location` and the partition configuration. Redshift allows specifying partitions in external tables for performance reasons. Refreshing an external table in Redshift applies to the scenario where new partitions have been added to `sources.yml` after the table was created.

I'm unclear as to the advantage that the `ALTER ... ADD PARTITIONS` clause (sketched below) has over just `DROP` & `REPLACE`.
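For concreteness, the statement in question looks something like this (schema, table, partition column, and path all hypothetical):

```sql
-- register a newly-landed partition with an existing external table
ALTER TABLE spectrum.sales
ADD IF NOT EXISTS PARTITION (saledate = '2008-01-01')
LOCATION 's3://my-bucket/sales/saledate=2008-01-01/';
```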
Snowflake
`REFRESH` in the context of a Snowflake external table means exhaustively enumerating and caching all the files that exist at the `LOCATION` path parameter provided originally to the `CREATE EXTERNAL TABLE` command. So unless there's magic behind the scenes so that the external table can be auto-updated to see new files, the external table must be "refreshed".

There are two related knobs to `REFRESH` that can be set when defining the external table, which both default to `TRUE`:

- `REFRESH_ON_CREATE`: if `FALSE`, the external table's definition will correspond to zero files and have no data
- `AUTO_REFRESH`: will refresh whenever the storage location tells Snowflake that there are new/modified files at `LOCATION` (but this doesn't work for an external STAGE)
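A sketch of where those knobs live in the DDL (stage and table names hypothetical):

```sql
CREATE OR REPLACE EXTERNAL TABLE raw.events
  LOCATION = @raw.events_stage/landing/
  FILE_FORMAT = (TYPE = PARQUET)
  REFRESH_ON_CREATE = TRUE  -- scan and register existing files at creation
  AUTO_REFRESH = TRUE;      -- pick up new files via cloud event notifications
```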
Spark

`REFRESH TABLE` is similar to Snowflake's. You're telling the table: "forget everything you remember about what files you're supposed to point to. Go look again, and have that be the truth from now on."
3) what should happen during a `dbt run` when a model `source()`'s an external table that does not (yet) exist in the DWH?

should `dbt run` fail in the same way as when a seed is referenced that has yet to be created? i.e. fail because the database tells dbt that it does not exist?

or, should dbt foresee this error and fail at compile time, telling the user to first create the external table?
Relatedly, dbt has no awareness of/context for external tables as objects. If you `source()` an external table that has not yet been created, you will get a database error rather than a dbt compilation error. The same goes for `dbt run` not having context on a source's freshness and whether it meets a required SLA.

4) should dbt manage objects ancillary to an external table's definition, like file formats and external stages?
currently, `dbt-external-tables` expects the following to already be created: the external stage and the file format (see the sketch below). Normally these things are infrequently created and modified, so it isn't a heavy lift. However, there's an opportunity to have attributes of the sources themselves be used to create their corresponding external stage. This would obviate the need for YAML anchoring to repeat parameters across multiple external tables.
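On Snowflake, for example, those prerequisites are created separately today (names, bucket, and integration hypothetical):

```sql
-- the external stage the table reads from
CREATE STAGE raw.events_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration;

-- a named file format that external tables can reference
CREATE FILE FORMAT raw.parquet_format TYPE = PARQUET;
```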
implementation
the "low-hanging fruit" option is to "lift & shift" dbt-external-tables' macros into dbt Core and the corresponding adapter repos. The end user experience, however, would remain the same. Porting the existing integration tests into dbt Core's pytest framework would also be a requirement of this work.
however, given the below, I think engineering investment is warranted
what next
broaden the scope to include more database objects that are used by models but aren't models themselves, i.e. external nodes (see Feature Request: External Nodes, dbt-core#5073).