-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) #613
Conversation
👷 Deploy Preview for amazing-pothos-a3bca0 processing.
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #613 +/- ##
==========================================
+ Coverage 93.21% 93.25% +0.03%
==========================================
Files 53 53
Lines 2079 2091 +12
==========================================
+ Hits 1938 1950 +12
Misses 141 141
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @david-mag !
Thank you for bringing up the issue and promptly writing a PR to fix it.
I find this a very interesting problem, and I agree that:
- We must solve the problem of tests that depend on multiple dbt nodes, without raising an error
- We should allow users to customize tests
--indirect-selection
flag
What I'm not convinced is:
a) That we should set --indirect-selection=buildable
by default.
b) That using --indirect-selection=buildable
would give us an accurate DAG representation of what is going on.
Given the problem you described:
with a simple DAG like this model1 -> model2, and a test rc_test set up to test row-count being equal between model1 and model2
Which would make sense from a DAG topology perspective?
i) To have a single Airflow test task depending on both models' Airflow run both model tasks.
ii) To have two Airflow test nodes - one running for each model run task - but one may not be running all the tests for that given model, but it will still be considered a successful Airflow task, even if that was the only test being run for that specific model.
iii) To have three Airflow test nodes - one running tests that are exclusive to model1 (we could continue using the TaskGroup approach we have, in this sense), other running tests exclusive to model2, and a third Airflow task running test that depends on both models.
With the solution in the current PR, we'd have (ii), but I believe that (iii) is the most accurate approach to solving the problem.
Regarding having (a) as default, my main concern is that this is not the default behaviour in dbt itself - and I believe there is a good reason. It may hide underlying issues.
To implement (a), we could expose the indirect_selection
flag in a similar way to how we're handling full_refresh
:
https://github.com/astronomer/astronomer-cosmos/blob/main/cosmos/operators/local.py#L112
I would love to hear your thoughts on these points before we agree on a way to move forward.
I guess there is a discussion to be had on the DAG topology, and (iii) would make the most sense from that perspective for sure. However, the way I see it, this would put another layer of complexity to the DAG/graph creation that I'm not sure is warranted atm (maybe it will be in the future). As an example: in the As for b): in the rather complicated dbt project I'm working with, when running As to why dbt uses That being said, for me the cosmos default of |
I agree with @tatiana as well that we should not make the buildable as default since not many of us are familiar with the indirect selection and if the default mode is diff from the dbt trst default then it will cause more confusion. But i do agree with adding a flag to state what kind of test you want to run for model1 and for model2. |
Honestly, I believe for ppl that run into the same problem I did, it will be more confusing not to have It took me a minute to really figure out what was going on. For ppl just switching to cosmos, encountering an error for a test that never failed before, will reduce their trust in cosmos itself. I think the way airflow forces us to run model by model really makes defaulting to |
This is a great discussion! @tatiana and I talked and I think we have a plan:
@david-mag thank you for opening this PR so quickly with such a great description and justification! The only change I'd make is instead of manually setting the mode to buildable, can we make this a user-specified, optional parameter? For current use cases, then, nothing would change; but you could modify your DAG to include: ...,
indirect_selection="buildable", Do you also mind throwing together a quick docs page in the configuration section? Either in the testing behavior page or an independent page, but it doesn't need to be anything fancy - can just link out to the dbt docs page you linked in the PR. I'm also happy to do this part if you'd prefer. Great work! |
Yeah, I can expose the setting. Where do you think would be the best place? In the |
Hmm I'd start with |
…-cosmos into buildable_dbt_tests
@jlaneve I've updated the PR, but I think I'll take you up on your offer on updating the docs ^^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, @david-mag , thank you very much for addressing the feedback so quickly!
Since this is a bug fix, that was blocking you from using Cosmos, we'll release this as part of 1.2.1.
We'll add the necessary docs. We'll merge once the tests pass.
@david-mag it seems some checks are failing, could you fix them, please?
https://github.com/astronomer/astronomer-cosmos/actions/runs/6579169038/job/17893989074?pr=613 |
Will have a look. I didn't get integration tests working locally yet though, is there an easy setup for them? |
integration test should be fixed now, wasn't able to confirm it locally though. |
@david-mag all tests are passing! 🎉 Here is an example of how the test could be done: astronomer-cosmos/tests/operators/test_local.py Lines 330 to 347 in b7381c8
We're aiming to have the 1.2.1 release with this change either today or tomorrow |
just did :), I hope that covers this ^^ |
…ect-selection buildable) (#613) When running tests with dbt, dbt allows users to run them in different [modes](https://docs.getdbt.com/reference/node-selection/test-selection-examples#indirect-selection). As described in the linked dbt documentation: > The "buildable", "cautious", and "empty" modes can be useful in environments when you're only building a subset of your DAG, and you want to avoid test failures in "eager" mode caused by unbuilt resources. Exactly this is the case when using cosmos when running tests with `AFTER_EACH` behaviour. When a tests depends on multiple models, a test can result in an error, because the dependent model was not run yet. Example: with a simple DAG like this `model1 -> model2`, and a test `rc_test` set up to test row-count being equal between `model1` and `model2` an error might occur when we run operations in the following order: 1. `dbt run --model model1` 2. `dbt test --model model1` -> this will result in an error, if the row count has changed between the last and this run. adding the argument `--indirect-selection buildable` in step 2 will avoid actually executing the test there, because dbt makes sure to run the test only if *all* dependent models are upstream. So we can continue without an error: 3. `dbt run --model model2` 4. `dbt test --model model2 --indirect-selection buildable` -> this will actually run the test `rc_test` in this step. By adding `--indirect-selection buildable` we have avoided running the test `rc_test` twice (in both the `model1` DagGroup and `model2` DagGroup), running it only once in `model2`. (cherry picked from commit 1bf3a3e)
Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 Others * Update contributing guide docs by @raffifu in #591 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621
Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 Others * Update contributing guide docs by @raffifu in #591 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621 (cherry picked from commit 52c34a2)
Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 * Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625 * Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626 Others * Update contributing guide docs by @raffifu in #591 * Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621
Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 * Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625 * Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626 Others * Update contributing guide docs by @raffifu in #591 * Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621 (cherry picked from commit 635fb7b)
Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 * Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625 * Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626 Others * Update contributing guide docs by @raffifu in #591 * Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621 (cherry picked from commit 635fb7b)
@david-mag We just published the release 1.2.1: |
Description
When running tests with dbt, dbt allows users to run them in different modes. As described in the linked dbt documentation:
Exactly this is the case when using cosmos when running tests with
AFTER_EACH
behaviour. When a tests depends on multiple models, a test can result in an error, because the dependent model was not run yet.Example:
with a simple DAG like this
model1 -> model2
, and a testrc_test
set up to test row-count being equal betweenmodel1
andmodel2
an error might occur when we run operations in the following order:dbt run --model model1
dbt test --model model1
-> this will result in an error, if the row count has changed between the last and this run.adding the argument
--indirect-selection buildable
in step 2 will avoid actually executing the test there, because dbt makes sure to run the test only if all dependent models are upstream. So we can continue without an error:dbt run --model model2
dbt test --model model2 --indirect-selection buildable
-> this will actually run the testrc_test
in this step.By adding
--indirect-selection buildable
we have avoided running the testrc_test
twice (in both themodel1
DagGroup andmodel2
DagGroup), running it only once inmodel2
.Checklist