Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) #613

david-mag · 2023-10-18T22:25:47Z

Description

When running tests with dbt, dbt allows users to run them in different modes. As described in the linked dbt documentation:

The "buildable", "cautious", and "empty" modes can be useful in environments when you're only building a subset of your DAG, and you want to avoid test failures in "eager" mode caused by unbuilt resources.

Exactly this is the case when using cosmos when running tests with AFTER_EACH behaviour. When a tests depends on multiple models, a test can result in an error, because the dependent model was not run yet.

Example:
with a simple DAG like this model1 -> model2, and a test rc_test set up to test row-count being equal between model1 and model2 an error might occur when we run operations in the following order:

dbt run --model model1
dbt test --model model1 -> this will result in an error, if the row count has changed between the last and this run.

adding the argument --indirect-selection buildable in step 2 will avoid actually executing the test there, because dbt makes sure to run the test only if all dependent models are upstream. So we can continue without an error:

dbt run --model model2
dbt test --model model2 --indirect-selection buildable -> this will actually run the test rc_test in this step.

By adding --indirect-selection buildable we have avoided running the test rc_test twice (in both the model1 DagGroup and model2 DagGroup), running it only once in model2.

Checklist

I have made corresponding changes to the documentation (if required)
I have added tests that prove my fix is effective or that my feature works
I have updated a test that was affected by this change.

netlify · 2023-10-18T22:25:53Z

👷 Deploy Preview for amazing-pothos-a3bca0 processing.

Name	Link
🔨 Latest commit	`1a187f6`
🔍 Latest deploy log	https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/6536419fa146020008301e63

codecov · 2023-10-19T10:04:03Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c82e6bc) 93.21% compared to head (1a187f6) 93.25%.
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #613      +/-   ##
==========================================
+ Coverage   93.21%   93.25%   +0.03%     
==========================================
  Files          53       53              
  Lines        2079     2091      +12     
==========================================
+ Hits         1938     1950      +12     
  Misses        141      141

Files	Coverage Δ
cosmos/airflow/graph.py	`100.00% <100.00%> (ø)`
cosmos/config.py	`93.81% <100.00%> (+0.06%)`	⬆️
cosmos/constants.py	`100.00% <100.00%> (ø)`
cosmos/converter.py	`97.14% <100.00%> (+0.04%)`	⬆️
cosmos/operators/base.py	`100.00% <100.00%> (ø)`

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

tatiana

Hi @david-mag !

Thank you for bringing up the issue and promptly writing a PR to fix it.

I find this a very interesting problem, and I agree that:

We must solve the problem of tests that depend on multiple dbt nodes, without raising an error
We should allow users to customize tests --indirect-selection flag

What I'm not convinced is:
a) That we should set --indirect-selection=buildable by default.
b) That using --indirect-selection=buildable would give us an accurate DAG representation of what is going on.

Given the problem you described:
with a simple DAG like this model1 -> model2, and a test rc_test set up to test row-count being equal between model1 and model2

Which would make sense from a DAG topology perspective?
i) To have a single Airflow test task depending on both models' Airflow run both model tasks.
ii) To have two Airflow test nodes - one running for each model run task - but one may not be running all the tests for that given model, but it will still be considered a successful Airflow task, even if that was the only test being run for that specific model.
iii) To have three Airflow test nodes - one running tests that are exclusive to model1 (we could continue using the TaskGroup approach we have, in this sense), other running tests exclusive to model2, and a third Airflow task running test that depends on both models.

With the solution in the current PR, we'd have (ii), but I believe that (iii) is the most accurate approach to solving the problem.

Regarding having (a) as default, my main concern is that this is not the default behaviour in dbt itself - and I believe there is a good reason. It may hide underlying issues.

To implement (a), we could expose the indirect_selection flag in a similar way to how we're handling full_refresh:
https://github.com/astronomer/astronomer-cosmos/blob/main/cosmos/operators/local.py#L112

I would love to hear your thoughts on these points before we agree on a way to move forward.

david-mag · 2023-10-19T11:11:48Z

I guess there is a discussion to be had on the DAG topology, and (iii) would make the most sense from that perspective for sure.

However, the way I see it, this would put another layer of complexity to the DAG/graph creation that I'm not sure is warranted atm (maybe it will be in the future). As an example: in the manifest.yaml file, tests are not tagged. When you run dbt test --select/exclude tag:some_tag, dbt will figure out which tests are connected to which models with the relevant tags and run those. That is something that cosmos would then have to "recreate", which will most likely need several iterations to get completely right. Relying on dbt itself is imho the safest and probably also a future-proof solution.

As for b): in the rather complicated dbt project I'm working with, when running dbt build on the entire model (which is basically what we would do running a DAG from start to finish), the test in question is run only once and after all dependent models are finished. Thus using --indirect-selection=buildable is actually doing what dbt does. As for the DAG representation, we of course will end up with some (probably rare) cases where the test part of a DagGroup would do basically nothing (which is a "misrepresentation" to some degree), but I believe that is a minor issue. Given the fact that it seems nobody else ran into this issue before me (at least nobody that mentioned it), most ppl run tests that don't depend on multiple models. For those tests --indirect-selection=buildable would change nothing.

As to why dbt uses --indirect-selection=eager by default, I'm unsure. I can only guess, that their "default" interaction with dbt is assumed to be running the entire model, or downstream (--select model+), for which --indirect-selection=buildable and --indirect-selection=eager should by the same.

That being said, for me the cosmos default of --indirect-selection=buildable for tests makes sense. Exposing this flag to be changed by the user would obviously not hurt. I just implemented it statically to get the change out there, since certain DAGs will be broken the way we build them atm. If we want to this from the start, or further down the line, of course, is up to you guys to decide ^^

jensenity · 2023-10-19T11:19:00Z

I agree with @tatiana as well that we should not make the buildable as default since not many of us are familiar with the indirect selection and if the default mode is diff from the dbt trst default then it will cause more confusion. But i do agree with adding a flag to state what kind of test you want to run for model1 and for model2.

david-mag · 2023-10-19T11:52:56Z

Honestly, I believe for ppl that run into the same problem I did, it will be more confusing not to have buildable as default. If you are used to working purely with dbt, where you run DAGs very differently than we are forced to do within airflow (each model runs separately), tests would never be run twice.

It took me a minute to really figure out what was going on. For ppl just switching to cosmos, encountering an error for a test that never failed before, will reduce their trust in cosmos itself.

I think the way airflow forces us to run model by model really makes defaulting to buildable kind of a requirement here, even if we allow ppl to choose. (Again, ppl that run tests without any interactions between models won't see a difference, and if they are curious about this --indirect-selection=buildable thingy they see in the airflow logs if they look very carefully, a quick google search will tell them what it does)

jlaneve · 2023-10-19T14:46:12Z

This is a great discussion! @tatiana and I talked and I think we have a plan:

in this PR, let's introduce this as a configurable option but not change the default yet. I don't want this discussion to block @david-mag from fixing their implementation
I'm going to open up an issue (or multiple) to continue this conversation, because I think there are a few aspects to this (execution, topology, etc) and I want to ensure we can continue the discussion around this
after we continue the conversation, we can open up follow-on PRs to change the default, topology, and other things that come out of the discussion

@david-mag thank you for opening this PR so quickly with such a great description and justification! The only change I'd make is instead of manually setting the mode to buildable, can we make this a user-specified, optional parameter? For current use cases, then, nothing would change; but you could modify your DAG to include:

    ...,
    indirect_selection="buildable",

Do you also mind throwing together a quick docs page in the configuration section? Either in the testing behavior page or an independent page, but it doesn't need to be anything fancy - can just link out to the dbt docs page you linked in the PR. I'm also happy to do this part if you'd prefer.

Great work!

david-mag · 2023-10-19T17:31:32Z

Yeah, I can expose the setting. Where do you think would be the best place? In the RenderConfig?

jlaneve · 2023-10-19T17:36:39Z

Yeah, I can expose the setting. Where do you think would be the best place? In the RenderConfig?

Hmm I'd start with ExecutionConfig the way it is, because it doesn't change rendering - only execution

…-cosmos into buildable_dbt_tests

david-mag · 2023-10-20T06:42:44Z

@jlaneve I've updated the PR, but I think I'll take you up on your offer on updating the docs ^^

tatiana

This looks great, @david-mag , thank you very much for addressing the feedback so quickly!
Since this is a bug fix, that was blocking you from using Cosmos, we'll release this as part of 1.2.1.

We'll add the necessary docs. We'll merge once the tests pass.

tatiana · 2023-10-20T11:16:35Z

@david-mag it seems some checks are failing, could you fix them, please?

FAILED tests/airflow/test_graph.py::test_build_airflow_graph_with_after_each - TypeError: build_airflow_graph() missing 1 required positional argument: 'test_indirect_selection'
FAILED tests/airflow/test_graph.py::test_build_airflow_graph_with_after_all - TypeError: build_airflow_graph() missing 1 required positional argument: 'test_indirect_selection'

https://github.com/astronomer/astronomer-cosmos/actions/runs/6579169038/job/17893989074?pr=613

david-mag · 2023-10-20T11:51:09Z

Will have a look. I didn't get integration tests working locally yet though, is there an easy setup for them?

david-mag · 2023-10-21T10:06:42Z

integration test should be fixed now, wasn't able to confirm it locally though.

tatiana · 2023-10-23T09:32:00Z

@david-mag all tests are passing! 🎉
Is there any chance you could add a test to cover the behaviour of the flag from an operator perspective, addressing the code-cov feedback:

Here is an example of how the test could be done:

astronomer-cosmos/tests/operators/test_local.py

Lines 330 to 347 in b7381c8

    
           @pytest.mark.parametrize( 
        
               "operator_class,kwargs,expected_call_kwargs", 
        
               [ 
        
                   (DbtSeedLocalOperator, {"full_refresh": True}, {"context": {}, "cmd_flags": ["--full-refresh"]}), 
        
                   (DbtRunLocalOperator, {"full_refresh": True}, {"context": {}, "cmd_flags": ["--full-refresh"]}), 
        
                   (DbtTestLocalOperator, {"full_refresh": True}, {"context": {}}), 
        
                   ( 
        
                       DbtRunOperationLocalOperator, 
        
                       {"args": {"days": 7, "dry_run": True}, "macro_name": "bla"}, 
        
                       {"context": {}, "cmd_flags": ["--args", "days: 7\ndry_run: true\n"]}, 
        
                   ), 
        
               ], 
        
           ) 
        
           @patch("cosmos.operators.local.DbtLocalBaseOperator.build_and_run_cmd") 
        
           def test_operator_execute_with_flags(mock_build_and_run_cmd, operator_class, kwargs, expected_call_kwargs): 
        
               task = operator_class(profile_config=profile_config, task_id="my-task", project_dir="my/dir", **kwargs) 
        
               task.execute(context={}) 
        
               mock_build_and_run_cmd.assert_called_once_with(**expected_call_kwargs)

We're aiming to have the 1.2.1 release with this change either today or tomorrow

david-mag · 2023-10-23T09:41:37Z

just did :), I hope that covers this ^^

…ect-selection buildable) (#613) When running tests with dbt, dbt allows users to run them in different [modes](https://docs.getdbt.com/reference/node-selection/test-selection-examples#indirect-selection). As described in the linked dbt documentation: > The "buildable", "cautious", and "empty" modes can be useful in environments when you're only building a subset of your DAG, and you want to avoid test failures in "eager" mode caused by unbuilt resources. Exactly this is the case when using cosmos when running tests with `AFTER_EACH` behaviour. When a tests depends on multiple models, a test can result in an error, because the dependent model was not run yet. Example: with a simple DAG like this `model1 -> model2`, and a test `rc_test` set up to test row-count being equal between `model1` and `model2` an error might occur when we run operations in the following order: 1. `dbt run --model model1` 2. `dbt test --model model1` -> this will result in an error, if the row count has changed between the last and this run. adding the argument `--indirect-selection buildable` in step 2 will avoid actually executing the test there, because dbt makes sure to run the test only if *all* dependent models are upstream. So we can continue without an error: 3. `dbt run --model model2` 4. `dbt test --model model2 --indirect-selection buildable` -> this will actually run the test `rc_test` in this step. By adding `--indirect-selection buildable` we have avoided running the test `rc_test` twice (in both the `model1` DagGroup and `model2` DagGroup), running it only once in `model2`. (cherry picked from commit 1bf3a3e)

@MrBones757

Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 Others * Update contributing guide docs by @raffifu in #591 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621

@MrBones757

Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 Others * Update contributing guide docs by @raffifu in #591 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621 (cherry picked from commit 52c34a2)

@MrBones757

Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 * Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625 * Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626 Others * Update contributing guide docs by @raffifu in #591 * Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621

@MrBones757

Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 * Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625 * Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626 Others * Update contributing guide docs by @raffifu in #591 * Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621 (cherry picked from commit 635fb7b)

@MrBones757

Bug fixes * Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605 * Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613 * Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599 * Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606 * Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625 * Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626 Others * Update contributing guide docs by @raffifu in #591 * Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624 * Fix running test that validates manifest-based DAGs by @tatiana in #619 * pre-commit updates in #604 and #621 (cherry picked from commit 635fb7b)

tatiana · 2023-10-25T23:00:52Z

@david-mag We just published the release 1.2.1:

run dbt tests in --indirect-selection buildable mode

0b98af6

david-mag requested a review from a team as a code owner October 18, 2023 22:25

david-mag requested a review from a team October 18, 2023 22:25

david-mag had a problem deploying to external October 18, 2023 22:25 — with GitHub Actions Error

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

45d13ca

pre-commit-ci bot temporarily deployed to external October 18, 2023 22:27 Inactive

tatiana changed the title ~~run dbt tests in --indirect-selection buildable mode~~ Run dbt tests in --indirect-selection buildable mode Oct 19, 2023

tatiana reviewed Oct 19, 2023

View reviewed changes

tatiana added this to the 1.3.0 milestone Oct 19, 2023

david-mag added 2 commits October 19, 2023 21:08

expose indirect_selection in ExecutionConfig

e0ba66b

Merge branch 'buildable_dbt_tests' of github.com:david-mag/astronomer…

4eb0a5b

…-cosmos into buildable_dbt_tests

david-mag temporarily deployed to external October 19, 2023 19:11 — with GitHub Actions Inactive

tatiana changed the title ~~Run dbt tests in --indirect-selection buildable mode~~ Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) Oct 20, 2023

tatiana approved these changes Oct 20, 2023

View reviewed changes

tatiana modified the milestones: 1.3.0, 1.2.1 Oct 20, 2023

david-mag temporarily deployed to external October 20, 2023 11:15 — with GitHub Actions Inactive

update tests for after_each / after_all

766d544

david-mag temporarily deployed to external October 20, 2023 15:30 — with GitHub Actions Inactive

add test for dbt base operator

a496d6f

david-mag had a problem deploying to external October 23, 2023 09:41 — with GitHub Actions Error

add None test as well

1a187f6

david-mag temporarily deployed to external October 23, 2023 09:49 — with GitHub Actions Inactive

tatiana approved these changes Oct 23, 2023

View reviewed changes

tatiana merged commit 1bf3a3e into astronomer:main Oct 23, 2023
40 checks passed

tatiana mentioned this pull request Oct 23, 2023

Fix Airflow DAG topology when test depends on multiple models #620

Closed

tatiana mentioned this pull request Oct 25, 2023

Release 1.2.1 #627

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) #613

Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) #613

david-mag commented Oct 18, 2023

netlify bot commented Oct 18, 2023 •

edited

Loading

codecov bot commented Oct 19, 2023 •

edited

Loading

tatiana left a comment

david-mag commented Oct 19, 2023 •

edited

Loading

jensenity commented Oct 19, 2023

david-mag commented Oct 19, 2023 •

edited

Loading

jlaneve commented Oct 19, 2023

david-mag commented Oct 19, 2023

jlaneve commented Oct 19, 2023

david-mag commented Oct 20, 2023

tatiana left a comment

tatiana commented Oct 20, 2023

david-mag commented Oct 20, 2023

david-mag commented Oct 21, 2023

tatiana commented Oct 23, 2023

david-mag commented Oct 23, 2023

tatiana commented Oct 25, 2023

Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) #613

Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) #613

Conversation

david-mag commented Oct 18, 2023

Description

Checklist

netlify bot commented Oct 18, 2023 • edited Loading

👷 Deploy Preview for amazing-pothos-a3bca0 processing.

codecov bot commented Oct 19, 2023 • edited Loading

Codecov Report

tatiana left a comment

Choose a reason for hiding this comment

david-mag commented Oct 19, 2023 • edited Loading

jensenity commented Oct 19, 2023

david-mag commented Oct 19, 2023 • edited Loading

jlaneve commented Oct 19, 2023

david-mag commented Oct 19, 2023

jlaneve commented Oct 19, 2023

david-mag commented Oct 20, 2023

tatiana left a comment

Choose a reason for hiding this comment

tatiana commented Oct 20, 2023

david-mag commented Oct 20, 2023

david-mag commented Oct 21, 2023

tatiana commented Oct 23, 2023

david-mag commented Oct 23, 2023

tatiana commented Oct 25, 2023

netlify bot commented Oct 18, 2023 •

edited

Loading

codecov bot commented Oct 19, 2023 •

edited

Loading

david-mag commented Oct 19, 2023 •

edited

Loading

david-mag commented Oct 19, 2023 •

edited

Loading