Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) #613

Merged
merged 7 commits into from
Oct 23, 2023

Conversation

david-mag
Copy link
Contributor

Description

When running tests with dbt, dbt allows users to run them in different modes. As described in the linked dbt documentation:

The "buildable", "cautious", and "empty" modes can be useful in environments when you're only building a subset of your DAG, and you want to avoid test failures in "eager" mode caused by unbuilt resources.

Exactly this is the case when using cosmos when running tests with AFTER_EACH behaviour. When a tests depends on multiple models, a test can result in an error, because the dependent model was not run yet.

Example:
with a simple DAG like this model1 -> model2, and a test rc_test set up to test row-count being equal between model1 and model2 an error might occur when we run operations in the following order:

  1. dbt run --model model1
  2. dbt test --model model1 -> this will result in an error, if the row count has changed between the last and this run.

adding the argument --indirect-selection buildable in step 2 will avoid actually executing the test there, because dbt makes sure to run the test only if all dependent models are upstream. So we can continue without an error:

  1. dbt run --model model2
  2. dbt test --model model2 --indirect-selection buildable -> this will actually run the test rc_test in this step.

By adding --indirect-selection buildable we have avoided running the test rc_test twice (in both the model1 DagGroup and model2 DagGroup), running it only once in model2.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated a test that was affected by this change.

@david-mag david-mag requested a review from a team as a code owner October 18, 2023 22:25
@david-mag david-mag requested a review from a team October 18, 2023 22:25
@netlify
Copy link

netlify bot commented Oct 18, 2023

👷 Deploy Preview for amazing-pothos-a3bca0 processing.

Name Link
🔨 Latest commit 1a187f6
🔍 Latest deploy log https://app.netlify.com/sites/amazing-pothos-a3bca0/deploys/6536419fa146020008301e63

@pre-commit-ci pre-commit-ci bot temporarily deployed to external October 18, 2023 22:27 Inactive
@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (c82e6bc) 93.21% compared to head (1a187f6) 93.25%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #613      +/-   ##
==========================================
+ Coverage   93.21%   93.25%   +0.03%     
==========================================
  Files          53       53              
  Lines        2079     2091      +12     
==========================================
+ Hits         1938     1950      +12     
  Misses        141      141              
Files Coverage Δ
cosmos/airflow/graph.py 100.00% <100.00%> (ø)
cosmos/config.py 93.81% <100.00%> (+0.06%) ⬆️
cosmos/constants.py 100.00% <100.00%> (ø)
cosmos/converter.py 97.14% <100.00%> (+0.04%) ⬆️
cosmos/operators/base.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tatiana tatiana changed the title run dbt tests in --indirect-selection buildable mode Run dbt tests in --indirect-selection buildable mode Oct 19, 2023
Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @david-mag !

Thank you for bringing up the issue and promptly writing a PR to fix it.

I find this a very interesting problem, and I agree that:

  1. We must solve the problem of tests that depend on multiple dbt nodes, without raising an error
  2. We should allow users to customize tests --indirect-selection flag

What I'm not convinced is:
a) That we should set --indirect-selection=buildable by default.
b) That using --indirect-selection=buildable would give us an accurate DAG representation of what is going on.

Given the problem you described:
with a simple DAG like this model1 -> model2, and a test rc_test set up to test row-count being equal between model1 and model2

Which would make sense from a DAG topology perspective?
i) To have a single Airflow test task depending on both models' Airflow run both model tasks.
ii) To have two Airflow test nodes - one running for each model run task - but one may not be running all the tests for that given model, but it will still be considered a successful Airflow task, even if that was the only test being run for that specific model.
iii) To have three Airflow test nodes - one running tests that are exclusive to model1 (we could continue using the TaskGroup approach we have, in this sense), other running tests exclusive to model2, and a third Airflow task running test that depends on both models.

With the solution in the current PR, we'd have (ii), but I believe that (iii) is the most accurate approach to solving the problem.

Regarding having (a) as default, my main concern is that this is not the default behaviour in dbt itself - and I believe there is a good reason. It may hide underlying issues.

To implement (a), we could expose the indirect_selection flag in a similar way to how we're handling full_refresh:
https://github.com/astronomer/astronomer-cosmos/blob/main/cosmos/operators/local.py#L112

I would love to hear your thoughts on these points before we agree on a way to move forward.

@david-mag
Copy link
Contributor Author

david-mag commented Oct 19, 2023

I guess there is a discussion to be had on the DAG topology, and (iii) would make the most sense from that perspective for sure.

However, the way I see it, this would put another layer of complexity to the DAG/graph creation that I'm not sure is warranted atm (maybe it will be in the future). As an example: in the manifest.yaml file, tests are not tagged. When you run dbt test --select/exclude tag:some_tag, dbt will figure out which tests are connected to which models with the relevant tags and run those. That is something that cosmos would then have to "recreate", which will most likely need several iterations to get completely right. Relying on dbt itself is imho the safest and probably also a future-proof solution.

As for b): in the rather complicated dbt project I'm working with, when running dbt build on the entire model (which is basically what we would do running a DAG from start to finish), the test in question is run only once and after all dependent models are finished. Thus using --indirect-selection=buildable is actually doing what dbt does. As for the DAG representation, we of course will end up with some (probably rare) cases where the test part of a DagGroup would do basically nothing (which is a "misrepresentation" to some degree), but I believe that is a minor issue. Given the fact that it seems nobody else ran into this issue before me (at least nobody that mentioned it), most ppl run tests that don't depend on multiple models. For those tests --indirect-selection=buildable would change nothing.

As to why dbt uses --indirect-selection=eager by default, I'm unsure. I can only guess, that their "default" interaction with dbt is assumed to be running the entire model, or downstream (--select model+), for which --indirect-selection=buildable and --indirect-selection=eager should by the same.

That being said, for me the cosmos default of --indirect-selection=buildable for tests makes sense. Exposing this flag to be changed by the user would obviously not hurt. I just implemented it statically to get the change out there, since certain DAGs will be broken the way we build them atm. If we want to this from the start, or further down the line, of course, is up to you guys to decide ^^

@jensenity
Copy link

I agree with @tatiana as well that we should not make the buildable as default since not many of us are familiar with the indirect selection and if the default mode is diff from the dbt trst default then it will cause more confusion. But i do agree with adding a flag to state what kind of test you want to run for model1 and for model2.

@david-mag
Copy link
Contributor Author

david-mag commented Oct 19, 2023

Honestly, I believe for ppl that run into the same problem I did, it will be more confusing not to have buildable as default. If you are used to working purely with dbt, where you run DAGs very differently than we are forced to do within airflow (each model runs separately), tests would never be run twice.

It took me a minute to really figure out what was going on. For ppl just switching to cosmos, encountering an error for a test that never failed before, will reduce their trust in cosmos itself.

I think the way airflow forces us to run model by model really makes defaulting to buildable kind of a requirement here, even if we allow ppl to choose. (Again, ppl that run tests without any interactions between models won't see a difference, and if they are curious about this --indirect-selection=buildable thingy they see in the airflow logs if they look very carefully, a quick google search will tell them what it does)

@jlaneve
Copy link
Collaborator

jlaneve commented Oct 19, 2023

This is a great discussion! @tatiana and I talked and I think we have a plan:

  • in this PR, let's introduce this as a configurable option but not change the default yet. I don't want this discussion to block @david-mag from fixing their implementation
  • I'm going to open up an issue (or multiple) to continue this conversation, because I think there are a few aspects to this (execution, topology, etc) and I want to ensure we can continue the discussion around this
  • after we continue the conversation, we can open up follow-on PRs to change the default, topology, and other things that come out of the discussion

@david-mag thank you for opening this PR so quickly with such a great description and justification! The only change I'd make is instead of manually setting the mode to buildable, can we make this a user-specified, optional parameter? For current use cases, then, nothing would change; but you could modify your DAG to include:

    ...,
    indirect_selection="buildable",

Do you also mind throwing together a quick docs page in the configuration section? Either in the testing behavior page or an independent page, but it doesn't need to be anything fancy - can just link out to the dbt docs page you linked in the PR. I'm also happy to do this part if you'd prefer.

Great work!

@tatiana tatiana added this to the 1.3.0 milestone Oct 19, 2023
@david-mag
Copy link
Contributor Author

Yeah, I can expose the setting. Where do you think would be the best place? In the RenderConfig?

@jlaneve
Copy link
Collaborator

jlaneve commented Oct 19, 2023

Yeah, I can expose the setting. Where do you think would be the best place? In the RenderConfig?

Hmm I'd start with ExecutionConfig the way it is, because it doesn't change rendering - only execution

@david-mag david-mag temporarily deployed to external October 19, 2023 19:11 — with GitHub Actions Inactive
@david-mag
Copy link
Contributor Author

@jlaneve I've updated the PR, but I think I'll take you up on your offer on updating the docs ^^

@tatiana tatiana changed the title Run dbt tests in --indirect-selection buildable mode Fix running dbt tests that depend on multiple models (support --indirect-selection buildable) Oct 20, 2023
Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, @david-mag , thank you very much for addressing the feedback so quickly!
Since this is a bug fix, that was blocking you from using Cosmos, we'll release this as part of 1.2.1.

We'll add the necessary docs. We'll merge once the tests pass.

@tatiana tatiana modified the milestones: 1.3.0, 1.2.1 Oct 20, 2023
@david-mag david-mag temporarily deployed to external October 20, 2023 11:15 — with GitHub Actions Inactive
@tatiana
Copy link
Collaborator

tatiana commented Oct 20, 2023

@david-mag it seems some checks are failing, could you fix them, please?

FAILED tests/airflow/test_graph.py::test_build_airflow_graph_with_after_each - TypeError: build_airflow_graph() missing 1 required positional argument: 'test_indirect_selection'
FAILED tests/airflow/test_graph.py::test_build_airflow_graph_with_after_all - TypeError: build_airflow_graph() missing 1 required positional argument: 'test_indirect_selection'

https://github.com/astronomer/astronomer-cosmos/actions/runs/6579169038/job/17893989074?pr=613

@david-mag
Copy link
Contributor Author

Will have a look. I didn't get integration tests working locally yet though, is there an easy setup for them?

@david-mag david-mag temporarily deployed to external October 20, 2023 15:30 — with GitHub Actions Inactive
@david-mag
Copy link
Contributor Author

integration test should be fixed now, wasn't able to confirm it locally though.

@tatiana
Copy link
Collaborator

tatiana commented Oct 23, 2023

@david-mag all tests are passing! 🎉
Is there any chance you could add a test to cover the behaviour of the flag from an operator perspective, addressing the code-cov feedback:
Screenshot 2023-10-23 at 10 29 11

Here is an example of how the test could be done:

@pytest.mark.parametrize(
"operator_class,kwargs,expected_call_kwargs",
[
(DbtSeedLocalOperator, {"full_refresh": True}, {"context": {}, "cmd_flags": ["--full-refresh"]}),
(DbtRunLocalOperator, {"full_refresh": True}, {"context": {}, "cmd_flags": ["--full-refresh"]}),
(DbtTestLocalOperator, {"full_refresh": True}, {"context": {}}),
(
DbtRunOperationLocalOperator,
{"args": {"days": 7, "dry_run": True}, "macro_name": "bla"},
{"context": {}, "cmd_flags": ["--args", "days: 7\ndry_run: true\n"]},
),
],
)
@patch("cosmos.operators.local.DbtLocalBaseOperator.build_and_run_cmd")
def test_operator_execute_with_flags(mock_build_and_run_cmd, operator_class, kwargs, expected_call_kwargs):
task = operator_class(profile_config=profile_config, task_id="my-task", project_dir="my/dir", **kwargs)
task.execute(context={})
mock_build_and_run_cmd.assert_called_once_with(**expected_call_kwargs)

We're aiming to have the 1.2.1 release with this change either today or tomorrow

@david-mag
Copy link
Contributor Author

just did :), I hope that covers this ^^

@david-mag david-mag temporarily deployed to external October 23, 2023 09:49 — with GitHub Actions Inactive
@tatiana tatiana merged commit 1bf3a3e into astronomer:main Oct 23, 2023
40 checks passed
tatiana pushed a commit that referenced this pull request Oct 25, 2023
…ect-selection buildable) (#613)

When running tests with dbt, dbt allows users to run them in different
[modes](https://docs.getdbt.com/reference/node-selection/test-selection-examples#indirect-selection).
As described in the linked dbt documentation:
> The "buildable", "cautious", and "empty" modes can be useful in
environments when you're only building a subset of your DAG, and you
want to avoid test failures in "eager" mode caused by unbuilt resources.

Exactly this is the case when using cosmos when running tests with
`AFTER_EACH` behaviour. When a tests depends on multiple models, a test
can result in an error, because the dependent model was not run yet.

Example:
with a simple DAG like this `model1 -> model2`, and a test `rc_test` set
up to test row-count being equal between `model1` and `model2` an error
might occur when we run operations in the following order:
1. `dbt run --model model1`
2. `dbt test --model model1` -> this will result in an error, if the row
count has changed between the last and this run.

adding the argument `--indirect-selection buildable` in step 2 will
avoid actually executing the test there, because dbt makes sure to run
the test only if *all* dependent models are upstream. So we can continue
without an error:

3. `dbt run --model model2`
4. `dbt test --model model2 --indirect-selection buildable` -> this will
actually run the test `rc_test` in this step.

By adding `--indirect-selection buildable` we have avoided running the
test `rc_test` twice (in both the `model1` DagGroup and `model2`
DagGroup), running it only once in `model2`.

(cherry picked from commit 1bf3a3e)
tatiana added a commit that referenced this pull request Oct 25, 2023
Bug fixes

* Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605
* Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613
* Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599
* Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606

Others

* Update contributing guide docs by @raffifu in #591
* Fix running test that validates manifest-based DAGs by @tatiana in #619
* pre-commit updates in #604 and #621
tatiana added a commit that referenced this pull request Oct 25, 2023
Bug fixes

* Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605
* Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613
* Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599
* Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606

Others

* Update contributing guide docs by @raffifu in #591
* Fix running test that validates manifest-based DAGs by @tatiana in #619
* pre-commit updates in #604 and #621

(cherry picked from commit 52c34a2)
tatiana added a commit that referenced this pull request Oct 25, 2023
Bug fixes

* Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605
* Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613
* Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599
* Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606
* Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625
* Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626

Others

* Update contributing guide docs by @raffifu in #591
* Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624
* Fix running test that validates manifest-based DAGs by @tatiana in #619
* pre-commit updates in #604 and #621
tatiana added a commit that referenced this pull request Oct 25, 2023
Bug fixes

* Resolve errors occurring when `dbt_project_path` is str and partial support `dbt_project_path=None` by @MrBones757 in #605
* Fix running dbt tests that depend on multiple models (support `--indirect-selection` buildable) by @david-mag in #613
* Add tests to sources, snapshots and seeds when using `TestBehavior.AFTER_EACH` by @tatiana in #599
* Fix custom selector when select has a subset of tags of the models' tags by @david-mag in #606
* Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when `ProfileMapping` is used by @tatiana in #625
* Fix failure if `openlineage-common` raises a jinja exception by @tatiana in #626

Others

* Update contributing guide docs by @raffifu in #591
* Remove unnecessary stack trace from Cosmos initialization by @tatiana in #624
* Fix running test that validates manifest-based DAGs by @tatiana in #619
* pre-commit updates in #604 and #621

(cherry picked from commit 635fb7b)
@tatiana tatiana mentioned this pull request Oct 25, 2023
tatiana added a commit that referenced this pull request Oct 25, 2023
Bug fixes

* Resolve errors occurring when `dbt_project_path` is str and partial
support `dbt_project_path=None` by @MrBones757 in #605
* Fix running dbt tests that depend on multiple models (support
`--indirect-selection` buildable) by @david-mag in #613
* Add tests to sources, snapshots and seeds when using
`TestBehavior.AFTER_EACH` by @tatiana in #599
* Fix custom selector when select has a subset of tags of the models'
tags by @david-mag in #606
* Fix `LoadMode.AUTOMATIC` behaviour to use `LoadMode.DBT_LS` when
`ProfileMapping` is used by @tatiana in #625
* Fix failure if `openlineage-common` raises a jinja exception by
@tatiana in #626

Others

* Update contributing guide docs by @raffifu in #591
* Remove unnecessary stack trace from Cosmos initialization by @tatiana
in #624
* Fix running test that validates manifest-based DAGs by @tatiana in
#619
* pre-commit updates in #604 and #621

(cherry picked from commit 635fb7b)
@tatiana
Copy link
Collaborator

tatiana commented Oct 25, 2023

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants