add flag --no-skip-on-failure for command build, run, retry, seed #9020
base: main
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #9020       +/-   ##
===========================================
- Coverage   88.13%   61.90%   -26.24%
===========================================
  Files         178      178
  Lines       22456    22464        +8
===========================================
- Hits        19792    13906     -5886
- Misses       2664     8558     +5894

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
force-pushed from f88c041 to af96e00
force-pushed from af96e00 to 362da1c
@leo-schick do you know where this PR stands today? Looks like it would be great for our use case.
It works.
@leo-schick you said it works, but you didn't specify on which version of DBT.
@i-brahimi It works on the main development branch, which uses version tag
@jtcohen6 There are so many people using DBT waiting for this feature!
@leo-schick, thank you so much for opening this PR! I just ran into the same issue so I'm thrilled to see that someone is already working on it.
force-pushed from 4321942 to 3ec1e03
@cwcartmell all requested changes are done now.
Has there been any word on getting this feature into dbt?
I've had some more time on my hands, so I came back to this PR post-dbt 1.8.0. I think changes that have been made recently have broken this PR. When I pull down the fork and pull upstream main into the branch, I get errors when trying to run dbt commands:
$ dbt test
01:42:42 Running with dbt=1.9.0-a1
01:42:42 Registered adapter: postgres=1.9.0-a1
01:42:42 Unable to do partial parsing because profile has changed
01:42:43 Found 2 models, 4 data tests, 413 macros
01:42:43
01:42:44 Concurrency: 4 threads (target='dev')
01:42:44
01:42:44 1 of 4 START test not_null_my_first_dbt_model_id ............................... [RUN]
01:42:44 2 of 4 START test not_null_my_second_dbt_model_id .............................. [RUN]
01:42:44 3 of 4 START test unique_my_first_dbt_model_id ................................. [RUN]
01:42:44 4 of 4 START test unique_my_second_dbt_model_id ................................ [RUN]
01:42:44 2 of 4 ERROR not_null_my_second_dbt_model_id ................................... [ERROR in 0.12s]
01:42:44 1 of 4 ERROR not_null_my_first_dbt_model_id .................................... [ERROR in 0.13s]
01:42:44 4 of 4 ERROR unique_my_second_dbt_model_id ..................................... [ERROR in 0.09s]
01:42:44 3 of 4 ERROR unique_my_first_dbt_model_id ...................................... [ERROR in 0.10s]
Exception in thread Thread-7:
Traceback (most recent call last):
File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/multiprocessing/pool.py", line 592, in _handle_results
cache[job]._set(i, obj)
File "/home/conrad/.pyenv/versions/3.8.13/lib/python3.8/multiprocessing/pool.py", line 776, in _set
self._callback(self._value)
File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 281, in callback
self._handle_result(result)
File "/home/conrad/Documents/dbt-core/core/dbt/task/compile.py", line 132, in _handle_result
super()._handle_result(result)
File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 338, in _handle_result
self._mark_dependent_errors(node.unique_id, result, cause)
File "/home/conrad/Documents/dbt-core/core/dbt/task/runnable.py", line 421, in _mark_dependent_errors
no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
File "/home/conrad/Documents/dbt-core/core/dbt/cli/flags.py", line 360, in __getattr__
return super().__getattribute__(name) # type: ignore
AttributeError: 'Flags' object has no attribute 'NO_SKIP_ON_FAILURE'
Seems pretty unfortunate. This is a pretty critical feature to prevent one failure from causing everything to fail.
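The AttributeError above comes from the branch reading a flag that is no longer registered after merging upstream main. A minimal defensive sketch of how that crash could be avoided, assuming the flag should simply fall back to today's default behavior when it is absent (illustration only, not the PR's actual code):

    def should_skip_children(flags) -> bool:
        # getattr with a default avoids the AttributeError seen in the traceback
        # when NO_SKIP_ON_FAILURE was never registered; False preserves the
        # current behavior of skipping downstream nodes on upstream failure.
        return not getattr(flags, "NO_SKIP_ON_FAILURE", False)

Called with the object returned by get_flags(), this would let the branch keep running even when the flag definition has been lost in a rebase.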
@cwcartmell Thanks for checking and pointing this out. I am actually quite disappointed with the dbt dev team here. This pull request has now been open for over 6 months (the original ticket is from Feb. 2020) and, as far as I can see, there has been no response from the dbt team on whether this feature is wanted or not, or whether it should be implemented in another way. It is not such a difficult code change, and the community responses tell the story that many people would like a feature to keep processing downstream models even when an upstream model fails. But the lack of response from the dbt team is just disappointing to me right now, and I don't know whether it is worth fixing this now or whether I should first spend the time getting someone from the dbt core team to respond on whether such a feature is wanted or not (e.g. by design decision).
@leo-schick I'm sorry for the delayed response here. This issue had been open for a long time; your PR has helped it gain renewed traction.
The last time I thought about this in detail was ~2 years ago, and we've been trying to define a clearer, narrower, more maintainable scope for dbt-core since then. I've become warier of feature requests that feel like they go against some of dbt's core principles, one of which is the integrity of the DAG. Whether to skip downstream nodes on upstream errors is less of a dogmatic concern than allowing node-level cycles, or running the DAG out of order, but it's one of the simplifying assumptions that has gotten us a lot over time.
That being said:
- It’s already possible to achieve the end state described here, by retrying a failed job while simply excluding the failing models from the previous run. Users have pointed out that this is already possible (retry from point of failure, skip point of failure, keep building downstream) — but it requires quite a bit of custom work(around).
- This is analogous to marking a test as “warn”: notify me that it detected failures, but don’t stop the DAG for it.
I reread #2142 to remind myself of the use cases we're solving for, and the majority of the examples provided deal with issues in upstream data ingestion that can cause flaky model failures. People would rather have stale data coming from one place than stale transformations across the board. There is a judgment call here, model by model. To that end, I am more convinced by the value of this as a model-level config, defined on the parent model, than as a runtime-level config set for all models at once.
Previously I said:
If we decide to support both, I'd see this being similar to how the --full-refresh flag / full_refresh model config work today (docs), where the model-level config (if set) takes precedence over the run-level flag.
That doesn't feel right to me. It's good insofar as it's consistent with full-refresh and store-failures, but I don't think we nailed the interplay of the full-refresh flag and the model-level config.
I could see this as behavior that wants to differ across environments: dbt should always skip (strict) in dev/CI, but allow specific downstream models to build in deployment environments. It would be possible to accomplish that with conditional Jinja configs, but it is probably better to have some rule that combines both a model configuration and a flag setting.
So here are the options I see, in order of preference (1 then 2 then 3 then 4):
- By default, any model in any environment skips children on failures. A specific flaky model can opt into on_error: continue for a specific environment.
- By default, any model in any environment skips children on failure. A specific model can opt into on_error: continue for all environments (dev, prod, CI, etc).
- By default, any model in any environment skips children on failures. In some invocations/environments, you can optionally switch this behavior for all models to on_error: continue.
- By default, any model in any environment skips children on failures. For some models, you can set them to always skip or always continue. In some invocations/environments, can also optionally switch the behavior for all other (unset) models.
This PR implements (3), with the option to add in node-level config later on (4). I don’t think we could go from 3 -> 2 without a breaking change, since the continue-on-skip behavior becomes “AND” instead of “OR.”
Where I'm landing is: I'm open to adding this as a node-level config (exception not the rule), but I don't think we should start with the global runtime flag. If we decide a flag is appropriate to support conditional per-environment behavior ("strict" in dev/CI, "forgiving" in Prod), we can do that later on.
Next steps:
- Do you (and the folks who have been following this thread) agree with my line of reasoning above?
- Would you (or someone else) be interested in carrying that work forward? Again, apologies for the delay on our end, and I understand if it means you're no longer able or interested.
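To make the node-level idea concrete, here is a minimal sketch of how a per-node on_error setting could be combined with an invocation-level flag (roughly option 4 above). The config key on_error and the helper below are assumptions for illustration, not dbt-core's actual API:

    from typing import Any, Dict, Iterable, List

    def children_to_skip(
        failed_node_config: Dict[str, Any],
        dependent_node_ids: Iterable[str],
        no_skip_on_failure: bool = False,
    ) -> List[str]:
        # The node-level setting wins; the invocation-level flag is only a fallback.
        on_error = failed_node_config.get("on_error")  # hypothetical config key
        if on_error == "continue" or (on_error is None and no_skip_on_failure):
            return []  # downstream nodes keep running
        return list(dependent_node_ids)  # default: skip all children on failure

For example, children_to_skip({"on_error": "continue"}, ["model.downstream"]) returns an empty list, i.e. the flaky parent no longer blocks its children, while an unset config and an unset flag keep today's skipping behavior.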
for dep_node_id in self.graph.get_dependent_nodes(UniqueId(node_id)):
    no_skip_on_failure = get_flags().NO_SKIP_ON_FAILURE
    if not no_skip_on_failure:
        self._skipped_children[dep_node_id] = cause
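For context, a flag read via get_flags() also has to be registered on the CLI side for each command that supports it. A rough sketch of what that boolean option could look like with Click, which dbt's CLI is built on (the option name, env var, and default here are assumptions, not the PR's actual code):

    import click

    # Hypothetical definition, mirroring how boolean global options are commonly
    # declared with Click; default=False keeps today's skip-on-failure behavior.
    no_skip_on_failure = click.option(
        "--no-skip-on-failure/--skip-on-failure",
        envvar="DBT_NO_SKIP_ON_FAILURE",
        default=False,
        help="If set, do not skip downstream nodes when an upstream node fails.",
    )

If this option is missing from a command (for example after a rebase), get_flags() will not expose the attribute, which matches the AttributeError reported earlier in this thread.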
If I'm thinking about this right, it would also have the effect for dbt build of not skipping downstream models if an upstream test fails. That doesn't feel desirable as a global effect.
For my team, our need is specifically for a global runtime flag. The majority of our pipeline is constructed in a way where we still want to run the entire pipeline regardless of the state of parent models, and leave the decision making about the validity of that data to a later point. As an example, there might be one column in a later model that is impacted by a parent having incorrect data or not running due to an error, but that does not mean we would not want to refresh the table as a whole anyway. It is more beneficial to report that inconsistency in a different manner than to simply not have that child model run. As a direct consequence of not having this, we are making the decision to maintain specific DAG tasks per model/test, which is incredibly inefficient and butchers a lot of the great features dbt provides.
@jtcohen6 Thanks for reviving this issue! I wanted to provide one use case where it would be helpful to be able to set this on the child model rather than the parent: We maintain a bunch of user attributes in individual models so we can more easily control each one. We then have an incremental model that unions the new records from all the attribute models so we can sync to an external source. If one of the attribute models fails, we still want to run the final model and insert the records from all the models that succeeded. In this case, I think it would make the most sense to set the skip-failure rule at the child level instead of the parent. Sure, I could set the flag on the folder level for all the attribute models. I think a good parallel to how I envision this would be Airflow's
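The Airflow reference above is cut off in the extracted thread; assuming it refers to Airflow-style trigger rules, here is a purely illustrative sketch (neither Airflow nor dbt code) of what a child-level rule could mean:

    from typing import Dict

    def child_should_run(parent_states: Dict[str, str], trigger_rule: str = "all_success") -> bool:
        # parent_states maps parent model name -> "success" | "error" | "skipped"
        if trigger_rule == "all_done":
            # Run as long as every parent finished, even if some of them failed.
            return all(state in ("success", "error") for state in parent_states.values())
        # Default: run only when every parent succeeded.
        return all(state == "success" for state in parent_states.values())

    # One failed attribute model no longer blocks the unioning child:
    print(child_should_run({"attr_a": "success", "attr_b": "error"}, "all_done"))  # True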
resolves #2142
Problem
When an upstream model fails, downstream models are skipped. Sometimes, e.g. when working in big data warehouses, it might be desirable to proceed with downstream models even when an upstream model has failed.
Solution
Adds a new flag --no-skip-on-failure for the dbt commands build, run, retry, clone and seed. If set, downstream models will not be added to self._skipped_children and therefore will not be skipped.

Note: user docs might need to be updated.
Testing
I created an initial project with two additional models:
- must_fail - always fails
- dependent_model - uses the failing model

Test 1 - run without the new flag
The model dependent_model is still skipped.

Test 2 - run with flag --no-skip-on-failure
The model dependent_model is now executed, but fails since the table for must_fail does not exist. (This is expected behavior. In a real-world scenario the failing model's table might already exist from a previous execution, and the model dependent_model would then process successfully.)

Test 3 - run retry with flag --no-skip-on-failure
The failing models are executed as expected.

Checklist