[SPIKE+] Improve the Performance Characteristics of add_test_edges() #10950

Closed
1 task done
peterallenwebb opened this issue Oct 30, 2024 · 2 comments · Fixed by #11092

@peterallenwebb
Contributor

Housekeeping

  • I am a maintainer of dbt-core

Short description

The add_test_edges() function is called during the dbt build command, and inserts edges into the execution graph which are meant to ensure that models downstream from a node will not run until all the tests on that node have passed.

The function is slow in certain projects, and recent data from the field show that it inflates the number of edges in the graph by a factor of six. It is slow enough that it often shows up in performance profiles, but is even more problematic in terms of memory consumption, as memory use is high enough to cause OOM crashes.
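The behavior described above can be illustrated with a minimal sketch. It is not the dbt-core implementation: the graph is a plain dict of sets, and the names `ancestors`, `add_test_edges_naive`, `edge_count`, and the toy node names are all assumptions made for illustration. For every non-test node, the sketch adds an edge from each test whose tested models all sit upstream of that node, which is enough to show how the edge count grows.

```python
from collections import defaultdict


def ancestors(graph, target):
    """Return every node with a path to `target` (excluding `target`)."""
    parents = defaultdict(set)
    for node, children in graph.items():
        for child in children:
            parents[child].add(node)
    seen, stack = set(), [target]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def add_test_edges_naive(graph, tests):
    """Return a copy of `graph` with test -> downstream-node edges added.

    graph: dict mapping node -> set of direct children (tests included)
    tests: dict mapping test name -> set of nodes the test depends on
    """
    result = {node: set(children) for node, children in graph.items()}
    for node in graph:
        if node in tests:
            continue  # only non-test nodes wait on upstream tests
        upstream = ancestors(graph, node)
        for test, tested_nodes in tests.items():
            # A test blocks `node` only if everything it tests is upstream.
            if tested_nodes <= upstream:
                result[test].add(node)
    return result


def edge_count(graph):
    return sum(len(children) for children in graph.values())


# Hypothetical project: model_1 -> model_2 -> model_3 -> model_4,
# with a single test on model_1.
toy_graph = {
    "model_1": {"model_2", "test_1"},
    "model_2": {"model_3"},
    "model_3": {"model_4"},
    "model_4": set(),
    "test_1": set(),
}
toy_tests = {"test_1": {"model_1"}}

with_test_edges = add_test_edges_naive(toy_graph, toy_tests)
print(edge_count(toy_graph), "->", edge_count(with_test_edges))
```

In this toy chain a single test adds one edge per downstream model, so the number of added edges grows with both the number of tests and the depth of the DAG.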

Acceptance criteria

  1. If possible, implement a new version of this function which adds edges to achieve the desired test-dependency behavior but inserts fewer edges and runs more quickly.
  2. Add a new behavior flag which causes the new function to be used, while retaining the old function on the default code path.
  3. Follow up by gathering data about the relative performance of the two implementations and monitoring for regressions.
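As a rough sketch of criterion 2, the new implementation could sit behind a flag that defaults to off, so the existing function stays on the default code path. The flag name `use_reduced_test_edges`, the `BehaviorFlags` class, and both function names below are hypothetical placeholders, not dbt-core's actual flag or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BehaviorFlags:
    # Hypothetical flag name; False keeps the old function as the default.
    use_reduced_test_edges: bool = False


def add_test_edges_legacy(graph):
    """Placeholder for the existing implementation."""
    return "legacy"


def add_test_edges_reduced(graph):
    """Placeholder for the new implementation that inserts fewer edges."""
    return "reduced"


def select_add_test_edges(flags):
    """Pick the implementation to run based on the behavior flag."""
    if flags.use_reduced_test_edges:
        return add_test_edges_reduced
    return add_test_edges_legacy


default_impl = select_add_test_edges(BehaviorFlags())
opt_in_impl = select_add_test_edges(BehaviorFlags(use_reduced_test_edges=True))
```

Keeping both functions callable side by side also makes criterion 3 easier, since the same graph can be passed through each implementation and the resulting edge counts and timings compared.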

Suggested Tests

Existing tests should suffice, but we should add additional tests to reduce the risks associated with the new implementation.

Impact to Other Teams

None.

Will backports be required?

No.

Context

No response

@peterallenwebb added the user docs, triage, and performance labels and removed the user docs label Oct 30, 2024
@peterallenwebb peterallenwebb self-assigned this Oct 30, 2024
@MichelleArk
Contributor

As @peterallenwebb noted, a source of complexity here is that add_test_edges currently accounts for tests that depend on multiple models, not just one. It may be difficult to take similar approaches, such as running test nodes "just in time" after a model completes during handle_job_queue, because certain tests depend on multiple models and cannot run until all of those models have completed.

@ChenyuLInx
Contributor

ChenyuLInx commented Nov 4, 2024

One thought here is to remove the transitive edges that add_test_edges inserts, e.g. test1 -> model 3.

@gshank mentioned that we could also do this operation only for the selected parts of the DAG, or skip it when people select tests in the build command.
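A minimal sketch of the transitive-edge idea, assuming the graph is a DAG stored as a dict of sets and that ordering (and skipping after a failed test) propagates along paths. The helper names `reachable` and `prune_transitive_test_edges` and the toy graph are hypothetical, and this is not the change made in dbt-core. For each test, an outgoing edge is dropped when its target is already reachable through another target of the same test.

```python
def reachable(graph, start):
    """Return every node reachable from `start` (excluding `start` in a DAG)."""
    seen, stack = set(), [start]
    while stack:
        for child in graph.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def prune_transitive_test_edges(graph, tests):
    """Drop test -> node edges that another edge from the same test implies."""
    pruned = {node: set(children) for node, children in graph.items()}
    for test in tests:
        targets = graph.get(test, set())
        implied = set()
        for target in targets:
            implied |= reachable(graph, target)
        # Keep only the targets that no other target already leads to.
        pruned[test] = targets - implied
    return pruned


# Hypothetical graph after test edges were added:
# model_1 -> model_2 -> model_3 -> model_4, and test_1 points at
# every model downstream of model_1.
graph_with_test_edges = {
    "model_1": {"model_2", "test_1"},
    "model_2": {"model_3"},
    "model_3": {"model_4"},
    "model_4": set(),
    "test_1": {"model_2", "model_3", "model_4"},
}

reduced = prune_transitive_test_edges(graph_with_test_edges, ["test_1"])
```

Here test_1 keeps only its edge to model_2; model_3 and model_4 still cannot start before test_1 finishes, because every path to them passes through model_2.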

@ChenyuLInx ChenyuLInx changed the title Improve the Performance Characteristics of add_test_edges() [SPIKE+] Improve the Performance Characteristics of add_test_edges() Nov 4, 2024
4 participants