Change metadata_vars `if not` to `if ... is None` #37

truls-p · 2024-01-23T19:50:25Z

(Opened this PR instead of dbt-labs/dbt-core#9376)

Both None and {} evaluate to True when using if not metadata_vars. This caused us to loop over the environment variables unnecessarily. This provides a significant performance increase for large projects.

resolves #

Relates to dbt-labs/dbt-core#6073

Problem

Poor performance with large models.

I was trying to profile a large model and noticed get_metadata_vars is called many times and we were spending quite a bit of time in in a dictcomp. I then realised that whenever we do not have any environment variables with the "DBT_ENV_CUSTOM_ENV_" prefix the metadata_vars dictionary will be set to {} the first time get_metadata_vars is called. On subsequent calls, we continue to loop over the environment variables due to the if condition, if not metadata_vars, which evaluates to True for both None and {}.

Solution

Solution was to change if not metadata_vars to if metadata_vars is None. This ensures we only loop through the environment variables looking for the prefix once.

Here is a test that could be run before dbt-common was split off to its own repo and the pytest-benchmark results on my machine. Note that this test may take up to 30 min to complete. Please let me know if you think a shorter test (perhaps one that just ensures we do not unnecessarily loop through os.environ multiple times) should be included in this PR.

import pytest
import os

from dbt.tests.util import run_dbt
from dbt.tests.fixtures.project import write_project_files

import dbt.common.events.functions

from tests.fixtures.dbt_integration_project import dbt_integration_project  # noqa: F401
from tests.functional.graph_selection.fixtures import SelectionFixtures

NUM_TABLES = 500
metadata_vars = None
_METADATA_ENV_PREFIX = "DBT_ENV_CUSTOM_ENV_"


def run_schema_and_assert(project, include, exclude, expected_tests):
    # deps must run before seed
    run_dbt(["deps"])
    run_dbt(["seed"])
    results = run_dbt(["build", "--exclude", "never_selected"])
    assert len(results) == 18 + NUM_TABLES

    test_args = ["test"]
    if include:
        test_args += ["--select", include]
    if exclude:
        test_args += ["--exclude", exclude]
    test_results = run_dbt(test_args)

    ran_tests = sorted([test.node.name for test in test_results])
    expected_sorted = sorted(expected_tests)

    assert ran_tests == expected_sorted

def get_metadata_vars_is():
    global metadata_vars
    if metadata_vars is None:
        metadata_vars = {
            k[len(_METADATA_ENV_PREFIX) :]: v
            for k, v in os.environ.items()
            if k.startswith(_METADATA_ENV_PREFIX)
        }
    return metadata_vars

def get_metadata_vars_not():
    global metadata_vars
    if not metadata_vars:
        metadata_vars = {
            k[len(_METADATA_ENV_PREFIX) :]: v
            for k, v in os.environ.items()
            if k.startswith(_METADATA_ENV_PREFIX)
        }
    return metadata_vars

class TestPerformanceManyTables(SelectionFixtures):
    @pytest.fixture(scope="class", autouse=True)
    def setUp(self, project_root, dbt_integration_project):  # noqa: F811

        sql_string = "select * from {{ this.schema }}.seed"
        for i in range(NUM_TABLES):
            dbt_integration_project["models"][f"{i}.sql"] = sql_string
        write_project_files(project_root, "dbt_integration_project", dbt_integration_project)

    @pytest.fixture(scope="class")
    def packages(self):
        return {"packages": [{"local": "dbt_integration_project"}]}


    @pytest.mark.skipif(os.environ.get('DBT_PERF_TEST') is None, reason="DBT_PERF_TEST not set")
    def test_performance_many_tables_specify_tag_not(self, project, benchmark, monkeypatch):
        monkeypatch.setattr(dbt.common.events.functions, 'get_metadata_vars', get_metadata_vars_not)
        benchmark.pedantic(run_schema_and_assert, args=(project, "tag:bi", None, ["unique_users_id", "unique_users_rollup_gender"]), iterations=5)


    @pytest.mark.skipif(os.environ.get('DBT_PERF_TEST') is None, reason="DBT_PERF_TEST not set")
    def test_performance_many_tables_specify_tag_is(self, project, benchmark, monkeypatch):
        monkeypatch.setattr(dbt.common.events.functions, 'get_metadata_vars', get_metadata_vars_is)
        benchmark.pedantic(run_schema_and_assert, args=(project, "tag:bi", None, ["unique_users_id", "unique_users_rollup_gender"]), iterations=5)

Checklist

I have read the contributing guide and understand what's expected of me
I have run this code in development and it appears to resolve the stated issue
This PR includes tests, or tests are not required/relevant for this PR
This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
This PR includes type annotations for new and modified functions

Both None and {} evaluates to True when using if not metadata_vars. This caused us to loop over the environment variables unnecessarily. This provides a significant performance increase for large projects.

peterallenwebb · 2024-01-24T17:23:12Z

@truls-p Thank you for you contribution to dbt-core!

Change metadata_vars if not to if ... is None

8a2bd16

Both None and {} evaluates to True when using if not metadata_vars. This caused us to loop over the environment variables unnecessarily. This provides a significant performance increase for large projects.

truls-p marked this pull request as ready for review January 23, 2024 19:52

truls-p mentioned this pull request Jan 23, 2024

Change metadata_vars if not to if ... is None dbt-labs/dbt-core#9376

Closed

5 tasks

peterallenwebb approved these changes Jan 24, 2024

View reviewed changes

peterallenwebb merged commit 5ce19dd into dbt-labs:main Jan 24, 2024
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change metadata_vars `if not` to `if ... is None` #37

Change metadata_vars `if not` to `if ... is None` #37

truls-p commented Jan 23, 2024 •

edited

Loading

peterallenwebb commented Jan 24, 2024

Change metadata_vars if not to if ... is None #37

Change metadata_vars if not to if ... is None #37

Conversation

truls-p commented Jan 23, 2024 • edited Loading

Problem

Poor performance with large models.

Solution

Checklist

peterallenwebb commented Jan 24, 2024

Change metadata_vars `if not` to `if ... is None` #37

Change metadata_vars `if not` to `if ... is None` #37

truls-p commented Jan 23, 2024 •

edited

Loading