Add percentage support for error_if, warn_if #5172
Conversation
Thanks for taking a first crack at this @jaypeedevlin @ehmartens!

I'm going to tag in the Language team to help here, since I think the real question is actually around the metadata that generic tests can make available within their materializations. Ideally, we'd have access to the rendered form of `{{ get_where_subquery(<ref_or_source>) }}`, i.e. that which gets passed in as the `{{ model }}` argument to generic test definitions.
@@ -3,12 +3,27 @@
{%- endmacro %}

{% macro default__get_test_sql(main_sql, fail_calc, warn_if, error_if, limit) -%}
{% set model_name = model.file_key_name.split('.')[1] %}
Two thoughts:

- For tests defined on models, this should work. For tests defined on sources, `file_key_name` is `sources.source_name` (and missing the name of the specific source table), so I don't think it would work as currently constituted. We could update `file_key_name` to include `sources.source_name.source_table`.
- If users have defined a `where` config, and we're calculating `%`, I think we would want to filter both the numerator and denominator by that `where` config. You could reproduce that logic here (`config.get("where")`), but all in all it seems preferable if there were some way to access the rendered version of `{{ model }}` (from the generic test query) or `{{ test_metadata.kwargs.model }}` (from the test node), since that's really the thing we want.
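To make the sources problem concrete, here is a small Python sketch (illustrative only; the `file_key_name` values mirror the shapes described above, not dbt internals):

```python
def model_name_from_file_key(file_key_name: str) -> str:
    # Mirrors the macro's model.file_key_name.split('.')[1]
    return file_key_name.split('.')[1]

# For a model, the second segment is the model name:
print(model_name_from_file_key("models.my_model"))    # my_model

# For a source, the second segment is the *source* name; the
# specific source table is missing from the key entirely:
print(model_name_from_file_key("sources.my_source"))  # my_source
```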
Great point on the `where` config, @jtcohen6, we totally overlooked that! We've added it in, but haven't pushed the change yet while we wait to hear from the Language team on how we might approach resolving the non-model resources appropriately.
{% macro get_pct_division(threshold, model_name) %}
{% if threshold[-1] == '%' %}
Could you imagine someone wanting to define a `%` failure threshold for a singular test (as opposed to generic tests)? I can't think of a case myself.
We took a look at that this morning. I think I probably could think of someone wanting to do this (in a rare case), but it wasn't clear to me why the current approach wouldn't work (given that thresholds should work in the same way). Could you help me understand how that might work?
In the case where a singular test depends on multiple nodes, I'm not sure how we can tell which node should represent the denominator.

I also think that, in a singular test, you're writing all the SQL yourself, so it's pretty trivial to add a `/ (select count(*) from total) * 100` and define `warn_if`/`error_if` with percents-as-integers.
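As a sketch of that percents-as-integers arithmetic (plain Python for illustration, not dbt code; the function name is made up):

```python
def pct_failures(failures: int, total: int) -> float:
    # Equivalent of appending `/ (select count(*) from total) * 100`
    # to a singular test's fail count.
    return failures / total * 100

# With warn_if defined as a percent-as-integer threshold of 5,
# 6 failures out of 100 rows would trigger a warning:
print(pct_failures(6, 100) > 5)  # True
```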
I played with some code in #5214. It doesn't handle the rendering bit yet, but I thought I'd get feedback on whether this is what we're thinking of.
I took another look at this, and we're creating the same strings in the generic test builder, and rendering them in `add_rendered_test_kwargs`. Then at compilation time they are added to the context with the top-level key `"_dbt_generic_test_kwargs"` and the key `model`. So I'm wondering if we can use that instead of making a new function. That rendered `model` string is used by `build_raw_sql` in the generic test builder, which builds the test SQL.
@gshank Really good point! This "just works" for me if I start storing … Granted, I'm taking a pretty simple approach here, and just overwriting the raw version:
models:
  - name: my_model
    columns:
      - name: id
        tests:
          - unique:
              warn_if: "=5%"
          - not_null:
              config:
                where: 1=1
                error_if: ">5%"

At runtime (using BigQuery):
{% macro get_pct_division(threshold, model_name) %}
{% if threshold[-1] == '%' %}
{% set threshold = threshold[:-1] %}
{% set pct_division = '/ (select count(*) from ' ~ ref(model_name) ~ ') * 100' %}
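The macro's threshold handling can be sketched in Python (an illustration of the same string logic, not the actual implementation):

```python
def split_pct_threshold(threshold: str):
    # Like the macro: a trailing '%' means the fail count should be
    # divided by the model's row count before comparison.
    if threshold.endswith('%'):
        return threshold[:-1], True   # e.g. ">5%" -> (">5", needs division)
    return threshold, False

print(split_pct_threshold(">5%"))  # ('>5', True)
print(split_pct_threshold("5"))    # ('5', False)
```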
Instead of `count(*)` always, should this be `{{ fail_calc }}` for the denominator as well? That won't work in cases where the generic test query synthesizes a new column for use in the `fail_calc`, which isn't present in the underlying `{{ model }}`, but it feels better than assuming it's always a simple count.
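A hedged sketch of what reusing `{{ fail_calc }}` in the denominator might look like (string-building in Python for illustration; the function name is invented and mirrors the `pct_division` string in the macro above):

```python
def pct_division(fail_calc: str, relation: str) -> str:
    # Build the division suffix using the test's fail_calc expression
    # instead of hard-coding count(*).
    return f"/ (select {fail_calc} from {relation}) * 100"

print(pct_division("count(*)", "my_model"))
print(pct_division("sum(n_records)", "my_model"))
```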
I'm going to close this PR for now, since we're not actively working on it.
Hi all, I found a workaround for this by writing a generic test in dbt and then referencing it in the yml files like any other ordinary relationships test. This one has a 1% error threshold and works for BigQuery. Feel free to change it as you see fit.
resolves #4723, in collaboration with @ehmartens

Description

Adds support for using percentages in the `error_if` and `warn_if` configurations.

We'd love some feedback on the following:
Checklist

- I have run `changie new` to create a changelog entry