Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

spike: inferred contracts #8339

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

Conversation

MichelleArk
Copy link
Contributor

@MichelleArk MichelleArk commented Aug 8, 2023

resolves #7432
docs dbt-labs/docs.getdbt.com/#

Problem

Solution

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX

🎩

model.sql

select 'test' as test, 2 as test2, 3 as test3

schema.yml

version: 2
models:
  - name: model1
    config:
      contract:
        enforced: true
        inferred: true
    columns: 
      - name: test3
        data_type: int
        constraints:
          - type: unique

👟

  1. Set enforced: false (& optionally comment columns out)
  2. dbt run to create a prod version of the model
  3. Set enforced: true (& optionally comment columns back in)
  4. dbt run --defer --state target/

@cla-bot cla-bot bot added the cla:yes label Aug 8, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2023

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

1 similar comment
@github-actions
Copy link
Contributor

github-actions bot commented Aug 8, 2023

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@codecov
Copy link

codecov bot commented Aug 8, 2023

Codecov Report

Merging #8339 (b8f10de) into main (d18a74d) will decrease coverage by 6.69%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #8339      +/-   ##
==========================================
- Coverage   86.35%   79.66%   -6.69%     
==========================================
  Files         174      174              
  Lines       25573    25577       +4     
==========================================
- Hits        22083    20376    -1707     
- Misses       3490     5201    +1711     
Flag Coverage Δ
integration 79.66% <100.00%> (-3.55%) ⬇️
unit ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed Coverage Δ
core/dbt/context/providers.py 87.51% <100.00%> (-1.15%) ⬇️
core/dbt/contracts/graph/manifest.py 90.01% <100.00%> (-3.76%) ⬇️
core/dbt/contracts/graph/model_config.py 86.88% <100.00%> (-6.93%) ⬇️
core/dbt/parser/schemas.py 88.75% <100.00%> (-4.00%) ⬇️

... and 101 files with indirect coverage changes

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as bad as I thought! Now we just need to think through, is it something we actually want?

Comment on lines +1385 to +1386
# TODO: avoid routing on args.which if possible
if getattr(self.model, "defer_relation", None) and self.config.args.which == "clone":
Copy link
Contributor

@jtcohen6 jtcohen6 Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely want to avoid the need for this. I had previously wondered if we'd be able to consolidate the deferral methods used by clone and other commands, by:

  • clobbering unselected resources with the production versions (to support "traditional" deferral)
  • adding defer_relation for selected resources only (to support clone)

(this comment/issue: #7965)

Edit: Michelle already made this change below. Then we can also refactor the clone task to use merge_from_artifact, and delete add_from_artifact.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussing live with @MichelleArk: We should overwrite compiled_code (also aliased as sql) within the clone task. Then we don't need to have this task-aware conditional logic live in the context method.

{% set prod_columns = get_column_schema_from_query(get_empty_subquery_sql("select * from " ~ defer_relation)) %}
{% set raw_columns = {} %}
{% for column in prod_columns -%}
{# TODO: reconsider this mutating operation, could be done with an additional warehouse call #}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 At minimum, we should probably be deepcopying to avoid mutating the user-provided value stored in the context... right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can either:

  • plumb it through (but this takes some time, and tightly couples the methods)
  • make the warehouse call twice (and consider caching/optimization possibilities in the future)

@MichelleArk MichelleArk changed the title spike: inferred + partial contracts spike: inferred contracts Aug 10, 2023
@jtcohen6
Copy link
Contributor

jtcohen6 commented Sep 6, 2023

Ah, I knew I'd been forgetting something. From the issue:

Crucially, we should allow column additions, but raise a ContractBreakingChangeError if an existing column is renamed/removed/retyped.

I think this is two things.

(1) Adding new columns

Right now, to add a column to a model with an inferred contract, you need to explicitly define its name + data type. Otherwise, you get a ContractError.

13:35:49    Compilation Error in model model1 (models/model1.sql)
  This model has an enforced contract that failed.
  Please ensure the name, data_type, and number of columns in your contract match the columns in your model's definition.

  | column_name | definition_type | contract_type | mismatch_reason     |
  | ----------- | --------------- | ------------- | ------------------- |
  | test4       | INT             |               | missing in contract |

I think we have two options:

  • (a) We figure out some way (not sure how) to combine the column names/types inferred from prod with the column names/types inferred from get_column_schema_from_query. The problem then is we'd need to do type translation between the cursor types and the DWH types (gross).
  • (b) We say, c'est comme ça. Once the model has an enforced contract, all new columns must be declared explicitly. (Once the updated model has been run in prod, you could then remove those column specifications, and have them inferred instead.) I'm leaning toward this option.

(2) Detecting breaking changes

We also need to update our logic for detecting ContractBreakingChangeError. Basically, we cannot rely on the contract being fully specified within the static yaml config. So for any models with an "inferred" contract, I think we need to:

  • No-op the same_contract check within state:modified
  • Instead perform the detailed comparison while the model is being run. If there's a column in prod that's missing in dev/CI, or the same column name with different data types, that should be a ContractBreakingChangeError. (Edit: Need to figure out if we can reliably detect data type diffs for columns that were previously inferred, and are now being defined explicitly.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-2460] [Feature] Infer schema from prod, to enforce contract & detect breaking changes in dev
2 participants