Defining vars, folder-level configs outside dbt_project.yml #2955

Open
jtcohen6 opened this issue Dec 15, 2020 · 28 comments
Labels
enhancement New feature or request paper_cut A small change that impacts lots of users in their day-to-day Refinement Maintainer input needed vars yaml

Comments

@jtcohen6
Contributor

Describe the feature

From @benjaminsingleton:

I’d like to use project level variables more, but I’m concerned about bloat to my already large dbt_project.yml file. I think it would be helpful if I could create a variables.yml file that could be imported in dbt_project.yml. And for that matter, the same could be done for other configurations in the dbt_project.yml file. I think having the ability to separate configurations into different files might make for improved modularity / separation of concerns (particularly for large projects), not to mention fewer merge conflicts. CC @jrandrews

Describe alternatives you've considered

  • We're already thinking of enabling some configs in resource-YAML files (Set configs in schema.yml files #2401), but these would be at the level of the individual resource (model/seed/snapshot/etc) only
  • The dbt_project.yml gets really really big??

Additional context

  • I don't think this has any correspondence to v1.0. It's a nice thing to have, and we could do it before, after, any time without it being a breaking change in any way.

Who will this benefit?

  • Developers and maintainers of increasingly big dbt projects
@jtcohen6 jtcohen6 added the enhancement New feature or request label Dec 15, 2020
@codigo-ergo-sum

codigo-ergo-sum commented Dec 16, 2020

Some additional thoughts/failure modes/concerns to think about with this:

  1. In large dbt projects, one problem with project-wide variables is the potential for developers to "step" on each other by editing or overwriting or conflicting with each other's var declarations. If devs were being meticulous in looking at the project-wide relevance of a given var, then this might happen less or not at all, but that is unfortunately often not the case. If we allow variable declaration (and other things) outside of just one file (say dbt_project.yml), and it is an arbitrary number of files, then I can see Dev1 defining my_var_p in variable_file_1.yml and then Dev2 defining my_var_p in variable_file_2.yml. I suppose/hope that dbt would detect that and throw an error but there are still some clunky workflow issues in allowing variable declarations in multiple different .yml files.
  2. Vars need to be parsed before other .yml files for declarations around models, tests, etc. are parsed. Right now this problem is handled by having one hard-coded .yml file (that is, dbt_project.yml) to be parsed before the other .yml files, but if we loosen this then dbt still needs a way to be able to determine how/what to parse in "pass 1" of parsing for vars (and I am sure a lot of other things that other, smarter people than I know already happen first :) ) versus "pass 2" of parsing for other things like tests, models, etc. And not just dbt -- this understanding of what .yml file gets parsed when needs to be not-too-hard to quickly understand for average devs. Otherwise people will be just littering random var declarations mixed in with tests and model config and then getting confused why things don't work.

One thought that comes to me - what if we had another separate set of subdirectories that were specifically dedicated to .yml files for config, like we already have a subdirectory declaration/space for snapshots, models, etc. Only things related to/extracted from dbt_project.yml could be put in there, and any other things related to models, snapshots, etc., would trigger a parsing error. This doesn't fix problem 1 above but it at least helps with problem 2. What say ye?

P.S. Also, I know var namespacing was removed for dbt_project.yml v2 config in .17 for some good reasons but it's also pretty hard not to have any way to do variable scoping in larger projects.

@jtcohen6
Contributor Author

@codigo-ergo-sum I don't think I've ever seen you post from this alt account before. It goes without saying that I like the handle.

what if we had another separate set of subdirectories that were specifically dedicated to .yml files for config, like we already have a subdirectory declaration/space for snapshots, models, etc.

This is along the lines of what I was thinking: either an explicit set of subdirectories, or an explicit set of named files. I've been around just long enough to remember when packages was a special dict in dbt_project.yml rather than its own file; we split it out because we expected it to grow in size, and because it served a distinct purpose. We made the same choice for selectors.yml.

I'd be especially keen on a vars.yml: variables have a slightly different parsing context, we can be strict about accepting only literal values, and we could even do a better job of parsing vars.yml before parsing dbt_project.yml. That would make default values of vars called in dbt_project.yml work the way folks expect, rather than how it is today. I like that correspondence between vars.yml and CLI --vars, similar to how env vars can be sourced from an *.env file or prepended to a CLI command.
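
To make the ordering concrete, here is a minimal Python sketch of the two-pass idea. It is not dbt's actual loader: `render` is a toy regex stand-in for Jinja rendering, and `load_project`, `vars_file`, `cli_vars`, and the `stage` / `target_schema` names are all hypothetical. Pass 1 resolves variables (with CLI values overriding file values); pass 2 renders project settings using them, so defaults only apply when a variable is truly unset.

```python
import re

# Toy stand-in for Jinja: matches {{ var('name') }} and {{ var('name', 'default') }}.
VAR_CALL = re.compile(
    r"\{\{\s*var\(\s*'([^']+)'\s*(?:,\s*'([^']*)'\s*)?\)\s*\}\}"
)


def render(value, variables):
    """Substitute var() calls in one string using already-resolved variables."""
    def substitute(match):
        name, default = match.group(1), match.group(2)
        if name in variables:
            return str(variables[name])
        if default is not None:
            return default
        raise KeyError(f"Required var '{name}' not found")
    return VAR_CALL.sub(substitute, value)


def load_project(vars_file, cli_vars, raw_project):
    """Hypothetical loader: variables first, project settings second."""
    # Pass 1: resolve variables before touching the project file.
    variables = {**vars_file, **cli_vars}  # CLI values override file values
    # Pass 2: render each project setting with the resolved variables.
    return {key: render(val, variables) for key, val in raw_project.items()}
```

With `vars_file={"stage": "dev"}` and a setting such as `"stages/{{ var('stage') }}/models"`, the second pass sees the value from the vars file without anything being passed on the command line.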

Configurations feel a bit trickier, because these can be especially verbose. How to coordinate hierarchies across multiple files without someone tripping over someone else? I'm honestly not sure. The cleanest separation I can envision would be allowing a project to have one each of models.yml, seeds.yml, etc.

P.S. Also, I know var namespacing was removed for dbt_project.yml v2 config in .17 for some good reasons but it's also pretty hard not to have any way to do variable scoping in larger projects.

This is fair. I wonder if the ability to scope vars differently for different model subsets may ultimately serve as a valid reason to split very big projects up into multiple sub-projects, installed as packages. That's regardless of whether they live in the same or separate repositories.

@codigo-ergo-sum

Thanks for the compliment on the username @jtcohen6 :).

I think a vars.yml file would be a definite improvement over the current situation.

Would it be required or could vars also still be defined in dbt_project.yml? If required then that probably requires a new version 3 of the schema version for dbt_project.yml which is a significant change for existing users, right?

If not required, then what would the behavior look like if vars are defined in both places, and if they conflict? And are you suggesting that vars.yml would be parsed before dbt_project.yml is parsed? Allowing full, "no-gotcha" usage of vars in dbt_project.yml?

@danielefrigo
Contributor

I absolutely support the idea of parsing the vars before the dbt_project.yml.
This would enable leveraging vars in many additional ways, e.g. to enable or disable subfolders or defining schemas from vars, without losing the ability to simply run a model using the vars default values.

@moltar

moltar commented Oct 25, 2021

Having outside vars available in dbt_project.yml would be a huge improvement.

We have a complex dbt_project.yml, with lots of repetition and using Jinja a lot.

For example:

source-paths:
  - modules/shared/models
  - modules/module1/models
  - modules/module2/models
  - stages/{{ env_var('DBT_STAGE', '@fake@') }}/models

Then enabling a particular stage via env var during deployment.

Would be great to set the stage once in the vars file, and then just use the var itself in the config. Also be able to define module names / prefixes, or even an entire array of modules to loop over.

@krazavet-tinyclues

Hello 👋

In our company, we use DBT a lot in a multi-tenant context. For that purpose, we rely heavily on DBT variables to propagate the client configuration. Those configurations can differ considerably from one client to another.
Sometimes we have hit the error "argument list too long: dbt", which is due to the large config payload (some payloads exceed 600Kb).

We have not found a proper workaround so far. Passing a file path instead of a payload for our variables would probably solve our issue. This is why we are keen to know whether there is any chance you would consider such a feature for DBT. (cc @jtcohen6)

Thank you in advance 🙏

@ybressler

++ this feature. The solution implemented by Jekyll (with _data directory) comes to mind as suitable.

@ciprian-mandras

Hi all,

I agree that a vars.yml would be a good boost, but it would only give you global variables. Based on my experience, I think you should also consider a solution for local variables: some way for a model configuration to reference a model_vars.yml and use the variables for that specific model from there.

Thank you.

@itechprasanth

I agree. Hope this will get implemented soon, as it is always good practice to modularize the configurations rather than having everything in the same single file.

@jtcohen6 jtcohen6 added paper_cut A small change that impacts lots of users in their day-to-day Refinement Maintainer input needed labels Nov 28, 2022
@vitorefazevedo

Does dbt still only allow global variables defined in dbt_project.yml?

@codigo-ergo-sum

Just looking through issue backlogs and wanted to bump this... Would be great as we are working with projects that have tens or even hundreds of variables now. Also the lack of ability to namespace them is still challenging.

@apolorei

apolorei commented Jun 2, 2023

I'm also facing this problem and would very much love some ideas of how to tackle it!

@timvw

timvw commented Jun 5, 2023

Currently we're trying to work around this issue by using environment variables (with tooling via direnv and a .envrc file).

@jeremyyeo
Contributor

Sneaky workaround whilst waiting for this to be built into core. Basically, move var declarations into macro files:

https://gist.github.com/jeremyyeo/06d552ee8facc8100416655ebc25d9b9

@krazavet-tinyclues

krazavet-tinyclues commented Jun 12, 2023

Sneaky workaround whilst waiting for this to be built into core. Basically, move var declarations into macro files:

https://gist.github.com/jeremyyeo/06d552ee8facc8100416655ebc25d9b9

This is exactly what we started to POC in our DBT stack: using a dedicated macro file to load a bigger JSON payload.
The idea is to generate a macro file containing all DBT variables. In the end it should look something like this 👇

{% macro get_config() %}
  {{ return(fromjson("<JSON_CONTENT_HERE>")) }} 
{% endmacro %}

And then you can use it in your model:

{% set some_var = get_config().get(...) %}

That's a workaround that should do the job.
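
For anyone who wants to script the generation step, here is a minimal Python sketch of producing that macro file from a config payload. The function name `build_config_macro` and the sample payload are hypothetical. The sketch also assumes that a Jinja string literal accepts the same backslash escapes that JSON produces, which is worth verifying against a real dbt run before relying on it.

```python
import json

# Doubled braces are literal braces in str.format; {literal} is the placeholder.
MACRO_TEMPLATE = """{{% macro get_config() %}}
  {{{{ return(fromjson({literal})) }}}}
{{% endmacro %}}
"""


def build_config_macro(payload):
    """Return the text of a macro file that embeds `payload` as a JSON string."""
    # First dumps: payload -> JSON text.
    # Second dumps: JSON text -> quoted, backslash-escaped string literal
    # that can sit inside the fromjson(...) call.
    literal = json.dumps(json.dumps(payload))
    return MACRO_TEMPLATE.format(literal=literal)


if __name__ == "__main__":
    # Hypothetical payload; print the macro text instead of writing a file.
    print(build_config_macro({"client": "acme", "limits": {"rows": 10}}))
```

The returned text can be written to a file under the project's macro path (for example a hypothetical `macros/get_config.sql`) as part of the deployment step, which keeps the large payload out of the command line entirely.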

@markproctor1

markproctor1 commented Aug 7, 2023

Folder-level configs would make a huge difference on my project. With 50+ developers and growing we don't want anyone to modify project-level files day-to-day, but we do want them to manage many files and folders in their subject area.

Folder-level configs would do this. Clearly the need is there, which is why they are featured in dbt_project.yml, but this causes governance and git conflict problems when many teams are trying to make changes to project-level files at the same time.

Basically, I need to treat our subject areas as mini-projects, each mini-project having its own configuration.

@dbeatty10
Contributor

Within #8869, @slotrans described var() not being able to see vars defined in dbt_project.yml for the purposes of configuring query-comment.

If this feature request were added, then it would solve that use-case.

Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Apr 16, 2024
@rlh1994
Contributor

rlh1994 commented Apr 16, 2024

[Image: "Apu Jumps" GIF]

@epapineau
Contributor

@jtcohen6 & @dbeatty10 - are y'all open to contributions on this one?

@mroy-seedbox

@ciprian-mandras: Regarding this:

I agree that a vars.yml would be a good boost, but it would only give you global variables. Based on my experience, I think you should also consider a solution for local variables: some way for a model configuration to reference a model_vars.yml and use the variables for that specific model from there.

Couldn't you just use model configs for that? In our project, we put all sorts of stuff in the meta key all the time.

Like this:
some_model.yml

models:
- name: some_model
  description: Something something
  config:
    tags: [tag1, tag2]
  meta:
    key1:
      key_x: value
      key_y: value
      key_z: value
    key2:
      key_a: value
      key_b:
        key: value
    key3: value
    key4: value
    key5: value

@mroy-seedbox

mroy-seedbox commented Sep 25, 2024

For everyone talking about namespacing variables, couldn't you just do it within a dict variable?

Like this:

vars:
  defaults:
    key1: value_a
    key2: value_b
    key3: value_c
  namespace1:
    key1: value1
    key2: value2
  namespace2:
    key1: value_x
    key2: value_y

And then you could retrieve those with an alternative macro (instead of using var('name')). Something like this:

{%- macro ns_var(namespace, key, default = None) -%}
  {%- if default == None and key not in var(namespace) and key not in var("defaults") -%}
    {{ exceptions.raise_compiler_error("Missing variable '" ~ key ~ "' in namespace " ~ namespace) }}
  {%- endif -%}
  {{- return(var(namespace).get(key, var("defaults").get(key, default))) -}}
{%- endmacro -%}

And call it like:

{{ ns_var('namespace1', 'key1') }} -- "value1"
{{ ns_var('namespace1', 'key3') }} -- "value_c"
{{ ns_var('namespace1', 'key4', 'default') }} -- "default"
{{ ns_var('namespace1', 'key4') }} -- Error

Or you could be fancy and do stuff like:

{{ ns_var('namespace1.key1') }} -- Although `default` remains a separate param, which is weird.
{{ ns1_var('key1') }} -- Hardcoded namespace inside of this macro.

@mroy-seedbox

mroy-seedbox commented Sep 25, 2024

@codigo-ergo-sum: Regarding this:

I think a vars.yml file would be a definite improvement over the current situation.

Would it be required or could vars also still be defined in dbt_project.yml? If required then that probably requires a new version 3 of the schema version for dbt_project.yml which is a significant change for existing users, right?

If not required, then what would the behavior look like if vars are defined in both places, and if they conflict? And are you suggesting that vars.yml would be parsed before dbt_project.yml is parsed? Allowing full, "no-gotcha" usage of vars in dbt_project.yml?

I think multiple var files could even easily be supported. The only validation dbt would have to do is to make sure the same variable (i.e. top-level key/namespace) does not exist in more than one file (including dbt_project.yml, if any variables are still defined there). Otherwise, dbt should produce an error. That's it. The end.

Currently that is supported (although it's probably just how PyYAML loads the file), but it probably shouldn't be (and it's a reasonable breaking change as it's a very easy fix: just remove the duplicate):

vars:
  key: 123 # Just delete this one, it does nothing.
  key: 456 # Right now, this one "wins".

So we could end up with something like vars_abc.yml:

vars:
  abc:
    key: value
    ...

And vars_xyz.yml:

vars:
  xyz:
    key: value
    ...

But not vars_qrs.yml:

vars:
  abc: # Can't use `abc` again!
    key: value
    ...

And since those are specifically var files, we wouldn't even need that top-level vars: key.
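
As a rough illustration of that validation, here is a minimal Python sketch. It assumes each var file has already been loaded into a plain dict of top-level names; `merge_var_files` is a hypothetical helper, not an existing dbt function. It merges the files and raises as soon as the same top-level name shows up in a second file.

```python
def merge_var_files(files):
    """Merge {filename: {var_name: value}} and reject names defined twice."""
    merged = {}
    owner = {}  # var name -> file that first defined it
    for filename, variables in files.items():
        for name, value in variables.items():
            if name in owner:
                raise ValueError(
                    f"Variable '{name}' is defined in both "
                    f"{owner[name]} and {filename}"
                )
            owner[name] = filename
            merged[name] = value
    return merged


if __name__ == "__main__":
    # vars_abc.yml and vars_xyz.yml merge cleanly; adding vars_qrs.yml would raise.
    print(merge_var_files({
        "vars_abc.yml": {"abc": {"key": "value"}},
        "vars_xyz.yml": {"xyz": {"key": "value"}},
    }))
```

Variables still defined in dbt_project.yml could be passed in as just another entry in `files`, so the same check covers them too.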

@mroy-seedbox

mroy-seedbox commented Sep 25, 2024

Folder-level configs would make a huge difference on my project. With 50+ developers and growing we don't want anyone to modify project-level files day-to-day, but we do want them to manage many files and folders in their subject area.

Folder-level configs would do this. Clearly the need is there, which is why they are featured in dbt_project.yml, but this causes governance and git conflict problems when many teams are trying to make changes to project-level files at the same time.

Basically, I need to treat our subject areas as mini-projects, each mini-project having its own configuration.

@markproctor1: I think scattering variables/files everywhere is a terrible idea/bad practice. What do you think of the namespace idea suggested above instead?

And if dbt added support for multiple var files (all in one specific place), each of your subject areas/mini-projects could have its own file & namespace. No more merge conflicts! 🙌

EDIT: Although, thinking about it again now... scattering var files all over the place could be acceptable if dbt_project.yml had a config like this:

var-paths:
  - variables.yml
  - team1/variables.yml
  - team2/some_folder/variables.yml
  - ...

But then the variable files couldn't be loaded before dbt_project.yml, which sounded like a nice advantage to have. So probably still not a good idea to scatter var files around. 😅

Instead, we could just have a simple vars folder, which should be enough for 99.9% of dbt users:

vars/globals.yml
vars/team1.yml
vars/team2.yml

EDIT 2: UNLESS... dbt could also add a new paths.yml file, which would be loaded first. Then the var-paths would be loaded. And then dbt_project.yml and the rest of the stuff would be loaded.

But all that would be completely optional (like a power/advanced feature), and only enabled if paths.yml exists.

And for those who want to use variables inside of their paths for certain things (models paths, etc.), then just don't put those paths inside of paths.yml! 🙈 Keep them inside of dbt_project.yml, which should now have access to all the project variables.

@mroy-seedbox

This feature is also highly related to #4873.

@graciegoheen
Contributor

graciegoheen commented Dec 12, 2024

Problems

It sounds like there are three problems being discussed here:

  1. "I want to define project-wide variables outside of my dbt_project.yml."
  2. "I want to be able to reference variables in my dbt_project.yml at parse-time."
  3. "I want to have name-spaced variables."

Potential Acceptance Criteria for 1 & 2

  • I can define project-wide variables ("global variables") in a vars.yml file, separate from my dbt_project.yml

    # vars.yml

    vars:
      start_date: '2016-06-01'

  • dbt parses vars.yml before parsing dbt_project.yml, meaning default values of vars called in dbt_project.yml would work the way folks expect, rather than how it is today (equivalent to how this works when defining vars through the CLI a la --vars)

Option 1 - behavior change

  • You cannot define variables in both places; you must choose the "new" or "legacy" way
  • As such, this would need to be behind a behavior change flag
  • Open question: Given that behavior change flags are set in dbt_project.yml, would this even be possible?

Option 2 - backwards compatible

  • You can define variables in either place
  • Open question: Would we still get the parsing benefit if we went this route?

Current workaround: Have a really massive dbt_project.yml file with no ability to reference variables in your project configurations

Potential Acceptance Criteria for 3

  • You can define name-spaced variables ("local variables", spec tbd) that would compile differently based on where they're being used
  • Open question: Is this folder-based? Model-based?

Current workaround: Split up your project into multiple sub-projects with their own variables - we already have project-based name-spacing

Next Steps

I think there's more clarity on what needs to be done for problems 1 & 2 - so we should go ahead and create an implementation issue for that set of acceptance criteria.

For problem 3, I think this needs to be baked more - but am open to someone starting up a Github discussion if so inspired!

@codigo-ergo-sum

@graciegoheen great to see you post on a 3-year running discussion now :) (and just 3 years in this particular ticket, I think it's been ongoing in a notional sense since the beginning of dbt.)

Regarding #3, what's interesting is that, years ago, the ability to namespace variables in dbt_project.yml did exist and it was removed. I don't remember the reason why it was removed, however. @jtcohen6 was involved in it though. Jeremy, any recollection on why it was removed? I'm just wondering if the reasoning behind that may still be germane to the discussion.

@mroy-seedbox

mroy-seedbox commented Dec 14, 2024

@graciegoheen: Option 2 makes sense to me! 🙌

  • Treat variables in vars.yml the same as vars defined on the CLI.
  • Keep treating variables in dbt_project.yml the same.
  • Whatever is in vars.yml overrides what is in dbt_project.yml (just like CLI vars).
    • Although people should really avoid defining the same variable in two places, so this should probably produce an error instead. Otherwise it could definitely cause confusion (whereas CLI vars are more easily understood as an override).

Actually, this is even better: we can actually combine both options!

  • If vars.yml exists, produce an error if there are any variables still defined in dbt_project.yml.
  • This would remain backward-compatible without the need for a change flag (i.e. the "flag" would be the presence of vars.yml or not).

Would we still get the parsing benefit if we went this route?

  • The parsing benefits could be enabled only if the vars are defined in vars.yml instead of dbt_project.yml (i.e. if the "flag" is on).
  • This would also encourage people to migrate their vars from dbt_project.yml over to vars.yml.
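
To pin down the two behaviors being discussed, here is a minimal Python sketch under stated assumptions: all three sources are already loaded as dicts, `vars_file=None` stands for "vars.yml does not exist", and the function name `resolve_vars` and its `strict` switch are hypothetical. With `strict=True` it implements the combined option (an error if vars.yml exists and dbt_project.yml still defines vars); with `strict=False` it implements the plain override order.

```python
def resolve_vars(project_vars, vars_file=None, cli_vars=None, strict=True):
    """Combine variable sources; vars_file=None means vars.yml is absent."""
    cli_vars = cli_vars or {}
    if vars_file is None:
        # Legacy behavior: dbt_project.yml vars, overridden by CLI --vars.
        return {**project_vars, **cli_vars}
    if strict and project_vars:
        # The presence of vars.yml acts as the "flag": no vars may remain
        # in dbt_project.yml.
        raise ValueError(
            "vars.yml exists, so vars in dbt_project.yml must be removed: "
            + ", ".join(sorted(project_vars))
        )
    # Non-strict: vars.yml overrides dbt_project.yml, and CLI overrides both.
    return {**project_vars, **vars_file, **cli_vars}
```

Either way, CLI values stay the final override, which matches how `--vars` is already understood today.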

Side note: We shouldn't need the top-level vars: key in vars.yml, since it should be understood that the entire file is for variables only.

# vars.yml
start_date: '2016-06-01'

And for Problem #3 (namespaced variables), I would personally keep that out of scope. There are many existing workarounds already, as enumerated in this thread. But supporting multiple var files (i.e. a vars folder?) could definitely be useful in order to help keep things clean (and those who want to namespace them could use one file per namespace).
