-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ADAP-548] [Feature] Flag to opt out of order by
when cluster keys are specified
#606
Comments
order by
when cluster keys are specifiedorder by
when cluster keys are specified
The 💡 While we wouldn't take it lightly, we're open to changing the current behavior that adds For context, discussion surrounding the original clustering implementation for dbt SQL models is in these pull requests: A key thing for us to do would be to read those earlier discussions to thoroughly understand their rationale. In particular, I think we'd need to understand the performance implications of the (8) combinations of these scenarios:
Could you give a read though the comments in those pull requests (dbt-labs/dbt-core#1591 in particular) and let me know how it affects your thinking? I need to do a more thorough reading over there as well. |
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days. |
Commenting to leave this open, I intend to respond soon! |
Adding a comment here to bump this as we have run into the same issue as @jaypeedevlin. @dbeatty10, I might be missing something, but why would the performance implications be a blocker if the suggestion is to add an optional flag that allows users to alter the default behavior? I think @jaypeedevlin is correct that removing the Edit: Also, from what I can tell, there's no ordering that's done when the incremental is done with the default temp |
@dbeatty10 I have quickly reviewed the old pull requests and feature requests. I believe the decision to add the "order by" to the cluster key config was an over optimization that was influenced by old snowflake clustering behavior and best practices of other users. I do not see any reason the "order by" should be forced upon a normal DBT user and it should at the very least be optional. I don't have historical Snowflake documentation in front of me but I believe at some point they recommended "clustering" your own tables by sorting data on insert. Around the time the cluster key feature was added to dbt for Snowflake, a big change was made in how Snowflake clustering works, which was the removal of manual reclustering. Review of the github discussion seems to suggest the "order by" was added in an attempt to supplement the removal of the manual "recluster" command by manually "clustering" the data before it was inserted into the new table by first sorting it. Snowflake clustering now works through background tasks operated by Snowflake. They call this auto clustering. Snowflake does not expect data to be pre-sorted or clustered when creating or inserting into a clustered table. Snowflake's background process will recluster the table at an indeterminate time later. Theoretically there could be a use case where you could improve query performance between the insert/creation time and the reclustering process execution by pre-sorting. This is very niche though and I do not believe this should be forced upon the user. The result of the current behavior is that DBT is effectively front-loading the work of clustering by first sorting the data. This can potentially be a very expensive operation. Snowflake auto clustering will automatically recluster and sort this data later (most likely more efficiently and cheaper) so there is no reason for DBT to do so. This is no longer a recommended practice by Snowflake and there is normally no reason for a user to do this unless they are very concerned about query performance before reclustering occurs. Additionally from a user experience perspective, I believe adding an expensive operation such as an "order by" behind a seemingly unrelated feature is deceptive to the user. Maybe I am biased from my own experience but I am confident that there are many DBT users that have suffered from poor performance due to this "order by" and have no idea it was even added to their query. |
Snowflake automatic clustering documentation: https://docs.snowflake.com/en/user-guide/tables-auto-reclustering |
Still waiting on feedback on this issue. |
It seems like we want to change the default eventually to not perform the order by. We would want to do this with a combination of Behavior Flags and a new model/profile config where we default to the existing behavior in the new config and then change the behavior to default to not performing the order by |
Is this your first time submitting a feature request?
Describe the feature
When we add cluster keys to an incremental model, there is an
order by
added to the model SQL that sorts the model before the merge.With extremely large tables (ie those most likely to require clustering), this 'front-loads' some of the clustering into the model run itself. I find this undesirable for two reasons:
I am proposing to add a flag, where the default value is to keep things as is, and when the flag is
True
that theorder by
is not added to the model. I do actually think that a change in default behaviour would be desirable for most folks, but I feel like I'm more likely to get a greenlight on a version of this that doesn't alter behavior.Describe alternatives you've considered
Overriding the macro or creating a custom materialisation.
Who will this benefit?
Folks with big data™.
Are you interested in contributing this feature?
Yes!
Anything else?
No response
The text was updated successfully, but these errors were encountered: