-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure extract calls return consistent results across engines #812
Conversation
Our initial date_part implementation essentially assumed all engines had the same behavior for various date part extraction calls, but this is not the case, as things like day of week and year can vary from engine to engine depending on whether or not they hew to the ISO standards for such things. This commit adds some engine tests for SqlExtractExpression so we can track and enforce consistency across engine types.
Support for date_part `week` is fraught with peril, as there is no good standard way of extracting a week number from a date. Because we hew to the calendar standard for date_part `year` (as opposed to using ISO year, which defines the year start as the Monday of the week containing the first Thursday in January) it would be odd to use ISO week of year, which is based on the ISO year. However, no other standard exists, and so we simply remove the property from the enum and thereby drop support for it. We can add it later and apply a well-defined behavior to it, if necessary. For more details on the mayhem please refer to the docblock changes for the DatePart enum that accompany this commit.
Databricks, DuckDB, Postgres, and Snowflake all have formal support for extracting a day of week value that conforms to the ISO standard of 1 (Monday) - 7 (Sunday). This commit uses that support and enables the new ISO behavior for those engines. BigQuery and Redshift, which require custom rendering for day of week extraction, will be updated separately.
This completes the ISO normalization of day of week results. Now all engines should return the ISO standard 1 (Monday) - 7 (Sunday) values for day of week extraction.
Current dependencies on/for this PR: This comment was auto-generated by Graphite. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for getting this all sorted out! Looks good to me - left one comment but nothing blocking.
return extract_rendering_result | ||
|
||
extract_stmt = extract_rendering_result.sql | ||
case_expr = f"CASE WHEN {extract_stmt} = 0 THEN {extract_stmt} + 7 ELSE {extract_stmt} END" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason you use IF()
for some engines but CASE WHEN
for others?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, would it be better to define SqlExpressionNodes
for IF()
and CASE WHEN
statements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used IF
because it's shorter, but it's not ANSI standard so not all engines support it. That's probably a mistake, maybe I'll update it before merging.
As for the SqlExpressionNodes - yes, long term I think that's what we should do. We can have a SqlCaseExpressionNode that does nicely formatted case statements, that seems broadly useful. I didn't bother here because I couldn't figure out how to cleanly transform the extract node without doing it outside of the rendering layer. There are a bunch of options to think through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, after getting annoyed by the visitor pattern yet again (I think I really went over the edge with #819 ) I've come to the conclusion that using a CASE expression node has to happen inside the renderer.
I'm going to just merge this as-is and then I'll follow up with a CASE statement node and see if I can hijack the rendering accordingly. If I can't make it nicely formatted I'll just stick with what's here until we push an update to reformat our SQL to dbt-like standards - given the bad formatting we have already the shorter IF seems fine for BigQuery even if it's not consistent with the CASE/WHEN we need to use for other engines.
The DATE_PART processing and SQL
extract
expression rendering initiallydid whatever the engines defaulted to doing. Unfortunately, this leads to
divergent results across engine contexts, particularly for day of week (DOW)
extraction.
There are three particular areas of concern here:
Day of week - the ISO standard for day of week numbering is number the
days of the week from 1 (Monday) to 7 (Sunday). Weeks always start on Mondays.
Individual engines do not follow any fixed standard for this by default, with BigQuery
returning 1 (Sunday) through 7 (Saturday) and most engines returning 0 (Sunday)
through 6 (Saturday).
Year - the Gregorian calendar has the year starting on the first of January
and ending on the 31st of December. The ISO standard declares that years start
on Monday of the week containing the first Thursday in January, so 2014-12-29 is
considered the start of 2015. Engines, happily, all agree that people typically want
the calendar year.
Week of year - the ISO standard week of year flows naturally from their choice
to start all years on Mondays. What's unnatural is the year boundary itself. So
2014-12-30 will have a week number of 1, while 2016-01-01 will have a week number
of 53. Engines have no standard - they may return a value between 0-53, 1-54, or 1-53
(same as ISO's 1-53, but with potentially different boundaries).
In order to ensure that MetricFlow returns consistent results for date/time
extraction operations, we have made the following decisions:
ISO standard week start of Monday, and the ISO standard 1-7 numbering.
This PR adds some consistency testing to ensure we have consistent behavior, and
updates our extract implementation to bring all of the engines in line with our standard.