Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure extract calls return consistent results across engines #812

Merged
merged 4 commits into from
Oct 25, 2023

Conversation

tlento
Copy link
Contributor

@tlento tlento commented Oct 16, 2023

The DATE_PART processing and SQL extract expression rendering initially
did whatever the engines defaulted to doing. Unfortunately, this leads to
divergent results across engine contexts, particularly for day of week (DOW)
extraction.

There are three particular areas of concern here:

  1. Day of week - the ISO standard for day of week numbering is number the
    days of the week from 1 (Monday) to 7 (Sunday). Weeks always start on Mondays.
    Individual engines do not follow any fixed standard for this by default, with BigQuery
    returning 1 (Sunday) through 7 (Saturday) and most engines returning 0 (Sunday)
    through 6 (Saturday).

  2. Year - the Gregorian calendar has the year starting on the first of January
    and ending on the 31st of December. The ISO standard declares that years start
    on Monday of the week containing the first Thursday in January, so 2014-12-29 is
    considered the start of 2015. Engines, happily, all agree that people typically want
    the calendar year.

  3. Week of year - the ISO standard week of year flows naturally from their choice
    to start all years on Mondays. What's unnatural is the year boundary itself. So
    2014-12-30 will have a week number of 1, while 2016-01-01 will have a week number
    of 53. Engines have no standard - they may return a value between 0-53, 1-54, or 1-53
    (same as ISO's 1-53, but with potentially different boundaries).

In order to ensure that MetricFlow returns consistent results for date/time
extraction operations, we have made the following decisions:

  1. For YEAR granularities and date part values we use the calendar year.
  2. For WEEK granularities and DAY OF WEEK date part values we use the
    ISO standard week start of Monday, and the ISO standard 1-7 numbering.
  3. For the time being, we will not support WEEK OF YEAR date part requests

This PR adds some consistency testing to ensure we have consistent behavior, and
updates our extract implementation to bring all of the engines in line with our standard.

Our initial date_part implementation essentially assumed all engines
had the same behavior for various date part extraction calls, but this
is not the case, as things like day of week and year can vary from
engine to engine depending on whether or not they hew to the ISO
standards for such things.

This commit adds some engine tests for SqlExtractExpression so we
can track and enforce consistency across engine types.
Support for date_part `week` is fraught with peril, as there is
no good standard way of extracting a week number from a date.

Because we hew to the calendar standard for date_part `year`
(as opposed to using ISO year, which defines the year start as
the Monday of the week containing the first Thursday in January)
it would be odd to use ISO week of year, which is based on the
ISO year. However, no other standard exists, and so we simply
remove the property from the enum and thereby drop support for
it. We can add it later and apply a well-defined behavior to it,
if necessary.

For more details on the mayhem please refer to the docblock
changes for the DatePart enum that accompany this commit.
Databricks, DuckDB, Postgres, and Snowflake all have formal support
for extracting a day of week value that conforms to the ISO standard
of 1 (Monday) - 7 (Sunday).

This commit uses that support and enables the new ISO behavior
for those engines. BigQuery and Redshift, which require custom
rendering for day of week extraction, will be updated separately.
This completes the ISO normalization of day of week results. Now
all engines should return the ISO standard 1 (Monday) - 7 (Sunday)
values for day of week extraction.
@cla-bot cla-bot bot added the cla:yes label Oct 16, 2023
Copy link
Contributor Author

tlento commented Oct 16, 2023

Current dependencies on/for this PR:

This comment was auto-generated by Graphite.

@tlento tlento linked an issue Oct 16, 2023 that may be closed by this pull request
2 tasks
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS October 16, 2023 00:24 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS October 16, 2023 00:24 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS October 16, 2023 00:24 — with GitHub Actions Inactive
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS October 16, 2023 00:24 — with GitHub Actions Inactive
@tlento tlento requested a review from aiguofer October 16, 2023 00:29
Copy link
Contributor

@courtneyholcomb courtneyholcomb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for getting this all sorted out! Looks good to me - left one comment but nothing blocking.

return extract_rendering_result

extract_stmt = extract_rendering_result.sql
case_expr = f"CASE WHEN {extract_stmt} = 0 THEN {extract_stmt} + 7 ELSE {extract_stmt} END"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason you use IF() for some engines but CASE WHEN for others?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, would it be better to define SqlExpressionNodes for IF() and CASE WHEN statements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used IF because it's shorter, but it's not ANSI standard so not all engines support it. That's probably a mistake, maybe I'll update it before merging.

As for the SqlExpressionNodes - yes, long term I think that's what we should do. We can have a SqlCaseExpressionNode that does nicely formatted case statements, that seems broadly useful. I didn't bother here because I couldn't figure out how to cleanly transform the extract node without doing it outside of the rendering layer. There are a bunch of options to think through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, after getting annoyed by the visitor pattern yet again (I think I really went over the edge with #819 ) I've come to the conclusion that using a CASE expression node has to happen inside the renderer.

I'm going to just merge this as-is and then I'll follow up with a CASE statement node and see if I can hijack the rendering accordingly. If I can't make it nicely formatted I'll just stick with what's here until we push an update to reformat our SQL to dbt-like standards - given the bad formatting we have already the shorter IF seems fine for BigQuery even if it's not consistent with the CASE/WHEN we need to use for other engines.

@tlento tlento merged commit 97752a7 into main Oct 25, 2023
33 checks passed
@tlento tlento deleted the ensure-consistent-date-part-extract-results branch October 25, 2023 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] date_part/extract returns different values for DAY OF WEEK depending on engine
2 participants