-
Notifications
You must be signed in to change notification settings - Fork 96
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to FULL OUTER JOIN
for derived & ratio metrics
#842
Conversation
Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide. |
aee393c
to
3e5e8c2
Compare
@@ -441,13 +441,7 @@ FROM ( | |||
) subq_8 | |||
) subq_9 | |||
ON | |||
( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tlento do you have context on why for INNER JOIN
we add this NULL
column comparison, but for FULL OUTER JOIN
we don't? Wondering if that's intentional or something I should fix!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NULL comparison is needed because NULL = NULL evaluates to NULL, which is false-y in SQL comparisons, but we have to include sets in the INNER JOIN where the dimension values on both sides are NULL. If a row doesn't match on an INNER JOIN it gets discarded, so we must handle that case in the join condition.
For FULL OUTER JOIN we can't do OR conditions (or inequalities) - a bunch of our engines don't support them. Everything has to be a strict intersection of equality checks. However, if you do a FULL OUTER JOIN and a comparison fails due to a NULL = NULL check, you still get the rows from both sides. They're just not matched to anything, so you effectively get one row per aggregated measure source. That's why we have all these MAX expressions, because that's the hack I put in to deduplicate those output rows.
I have to say I find all of this absurd, but that is where we are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh, I actually wrote it down!
https://github.com/dbt-labs/metricflow/blob/main/metricflow/plan_conversion/dataflow_to_sql.py#L906
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah thank you!! Super helpful
FULL OUTER JOIN
for derived & ratio metrics
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks! If we can simplify the code a bit further let's do that in a follow-up.
@@ -441,13 +441,7 @@ FROM ( | |||
) subq_8 | |||
) subq_9 | |||
ON | |||
( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NULL comparison is needed because NULL = NULL evaluates to NULL, which is false-y in SQL comparisons, but we have to include sets in the INNER JOIN where the dimension values on both sides are NULL. If a row doesn't match on an INNER JOIN it gets discarded, so we must handle that case in the join condition.
For FULL OUTER JOIN we can't do OR conditions (or inequalities) - a bunch of our engines don't support them. Everything has to be a strict intersection of equality checks. However, if you do a FULL OUTER JOIN and a comparison fails due to a NULL = NULL check, you still get the rows from both sides. They're just not matched to anything, so you effectively get one row per aggregated measure source. That's why we have all these MAX expressions, because that's the hack I put in to deduplicate those output rows.
I have to say I find all of this absurd, but that is where we are.
@@ -441,13 +441,7 @@ FROM ( | |||
) subq_8 | |||
) subq_9 | |||
ON | |||
( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh, I actually wrote it down!
https://github.com/dbt-labs/metricflow/blob/main/metricflow/plan_conversion/dataflow_to_sql.py#L906
|
||
In the case of a multi-data-source FULL OUTER JOIN the join key will be a coalesced set of all previously | ||
seen dimension values. For example: | ||
The join key will be a coalesced set of all previously seen dimension values. For example: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do this here, but if there's logic in this function for rendering the INNER JOIN case maybe we can remove the join type parameter AND that logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fairly certain the INNER JOIN case is not used at all! I can remove the join_type param before merging!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...I take it back, it's more complicated than I thought 😅 so will deal with that in a follow up!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I was just on the wrong branch. Easy to add here after all! Removed the param and all it changes is the displayed properties in the dataflow plans 👍
Resolves #SL-1172
Description
We've decided to switch to using
FULL OUTER JOIN
when combining input metrics for derived & ratio metrics. This is a step needed to avoid removing rows unexpectedly when filling nulls.Users should note that this may change the output for derived & ratio metric queries, resulting in new output
NULL
rows that previously would have been removed.