Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to FULL OUTER JOIN for derived & ratio metrics #842

Merged
merged 8 commits into from
Nov 4, 2023

Conversation

courtneyholcomb
Copy link
Contributor

@courtneyholcomb courtneyholcomb commented Nov 3, 2023

Resolves #SL-1172

Description

We've decided to switch to using FULL OUTER JOIN when combining input metrics for derived & ratio metrics. This is a step needed to avoid removing rows unexpectedly when filling nulls.
Users should note that this may change the output for derived & ratio metric queries, resulting in new output NULL rows that previously would have been removed.

@cla-bot cla-bot bot added the cla:yes label Nov 3, 2023
Copy link

github-actions bot commented Nov 3, 2023

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@@ -441,13 +441,7 @@ FROM (
) subq_8
) subq_9
ON
(
Copy link
Contributor Author

@courtneyholcomb courtneyholcomb Nov 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tlento do you have context on why for INNER JOIN we add this NULL column comparison, but for FULL OUTER JOIN we don't? Wondering if that's intentional or something I should fix!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NULL comparison is needed because NULL = NULL evaluates to NULL, which is false-y in SQL comparisons, but we have to include sets in the INNER JOIN where the dimension values on both sides are NULL. If a row doesn't match on an INNER JOIN it gets discarded, so we must handle that case in the join condition.

For FULL OUTER JOIN we can't do OR conditions (or inequalities) - a bunch of our engines don't support them. Everything has to be a strict intersection of equality checks. However, if you do a FULL OUTER JOIN and a comparison fails due to a NULL = NULL check, you still get the rows from both sides. They're just not matched to anything, so you effectively get one row per aggregated measure source. That's why we have all these MAX expressions, because that's the hack I put in to deduplicate those output rows.

I have to say I find all of this absurd, but that is where we are.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thank you!! Super helpful

@courtneyholcomb courtneyholcomb changed the title Switch to FULL OUTER JOIN for derived metrics Switch to FULL OUTER JOIN for derived & ratio metrics Nov 3, 2023
@courtneyholcomb courtneyholcomb marked this pull request as ready for review November 3, 2023 22:58
@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 3, 2023
@courtneyholcomb courtneyholcomb removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 4, 2023
@courtneyholcomb courtneyholcomb changed the title Switch to FULL OUTER JOIN for derived & ratio metrics Switch to FULL OUTER JOIN for derived & ratio metrics Nov 4, 2023
@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 4, 2023
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 4, 2023
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks! If we can simplify the code a bit further let's do that in a follow-up.

@@ -441,13 +441,7 @@ FROM (
) subq_8
) subq_9
ON
(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NULL comparison is needed because NULL = NULL evaluates to NULL, which is false-y in SQL comparisons, but we have to include sets in the INNER JOIN where the dimension values on both sides are NULL. If a row doesn't match on an INNER JOIN it gets discarded, so we must handle that case in the join condition.

For FULL OUTER JOIN we can't do OR conditions (or inequalities) - a bunch of our engines don't support them. Everything has to be a strict intersection of equality checks. However, if you do a FULL OUTER JOIN and a comparison fails due to a NULL = NULL check, you still get the rows from both sides. They're just not matched to anything, so you effectively get one row per aggregated measure source. That's why we have all these MAX expressions, because that's the hack I put in to deduplicate those output rows.

I have to say I find all of this absurd, but that is where we are.

@@ -441,13 +441,7 @@ FROM (
) subq_8
) subq_9
ON
(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


In the case of a multi-data-source FULL OUTER JOIN the join key will be a coalesced set of all previously
seen dimension values. For example:
The join key will be a coalesced set of all previously seen dimension values. For example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to do this here, but if there's logic in this function for rendering the INNER JOIN case maybe we can remove the join type parameter AND that logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly certain the INNER JOIN case is not used at all! I can remove the join_type param before merging!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...I take it back, it's more complicated than I thought 😅 so will deal with that in a follow up!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I was just on the wrong branch. Easy to add here after all! Removed the param and all it changes is the displayed properties in the dataflow plans 👍

@courtneyholcomb courtneyholcomb added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 4, 2023
@github-actions github-actions bot removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Nov 4, 2023
@courtneyholcomb courtneyholcomb merged commit 38adb04 into main Nov 4, 2023
21 checks passed
@courtneyholcomb courtneyholcomb deleted the court/outer branch November 4, 2023 01:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants