-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add trino support to MetricFlow #810
Conversation
Hi @sarbmeetka, thanks so much for this contribution! This will take a bit of time to get through the review process for two reasons:
I'll take a pass through this later today or Monday and provide you with an initial review for this PR so that we can address any concerns and then be ready to merge as soon as all of the work we need to do on our end is complete. I'll also kick off the test suite against all currently supported engines - I suspect your changes will break our BigQuery tests because they have a non-standard approach to group by expressions. If that proves to be the case it shouldn't be too hard to fix those up to work with existing engines. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sarbmeetka , apologies for the long delay. My first experience at dbt Coalesce was great, but far more taxing than expected!
Overall, this looks great, and I very much appreciate you contributing and going through all of the test harnesses and making the relevant updates to them.
I have noted a number of inline fixes we would like to make prior to merge. To summarize, they are:
- Addressing the build failure caused by the upstream rename away from SqlTimeDeltaExpression
- Updating version dependencies to >=1.7.0rc0
- Cleaning up the percentile handling
- Ensuring the updated test cases pass on all engines (we can label the PR once everything runs) - the group by expression changes might not work in all cases. Let's start by addressing all other issues and running all of our engine tests in CI.
The last of these is by far the biggest in terms of hassle factor, so hopefully it won't be necessary.
Thanks again for this great contribution, and I look forward to seeing this working in your next update!
pyproject.toml
Outdated
MF_TEST_ADAPTER_TYPE="trino" | ||
MF_SQL_ENGINE_URL="trino://trino@localhost:8080/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh neat, I'd like to see this hooked up to our CI to see if we can get it to just run this way. If so, and if it's really fast, maybe we can add it to the main unit test suite with DuckDB and Postgres. I expect Trino to be the most ANSI-compliant of all our engines so there's something nice about anchoring it in our core test suite.
Thanks for reviewing. I might be slow to respond as i am traveling for next couple of weeks. I will try to address those as soon i get chance. |
@sarbmeetka sounds good, thanks, travel safely! |
Thanks for the summary. Looks like there have been quite bit of changes since last time. Let me pull latest changes and work on your suggestions. |
@tlento this is ready for review. I tried adding sql-engine-tests but does not look like its getting triggered yet. Let me know if i am missing anything. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @sarbmeetka, thanks for the fixup! I need a couple days to get our integration test configuration up and running for Trino and then I'll do another review pass.
In the meantime I left a couple of small comments about the pyproject configuration files, but I haven't looked over the remainder of the PR just yet.
Also, there's something really wacky about the branch here - you've got a ton of extra commits that got duplicated somehow, I suspect it's something to do with merging main back to the fork and then updating here. It'd be great if you could clean that up - a clean history rewrite with just your changes would be a lot easier to read.
Thanks so much, and I'll be back here early next week!
pyproject.toml
Outdated
@@ -11,18 +11,18 @@ requires-python = ">=3.8,<3.12" | |||
license = "BUSL-1.1" | |||
keywords = [] | |||
authors = [ | |||
{name = "dbt Labs"} | |||
{ name = "dbt Labs" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please revert the whitespace formatting changes, they're distracting and they seem to be inconsistently applied. We can apply a formatter later if it seems warranted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sarbmeetka , thanks so much for pulling this together.
On the whole the code looks good, modulo a few inline comments.
As far as the testing config goes, I have not been able to get the CI updates to work, I think there will be more to do there, and it's not really something you'll be able to test on your own.
What I'm going to try to do here is:
- Duplicate this PR onto a local branch and clean up the commit history so there's a single base commit credited to you
- Mess around with the CI configs on my draft PR (or possibly on a second PR on top of it)
- If everything works we can get that updated PR merged, either directly or by pushing it back to your branch with a full history rewrite.
I won't get to this until after the holidays.
Please feel free to address the inline comments in the meantime. If you're able to clean up the history as well that'd be much appreciated, as it simplifies my task, but if not that's ok.
Either way, I'll update here when I make my local copy - at that point you can consider your work on this issue complete, and I'll clean up any minor issues we might uncover in testing.
Thanks again!
start_time: str, | ||
stop_time: str, | ||
) -> str: | ||
"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'" since Trino require timestamp literals to be wrapped in a timestamp() function.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an old comment. Let's update to something more like this:
"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'" since Trino require timestamp literals to be wrapped in a timestamp() function.""" | |
"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'". | |
This will cast the literals as needed for each engine Trino, and provide an alternative to incrementing | |
the date as we do in render_time_constraint. Using BETWEEN is more robust for cases involving potentially | |
mixed granularities. | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking another look. I might get sometime tomorrow to address inline comments and try to handle rebase to pick relevant commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarbmeetka if you do that'd be great, and please feel free to split the github action changes out to a separate PR (or just remove them altogether) - that way this one would be cleaned up for merge and I can mess around with the CI configs on top of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commit history should look cleaner now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I havent changed anything on github actions yet.
metricflow/sql/render/trino.py
Outdated
|
||
@override | ||
def render_date_part(self, date_part: DatePart) -> str: | ||
"""Render DATE PART for an EXTRACT expression.DAY_OF_WEEK in Trino is ISO date part to ensure all engines return consistent results.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
"""Render DATE PART for an EXTRACT expression.DAY_OF_WEEK in Trino is ISO date part to ensure all engines return consistent results.""" | |
"""Render DATE PART for an EXTRACT expression. | |
Override DAY_OF_WEEK in Trino to ISO date part to ensure all engines return consistent results. | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
mf_sql_engine_url: ${{ secrets.MF_TRINO_URL }} | ||
mf_sql_engine_password: ${{ secrets.MF_TRINO_PWD }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need these, but I have to test. The test-trino make command works on its own as long as the service is up and running, since it's doing a local service pointer the same way our postgres tests run.
environment: DW_INTEGRATION_TESTS | ||
name: Trino Tests | ||
if: ${{ github.event.action != 'labeled' || github.event.label.name == 'Run Tests With Other SQL Engines' }} | ||
runs-on: ubuntu-latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure we'll need to add a service container here:
https://docs.github.com/en/actions/using-containerized-services/about-service-containers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, I will deal with this since you don't have access to our CI runners or permission to kick off jobs.
7449a26
to
2daf084
Compare
@tlento following up on this to see if you got chance to review again with clean history? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @sarbmeetka, thank you so much for your patience, I've been tied up with other commitments.
This looks great! There are three remaining things to do before we can merge:
- Get this to run tests in CI both for Trino and for other engines. That's on me and I'll be working on it now, because I want to see how it goes. Assuming I get it working and everything passes the core content of the PR itself will be ready.
- Get this updated onto current main so the test runner can execute on this PR directly, updating anything that turns up there
- Revert any and all unnecessary changes in the various pyproject.toml files - in theory doing conflict resolution that favors whatever is in main should cover it
If you're up for one more round of updating/conflict resolution we should be able to get this merged before I go on holiday break. Otherwise I can shepherd it through in the new year.
Thank you so much for your efforts here!
@property | ||
@override | ||
def expr_renderer(self) -> SqlExpressionRenderer: | ||
return self.EXPR_RENDERER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These github missing newline markers annoy me. If you can, please add one here (and wherever else, there are a couple other places).
Looks like the rebase is straightforward. Running tests over in #942 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I got Trino running in our CI! See #942 and please note the first few commits, up to the change to the pyproject.toml, are fixes necessary here. This is NOT the configuration I will use in the end, it's just a PoC to make sure it can run. There are test failures:
https://github.com/dbt-labs/metricflow/actions/runs/7229573673/job/19700596407?pr=942
These repro in local test runs as well - basically we have some feature updates that this PR does not take into account, so we will also need to regenerate snapshots and update a couple more integration tests.
If you're willing to do those fixes based on local test runs we can get this merged and then I can open a clean PR with an updated CI configuration for us to run Trino tests with the rest of our engine-specific test suite. Otherwise I'll create a new stack of PRs and import your changes with your commit status intact, and get things merged on our end.
Thanks so much for your patience, and have a great weekend!
…it castings Rebase on b6361
3fe2014
to
6034199
Compare
Thanks for responding and taking another look. I fixed some of the merge conflicts. Let me know if thats look good in your end. I have 1 test case I will appreciate if we can get these merge before holidays otherwise its seems like moving target and things break with more time passes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarbmeetka - thanks, this looks good apart from the lint errors.
The test runner is not properly fetching the secrets stored in our test environment even though I'm the one who kicked off the tests. I suspect this is because your account does not have permission to access secrets in that actions environment. If that proves true I'll need to re-structure this into a separate PR under my account with commits authored by you, in which case I may as well run the linter while I'm at it. I'll get back to you in a bit.
One way or another we should be able to get this merged within the next day or two. Thanks again for all of your patience here!
first_ds_expr = f"CAST('2020-03-15' AS {sql_client.sql_query_plan_renderer.expr_renderer.timestamp_data_type})" | ||
second_ds_expr = f"CAST('2020-04-30' AS {sql_client.sql_query_plan_renderer.expr_renderer.timestamp_data_type})" | ||
where_constraint = f"{{{{ TimeDimension('metric_time', 'day') }}}} = {first_ds_expr} or" | ||
where_constraint += f" {{{{ TimeDimension('metric_time', 'day') }}}} = {second_ds_expr}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow. Ok, Trino makes this interface issue worse, but addressing this is way outside the scope of this PR. I'll open an internal issue for us to consider here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, but this was blocking test to succeed successfully. Wasn't sure on getting around this.
This commit, authored by @sarbmeetka, includes all of the material changes relevant for adding support for Trino to MetricFlow. Specifically, this includes the SqlQueryPlanRenderer for Trino and integration in the CLI and unit test suite with the dbt-trino adapter. All other changes - particularly those to the contents of test queries - were made in order to bring our internal test SQL in line with Trino's ANSI-compliance requirements. The original PR (#810) commit stack has been consolidated into this one, and a separate commit with the auto-generated snapshot files for the Trino test cases. This has been done to make isolating fixes a bit simpler for CI testing prior to merge.
This is a one-shot re-do of the snapshot generation for Trino test cases originally done throughout @sarbmeetka's PR (#810). As explained earlier, these were split out to make it easier to isolate the non-automated changes for any necessary updates in CI testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sarbmeetka it turns out that it is, indeed, a problem with executing tests in our warehouse environment off of this PR. I have opened #944 in its place and put it up for review with the team. That review should be mechanical. I'm going to take the unusual step of adding an approving review here, and then closing the PR in favor of its replacement, which only really exists as a mechanism for us to be able to merge a tested version of this as soon as possible.
Thank you again for your contribution! We do not plan to run another deployment before the holidays, so look for this change to roll out with our first release of 2024!
Closing in favor of the merge-ready copy of this in #944 |
Resolves #207
Description
MetricFlow should be able to generate queries against Trino. This PR add sqlclient for Trino and update test cases to support Trino connector. I have signed the CLA and added the changelog entry.
Locally, Trino is setup up using Docker container.
Tests were run against
memory
andpostgres
connectors in Trino. There might be changes to support every single connector supported by Trino and that should go as separate PR.Run tests for Trino
Setup environment variable for the catalog before running tests.
Trino specific changes
There are few changes made to test cases to support Trino query generation. Below are some examples where code required changes.
Issue #1:
Trino does not implicitly handles comparison for string with timestamp literals.
Example:
Issue #2:
Interval literal needs to be in quotes and date string needs to be timestamp or date type.
Issue #3:
Group by does not support aliases from the select
Issue #4:
Trino has bug with
EXPLAIN CREATE TABLE
where explain statement actually creates the table. This impactsdry_run
function in sql_client. See trinodb/trino#130Explain with type validate was added for Trino to avoid that issue.