Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add trino support to MetricFlow #810

Closed
wants to merge 18 commits into from

Conversation

sarbmeetka
Copy link
Contributor

Resolves #207

Description

MetricFlow should be able to generate queries against Trino. This PR add sqlclient for Trino and update test cases to support Trino connector. I have signed the CLA and added the changelog entry.

Locally, Trino is setup up using Docker container.

docker run --name trino -d -p 8080:8080 trinodb/trino

Tests were run against memory and postgres connectors in Trino. There might be changes to support every single connector supported by Trino and that should go as separate PR.

Run tests for Trino

Setup environment variable for the catalog before running tests.

# setup catalog variable before test
export DBT_ENV_SECRET_CATALOG=memory

make test-trino

Trino specific changes

There are few changes made to test cases to support Trino query generation. Below are some examples where code required changes.

Issue #1:
Trino does not implicitly handles comparison for string with timestamp literals.

Example:

-- This does not work in Trino

where ds > '2023-01-01'
or 
where ds between '2023-01-01' and '2023-02-01'

-- Needs to be

where ds > timestamp '2023-01-01'
or 
where ds between cast('2023-01-01' as timestamp) and cast('2023-02-01' as timestamp)

Issue #2:
Interval literal needs to be in quotes and date string needs to be timestamp or date type.

-- This does not work in Trino 

select '2023-01-01' - Interval 2 day;

-- Needs to be

select timestamp '2023-01-01' - Interval '2' day;

Issue #3:
Group by does not support aliases from the select

-- This is not supported
SELECT
  (bookings - ref_bookings) / CAST(bookings AS DOUBLE) AS non_referred_bookings_pct
  , metric_time__day
FROM (
  SELECT
    SUM(CASE WHEN referrer_id IS NOT NULL THEN 1 ELSE 0 END) AS ref_bookings
    , SUM(1) AS bookings
    , ds AS metric_time__day
  FROM mftest.fct_bookings
  GROUP BY
    metric_time__day
) a

-- Needs to be
SELECT
  (bookings - ref_bookings) / CAST(bookings AS DOUBLE) AS non_referred_bookings_pct
  , metric_time__day
FROM (
  SELECT
    SUM(CASE WHEN referrer_id IS NOT NULL THEN 1 ELSE 0 END) AS ref_bookings
    , SUM(1) AS bookings
    , ds AS metric_time__day
  FROM mftest.fct_bookings
  GROUP BY
    ds
) a

Issue #4:
Trino has bug with EXPLAIN CREATE TABLE where explain statement actually creates the table. This impacts dry_run function in sql_client. See trinodb/trino#130
Explain with type validate was added for Trino to avoid that issue.

@cla-bot cla-bot bot added the cla:yes label Oct 13, 2023
@tlento
Copy link
Contributor

tlento commented Oct 13, 2023

Hi @sarbmeetka, thanks so much for this contribution!

This will take a bit of time to get through the review process for two reasons:

  1. We need to get test automation configured for running our test suites against Trino. Since Trino can run in memory or on a local PostgreSQL instance you might be able to add an action step for it yourself. If you'd like to take this on you can see our Postgres and duckdb configurations and add a similar step for Trino to our sql engine test action.
  2. Coalesce is next week and the team will be busy with events related to the conference, which may cause delays to our responses on PR updates and questions.

I'll take a pass through this later today or Monday and provide you with an initial review for this PR so that we can address any concerns and then be ready to merge as soon as all of the work we need to do on our end is complete. I'll also kick off the test suite against all currently supported engines - I suspect your changes will break our BigQuery tests because they have a non-standard approach to group by expressions. If that proves to be the case it shouldn't be too hard to fix those up to work with existing engines.

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sarbmeetka , apologies for the long delay. My first experience at dbt Coalesce was great, but far more taxing than expected!

Overall, this looks great, and I very much appreciate you contributing and going through all of the test harnesses and making the relevant updates to them.

I have noted a number of inline fixes we would like to make prior to merge. To summarize, they are:

  1. Addressing the build failure caused by the upstream rename away from SqlTimeDeltaExpression
  2. Updating version dependencies to >=1.7.0rc0
  3. Cleaning up the percentile handling
  4. Ensuring the updated test cases pass on all engines (we can label the PR once everything runs) - the group by expression changes might not work in all cases. Let's start by addressing all other issues and running all of our engine tests in CI.

The last of these is by far the biggest in terms of hassle factor, so hopefully it won't be necessary.

Thanks again for this great contribution, and I look forward to seeing this working in your next update!

pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated
Comment on lines 193 to 194
MF_TEST_ADAPTER_TYPE="trino"
MF_SQL_ENGINE_URL="trino://trino@localhost:8080/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh neat, I'd like to see this hooked up to our CI to see if we can get it to just run this way. If so, and if it's really fast, maybe we can add it to the main unit test suite with DuckDB and Postgres. I expect Trino to be the most ANSI-compliant of all our engines so there's something nice about anchoring it in our core test suite.

dbt-metricflow/pyproject.toml Outdated Show resolved Hide resolved
metricflow/sql/render/trino.py Outdated Show resolved Hide resolved
metricflow/sql/render/trino.py Outdated Show resolved Hide resolved
metricflow/sql/render/trino.py Outdated Show resolved Hide resolved
@sarbmeetka
Copy link
Contributor Author

Thanks for reviewing. I might be slow to respond as i am traveling for next couple of weeks. I will try to address those as soon i get chance.

@tlento
Copy link
Contributor

tlento commented Oct 24, 2023

@sarbmeetka sounds good, thanks, travel safely!

@sarbmeetka
Copy link
Contributor Author

Hi @sarbmeetka , apologies for the long delay. My first experience at dbt Coalesce was great, but far more taxing than expected!

Overall, this looks great, and I very much appreciate you contributing and going through all of the test harnesses and making the relevant updates to them.

I have noted a number of inline fixes we would like to make prior to merge. To summarize, they are:

  1. Addressing the build failure caused by the upstream rename away from SqlTimeDeltaExpression
  2. Updating version dependencies to >=1.7.0rc0
  3. Cleaning up the percentile handling
  4. Ensuring the updated test cases pass on all engines (we can label the PR once everything runs) - the group by expression changes might not work in all cases. Let's start by addressing all other issues and running all of our engine tests in CI.

The last of these is by far the biggest in terms of hassle factor, so hopefully it won't be necessary.

Thanks again for this great contribution, and I look forward to seeing this working in your next update!

Thanks for the summary. Looks like there have been quite bit of changes since last time. Let me pull latest changes and work on your suggestions.

@sarbmeetka sarbmeetka marked this pull request as draft November 13, 2023 20:33
@sarbmeetka sarbmeetka closed this Nov 14, 2023
@sarbmeetka sarbmeetka reopened this Nov 14, 2023
@sarbmeetka sarbmeetka requested a review from tlento November 14, 2023 23:51
@sarbmeetka sarbmeetka marked this pull request as ready for review November 14, 2023 23:54
@sarbmeetka
Copy link
Contributor Author

@tlento this is ready for review. I tried adding sql-engine-tests but does not look like its getting triggered yet. Let me know if i am missing anything.

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @sarbmeetka, thanks for the fixup! I need a couple days to get our integration test configuration up and running for Trino and then I'll do another review pass.

In the meantime I left a couple of small comments about the pyproject configuration files, but I haven't looked over the remainder of the PR just yet.

Also, there's something really wacky about the branch here - you've got a ton of extra commits that got duplicated somehow, I suspect it's something to do with merging main back to the fork and then updating here. It'd be great if you could clean that up - a clean history rewrite with just your changes would be a lot easier to read.

Thanks so much, and I'll be back here early next week!

dbt-metricflow/pyproject.toml Outdated Show resolved Hide resolved
pyproject.toml Outdated
@@ -11,18 +11,18 @@ requires-python = ">=3.8,<3.12"
license = "BUSL-1.1"
keywords = []
authors = [
{name = "dbt Labs"}
{ name = "dbt Labs" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please revert the whitespace formatting changes, they're distracting and they seem to be inconsistently applied. We can apply a formatter later if it seems warranted.

.github/workflows/cd-sql-engine-tests.yaml Outdated Show resolved Hide resolved
@tlento tlento self-requested a review November 17, 2023 01:03
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sarbmeetka , thanks so much for pulling this together.

On the whole the code looks good, modulo a few inline comments.

As far as the testing config goes, I have not been able to get the CI updates to work, I think there will be more to do there, and it's not really something you'll be able to test on your own.

What I'm going to try to do here is:

  1. Duplicate this PR onto a local branch and clean up the commit history so there's a single base commit credited to you
  2. Mess around with the CI configs on my draft PR (or possibly on a second PR on top of it)
  3. If everything works we can get that updated PR merged, either directly or by pushing it back to your branch with a full history rewrite.

I won't get to this until after the holidays.

Please feel free to address the inline comments in the meantime. If you're able to clean up the history as well that'd be much appreciated, as it simplifies my task, but if not that's ok.

Either way, I'll update here when I make my local copy - at that point you can consider your work on this issue complete, and I'll clean up any minor issues we might uncover in testing.

Thanks again!

start_time: str,
stop_time: str,
) -> str:
"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'" since Trino require timestamp literals to be wrapped in a timestamp() function."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is an old comment. Let's update to something more like this:

Suggested change
"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'" since Trino require timestamp literals to be wrapped in a timestamp() function."""
"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'".
This will cast the literals as needed for each engine Trino, and provide an alternative to incrementing
the date as we do in render_time_constraint. Using BETWEEN is more robust for cases involving potentially
mixed granularities.
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking another look. I might get sometime tomorrow to address inline comments and try to handle rebase to pick relevant commits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarbmeetka if you do that'd be great, and please feel free to split the github action changes out to a separate PR (or just remove them altogether) - that way this one would be cleaned up for merge and I can mess around with the CI configs on top of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit history should look cleaner now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I havent changed anything on github actions yet.

metricflow/sql/render/trino.py Outdated Show resolved Hide resolved
metricflow/sql/render/trino.py Outdated Show resolved Hide resolved

@override
def render_date_part(self, date_part: DatePart) -> str:
"""Render DATE PART for an EXTRACT expression.DAY_OF_WEEK in Trino is ISO date part to ensure all engines return consistent results."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
"""Render DATE PART for an EXTRACT expression.DAY_OF_WEEK in Trino is ISO date part to ensure all engines return consistent results."""
"""Render DATE PART for an EXTRACT expression.
Override DAY_OF_WEEK in Trino to ISO date part to ensure all engines return consistent results.
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 107 to 108
mf_sql_engine_url: ${{ secrets.MF_TRINO_URL }}
mf_sql_engine_password: ${{ secrets.MF_TRINO_PWD }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need these, but I have to test. The test-trino make command works on its own as long as the service is up and running, since it's doing a local service pointer the same way our postgres tests run.

environment: DW_INTEGRATION_TESTS
name: Trino Tests
if: ${{ github.event.action != 'labeled' || github.event.label.name == 'Run Tests With Other SQL Engines' }}
runs-on: ubuntu-latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we'll need to add a service container here:

https://docs.github.com/en/actions/using-containerized-services/about-service-containers

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, I will deal with this since you don't have access to our CI runners or permission to kick off jobs.

@sarbmeetka sarbmeetka force-pushed the add-trino-support branch 2 times, most recently from 7449a26 to 2daf084 Compare November 22, 2023 23:10
@sarbmeetka sarbmeetka requested a review from tlento November 28, 2023 16:48
@sarbmeetka
Copy link
Contributor Author

@tlento following up on this to see if you got chance to review again with clean history?

@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Dec 16, 2023
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sarbmeetka, thank you so much for your patience, I've been tied up with other commitments.

This looks great! There are three remaining things to do before we can merge:

  1. Get this to run tests in CI both for Trino and for other engines. That's on me and I'll be working on it now, because I want to see how it goes. Assuming I get it working and everything passes the core content of the PR itself will be ready.
  2. Get this updated onto current main so the test runner can execute on this PR directly, updating anything that turns up there
  3. Revert any and all unnecessary changes in the various pyproject.toml files - in theory doing conflict resolution that favors whatever is in main should cover it

If you're up for one more round of updating/conflict resolution we should be able to get this merged before I go on holiday break. Otherwise I can shepherd it through in the new year.

Thank you so much for your efforts here!

pyproject.toml Outdated Show resolved Hide resolved
@property
@override
def expr_renderer(self) -> SqlExpressionRenderer:
return self.EXPR_RENDERER
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These github missing newline markers annoy me. If you can, please add one here (and wherever else, there are a couple other places).

metricflow/sql/render/trino.py Show resolved Hide resolved
@tlento
Copy link
Contributor

tlento commented Dec 16, 2023

Looks like the rebase is straightforward. Running tests over in #942

@tlento tlento removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Dec 16, 2023
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I got Trino running in our CI! See #942 and please note the first few commits, up to the change to the pyproject.toml, are fixes necessary here. This is NOT the configuration I will use in the end, it's just a PoC to make sure it can run. There are test failures:

https://github.com/dbt-labs/metricflow/actions/runs/7229573673/job/19700596407?pr=942

These repro in local test runs as well - basically we have some feature updates that this PR does not take into account, so we will also need to regenerate snapshots and update a couple more integration tests.

If you're willing to do those fixes based on local test runs we can get this merged and then I can open a clean PR with an updated CI configuration for us to run Trino tests with the rest of our engine-specific test suite. Otherwise I'll create a new stack of PRs and import your changes with your commit status intact, and get things merged on our end.

Thanks so much for your patience, and have a great weekend!

@sarbmeetka
Copy link
Contributor Author

Ok, I got Trino running in our CI! See #942 and please note the first few commits, up to the change to the pyproject.toml, are fixes necessary here. This is NOT the configuration I will use in the end, it's just a PoC to make sure it can run. There are test failures:

https://github.com/dbt-labs/metricflow/actions/runs/7229573673/job/19700596407?pr=942

These repro in local test runs as well - basically we have some feature updates that this PR does not take into account, so we will also need to regenerate snapshots and update a couple more integration tests.

If you're willing to do those fixes based on local test runs we can get this merged and then I can open a clean PR with an updated CI configuration for us to run Trino tests with the rest of our engine-specific test suite. Otherwise I'll create a new stack of PRs and import your changes with your commit status intact, and get things merged on our end.

Thanks so much for your patience, and have a great weekend!

Thanks for responding and taking another look. I fixed some of the merge conflicts. Let me know if thats look good in your end. I have 1 test case test_cumulative_metric_with_non_adjustable_filter failing in Trino at the moment due to where clause expression since trino cannot handle implicit coversion and comparison of ts column to string. I have to spent some time here to see whats changed since it seems be added recently. Let me know if you have some suggestions on handling this.

I will appreciate if we can get these merge before holidays otherwise its seems like moving target and things break with more time passes.

@sarbmeetka
Copy link
Contributor Author

All merge conflicts should be resolved and i have all test cases running successfully in local. The bigquery query should be fixed now as well.
image

@sarbmeetka sarbmeetka requested a review from tlento December 18, 2023 17:12
@tlento tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Dec 18, 2023
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS December 18, 2023 18:35 — with GitHub Actions Failure
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS December 18, 2023 18:35 — with GitHub Actions Failure
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS December 18, 2023 18:35 — with GitHub Actions Failure
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS December 18, 2023 18:35 — with GitHub Actions Failure
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarbmeetka - thanks, this looks good apart from the lint errors.

The test runner is not properly fetching the secrets stored in our test environment even though I'm the one who kicked off the tests. I suspect this is because your account does not have permission to access secrets in that actions environment. If that proves true I'll need to re-structure this into a separate PR under my account with commits authored by you, in which case I may as well run the linter while I'm at it. I'll get back to you in a bit.

One way or another we should be able to get this merged within the next day or two. Thanks again for all of your patience here!

Comment on lines +138 to +141
first_ds_expr = f"CAST('2020-03-15' AS {sql_client.sql_query_plan_renderer.expr_renderer.timestamp_data_type})"
second_ds_expr = f"CAST('2020-04-30' AS {sql_client.sql_query_plan_renderer.expr_renderer.timestamp_data_type})"
where_constraint = f"{{{{ TimeDimension('metric_time', 'day') }}}} = {first_ds_expr} or"
where_constraint += f" {{{{ TimeDimension('metric_time', 'day') }}}} = {second_ds_expr}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow. Ok, Trino makes this interface issue worse, but addressing this is way outside the scope of this PR. I'll open an internal issue for us to consider here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but this was blocking test to succeed successfully. Wasn't sure on getting around this.

tlento pushed a commit that referenced this pull request Dec 19, 2023
This commit, authored by @sarbmeetka, includes all of the material
changes relevant for adding support for Trino to MetricFlow.

Specifically, this includes the SqlQueryPlanRenderer for Trino and
integration in the CLI and unit test suite with the dbt-trino
adapter. All other changes - particularly those to the contents
of test queries - were made in order to bring our internal test
SQL in line with Trino's ANSI-compliance requirements.

The original PR (#810)
commit stack has been consolidated into this one, and a separate
commit with the auto-generated snapshot files for the Trino test
cases. This has been done to make isolating fixes a bit simpler
for CI testing prior to merge.
tlento added a commit that referenced this pull request Dec 19, 2023
This is a one-shot re-do of the snapshot generation for Trino
test cases originally done throughout @sarbmeetka's PR
(#810).

As explained earlier, these were split out to make it easier to
isolate the non-automated changes for any necessary updates in
CI testing.
@tlento tlento mentioned this pull request Dec 19, 2023
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarbmeetka it turns out that it is, indeed, a problem with executing tests in our warehouse environment off of this PR. I have opened #944 in its place and put it up for review with the team. That review should be mechanical. I'm going to take the unusual step of adding an approving review here, and then closing the PR in favor of its replacement, which only really exists as a mechanism for us to be able to merge a tested version of this as soon as possible.

Thank you again for your contribution! We do not plan to run another deployment before the holidays, so look for this change to roll out with our first release of 2024!

@tlento
Copy link
Contributor

tlento commented Dec 19, 2023

Closing in favor of the merge-ready copy of this in #944

@tlento tlento closed this Dec 19, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla:yes Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support Trino
2 participants