Add trino support to MetricFlow #810

sarbmeetka · 2023-10-13T14:50:15Z

Resolves #207

Description

MetricFlow should be able to generate queries against Trino. This PR add sqlclient for Trino and update test cases to support Trino connector. I have signed the CLA and added the changelog entry.

Locally, Trino is setup up using Docker container.

docker run --name trino -d -p 8080:8080 trinodb/trino

Tests were run against memory and postgres connectors in Trino. There might be changes to support every single connector supported by Trino and that should go as separate PR.

Run tests for Trino

Setup environment variable for the catalog before running tests.

# setup catalog variable before test
export DBT_ENV_SECRET_CATALOG=memory

make test-trino

Trino specific changes

There are few changes made to test cases to support Trino query generation. Below are some examples where code required changes.

Issue #1:
Trino does not implicitly handles comparison for string with timestamp literals.

Example:

-- This does not work in Trino

where ds > '2023-01-01'
or 
where ds between '2023-01-01' and '2023-02-01'

-- Needs to be

where ds > timestamp '2023-01-01'
or 
where ds between cast('2023-01-01' as timestamp) and cast('2023-02-01' as timestamp)

Issue #2:
Interval literal needs to be in quotes and date string needs to be timestamp or date type.

-- This does not work in Trino 

select '2023-01-01' - Interval 2 day;

-- Needs to be

select timestamp '2023-01-01' - Interval '2' day;

Issue #3:
Group by does not support aliases from the select

-- This is not supported
SELECT
  (bookings - ref_bookings) / CAST(bookings AS DOUBLE) AS non_referred_bookings_pct
  , metric_time__day
FROM (
  SELECT
    SUM(CASE WHEN referrer_id IS NOT NULL THEN 1 ELSE 0 END) AS ref_bookings
    , SUM(1) AS bookings
    , ds AS metric_time__day
  FROM mftest.fct_bookings
  GROUP BY
    metric_time__day
) a

-- Needs to be
SELECT
  (bookings - ref_bookings) / CAST(bookings AS DOUBLE) AS non_referred_bookings_pct
  , metric_time__day
FROM (
  SELECT
    SUM(CASE WHEN referrer_id IS NOT NULL THEN 1 ELSE 0 END) AS ref_bookings
    , SUM(1) AS bookings
    , ds AS metric_time__day
  FROM mftest.fct_bookings
  GROUP BY
    ds
) a

Issue #4:
Trino has bug with EXPLAIN CREATE TABLE where explain statement actually creates the table. This impacts dry_run function in sql_client. See trinodb/trino#130
Explain with type validate was added for Trino to avoid that issue.

tlento · 2023-10-13T17:52:35Z

Hi @sarbmeetka, thanks so much for this contribution!

This will take a bit of time to get through the review process for two reasons:

We need to get test automation configured for running our test suites against Trino. Since Trino can run in memory or on a local PostgreSQL instance you might be able to add an action step for it yourself. If you'd like to take this on you can see our Postgres and duckdb configurations and add a similar step for Trino to our sql engine test action.
Coalesce is next week and the team will be busy with events related to the conference, which may cause delays to our responses on PR updates and questions.

I'll take a pass through this later today or Monday and provide you with an initial review for this PR so that we can address any concerns and then be ready to merge as soon as all of the work we need to do on our end is complete. I'll also kick off the test suite against all currently supported engines - I suspect your changes will break our BigQuery tests because they have a non-standard approach to group by expressions. If that proves to be the case it shouldn't be too hard to fix those up to work with existing engines.

tlento

Hi @sarbmeetka , apologies for the long delay. My first experience at dbt Coalesce was great, but far more taxing than expected!

Overall, this looks great, and I very much appreciate you contributing and going through all of the test harnesses and making the relevant updates to them.

I have noted a number of inline fixes we would like to make prior to merge. To summarize, they are:

Addressing the build failure caused by the upstream rename away from SqlTimeDeltaExpression
Updating version dependencies to >=1.7.0rc0
Cleaning up the percentile handling
Ensuring the updated test cases pass on all engines (we can label the PR once everything runs) - the group by expression changes might not work in all cases. Let's start by addressing all other issues and running all of our engine tests in CI.

The last of these is by far the biggest in terms of hassle factor, so hopefully it won't be necessary.

Thanks again for this great contribution, and I look forward to seeing this working in your next update!

pyproject.toml

tlento · 2023-10-24T00:21:43Z

pyproject.toml

+	MF_TEST_ADAPTER_TYPE="trino"
+	MF_SQL_ENGINE_URL="trino://trino@localhost:8080/"


Oh neat, I'd like to see this hooked up to our CI to see if we can get it to just run this way. If so, and if it's really fast, maybe we can add it to the main unit test suite with DuckDB and Postgres. I expect Trino to be the most ANSI-compliant of all our engines so there's something nice about anchoring it in our core test suite.

dbt-metricflow/pyproject.toml

metricflow/cli/dbt_connectors/adapter_backed_client.py

metricflow/sql/render/trino.py

metricflow/test/fixtures/sql_clients/adapter_backed_ddl_client.py

metricflow/test/integration/test_cases/itest_granularity.yaml

sarbmeetka · 2023-10-24T06:15:17Z

Thanks for reviewing. I might be slow to respond as i am traveling for next couple of weeks. I will try to address those as soon i get chance.

tlento · 2023-10-24T17:17:10Z

@sarbmeetka sounds good, thanks, travel safely!

sarbmeetka · 2023-11-13T19:22:05Z

Hi @sarbmeetka , apologies for the long delay. My first experience at dbt Coalesce was great, but far more taxing than expected!

Overall, this looks great, and I very much appreciate you contributing and going through all of the test harnesses and making the relevant updates to them.

I have noted a number of inline fixes we would like to make prior to merge. To summarize, they are:

Addressing the build failure caused by the upstream rename away from SqlTimeDeltaExpression

Updating version dependencies to >=1.7.0rc0

Cleaning up the percentile handling

Ensuring the updated test cases pass on all engines (we can label the PR once everything runs) - the group by expression changes might not work in all cases. Let's start by addressing all other issues and running all of our engine tests in CI.

The last of these is by far the biggest in terms of hassle factor, so hopefully it won't be necessary.

Thanks again for this great contribution, and I look forward to seeing this working in your next update!

Thanks for the summary. Looks like there have been quite bit of changes since last time. Let me pull latest changes and work on your suggestions.

sarbmeetka · 2023-11-16T02:20:41Z

@tlento this is ready for review. I tried adding sql-engine-tests but does not look like its getting triggered yet. Let me know if i am missing anything.

tlento

Hey @sarbmeetka, thanks for the fixup! I need a couple days to get our integration test configuration up and running for Trino and then I'll do another review pass.

In the meantime I left a couple of small comments about the pyproject configuration files, but I haven't looked over the remainder of the PR just yet.

Also, there's something really wacky about the branch here - you've got a ton of extra commits that got duplicated somehow, I suspect it's something to do with merging main back to the fork and then updating here. It'd be great if you could clean that up - a clean history rewrite with just your changes would be a lot easier to read.

Thanks so much, and I'll be back here early next week!

dbt-metricflow/pyproject.toml

tlento · 2023-11-17T00:51:52Z

pyproject.toml

@@ -11,18 +11,18 @@ requires-python = ">=3.8,<3.12"
 license = "BUSL-1.1"
 keywords = []
 authors = [
-    {name = "dbt Labs"}
+    { name = "dbt Labs" }


nit: please revert the whitespace formatting changes, they're distracting and they seem to be inconsistently applied. We can apply a formatter later if it seems warranted.

.github/workflows/cd-sql-engine-tests.yaml

tlento

Hi @sarbmeetka , thanks so much for pulling this together.

On the whole the code looks good, modulo a few inline comments.

As far as the testing config goes, I have not been able to get the CI updates to work, I think there will be more to do there, and it's not really something you'll be able to test on your own.

What I'm going to try to do here is:

Duplicate this PR onto a local branch and clean up the commit history so there's a single base commit credited to you
Mess around with the CI configs on my draft PR (or possibly on a second PR on top of it)
If everything works we can get that updated PR merged, either directly or by pushing it back to your branch with a full history rewrite.

I won't get to this until after the holidays.

Please feel free to address the inline comments in the meantime. If you're able to clean up the history as well that'd be much appreciated, as it simplifies my task, but if not that's ok.

Either way, I'll update here when I make my local copy - at that point you can consider your work on this issue complete, and I'll clean up any minor issues we might uncover in testing.

Thanks again!

tlento · 2023-11-21T19:03:20Z

metricflow/test/integration/test_configured_cases.py

+        start_time: str,
+        stop_time: str,
+    ) -> str:
+        """Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'" since Trino require timestamp literals to be wrapped in a timestamp() function."""


I think this is an old comment. Let's update to something more like this:

Suggested change

"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'" since Trino require timestamp literals to be wrapped in a timestamp() function."""

"""Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'".

This will cast the literals as needed for each engine Trino, and provide an alternative to incrementing

the date as we do in render_time_constraint. Using BETWEEN is more robust for cases involving potentially

mixed granularities.

"""

Thanks for taking another look. I might get sometime tomorrow to address inline comments and try to handle rebase to pick relevant commits.

@sarbmeetka if you do that'd be great, and please feel free to split the github action changes out to a separate PR (or just remove them altogether) - that way this one would be cleaned up for merge and I can mess around with the CI configs on top of it.

Commit history should look cleaner now.

I havent changed anything on github actions yet.

metricflow/sql/render/trino.py

tlento · 2023-11-21T19:20:06Z

metricflow/sql/render/trino.py

+
+    @override
+    def render_date_part(self, date_part: DatePart) -> str:
+        """Render DATE PART for an EXTRACT expression.DAY_OF_WEEK in Trino is ISO date part to ensure all engines return consistent results."""


nit:

Suggested change

"""Render DATE PART for an EXTRACT expression.DAY_OF_WEEK in Trino is ISO date part to ensure all engines return consistent results."""

"""Render DATE PART for an EXTRACT expression.

Override DAY_OF_WEEK in Trino to ISO date part to ensure all engines return consistent results.

"""

tlento · 2023-11-21T19:21:18Z

.github/workflows/cd-sql-engine-tests.yaml

+          mf_sql_engine_url: ${{ secrets.MF_TRINO_URL }}
+          mf_sql_engine_password: ${{ secrets.MF_TRINO_PWD }}


I don't think we need these, but I have to test. The test-trino make command works on its own as long as the service is up and running, since it's doing a local service pointer the same way our postgres tests run.

tlento · 2023-11-21T19:21:55Z

.github/workflows/cd-sql-engine-tests.yaml

+    environment: DW_INTEGRATION_TESTS
+    name: Trino Tests
+    if: ${{ github.event.action != 'labeled' || github.event.label.name == 'Run Tests With Other SQL Engines' }}
+    runs-on: ubuntu-latest


I'm pretty sure we'll need to add a service container here:

https://docs.github.com/en/actions/using-containerized-services/about-service-containers

Just to be clear, I will deal with this since you don't have access to our CI runners or permission to kick off jobs.

sarbmeetka · 2023-12-07T20:33:41Z

@tlento following up on this to see if you got chance to review again with clean history?

tlento

Hi @sarbmeetka, thank you so much for your patience, I've been tied up with other commitments.

This looks great! There are three remaining things to do before we can merge:

Get this to run tests in CI both for Trino and for other engines. That's on me and I'll be working on it now, because I want to see how it goes. Assuming I get it working and everything passes the core content of the PR itself will be ready.
Get this updated onto current main so the test runner can execute on this PR directly, updating anything that turns up there
Revert any and all unnecessary changes in the various pyproject.toml files - in theory doing conflict resolution that favors whatever is in main should cover it

If you're up for one more round of updating/conflict resolution we should be able to get this merged before I go on holiday break. Otherwise I can shepherd it through in the new year.

Thank you so much for your efforts here!

pyproject.toml

tlento · 2023-12-16T01:34:39Z

metricflow/sql/render/trino.py

+    @property
+    @override
+    def expr_renderer(self) -> SqlExpressionRenderer:
+        return self.EXPR_RENDERER


These github missing newline markers annoy me. If you can, please add one here (and wherever else, there are a couple other places).

metricflow/sql/render/trino.py

tlento · 2023-12-16T02:01:10Z

Looks like the rebase is straightforward. Running tests over in #942

tlento

Ok, I got Trino running in our CI! See #942 and please note the first few commits, up to the change to the pyproject.toml, are fixes necessary here. This is NOT the configuration I will use in the end, it's just a PoC to make sure it can run. There are test failures:

https://github.com/dbt-labs/metricflow/actions/runs/7229573673/job/19700596407?pr=942

These repro in local test runs as well - basically we have some feature updates that this PR does not take into account, so we will also need to regenerate snapshots and update a couple more integration tests.

If you're willing to do those fixes based on local test runs we can get this merged and then I can open a clean PR with an updated CI configuration for us to run Trino tests with the rest of our engine-specific test suite. Otherwise I'll create a new stack of PRs and import your changes with your commit status intact, and get things merged on our end.

Thanks so much for your patience, and have a great weekend!

metricflow/test/integration/test_cases/itest_order_limit.yaml

…it castings Rebase on b6361

sarbmeetka · 2023-12-16T17:37:34Z

Ok, I got Trino running in our CI! See #942 and please note the first few commits, up to the change to the pyproject.toml, are fixes necessary here. This is NOT the configuration I will use in the end, it's just a PoC to make sure it can run. There are test failures:

https://github.com/dbt-labs/metricflow/actions/runs/7229573673/job/19700596407?pr=942

These repro in local test runs as well - basically we have some feature updates that this PR does not take into account, so we will also need to regenerate snapshots and update a couple more integration tests.

If you're willing to do those fixes based on local test runs we can get this merged and then I can open a clean PR with an updated CI configuration for us to run Trino tests with the rest of our engine-specific test suite. Otherwise I'll create a new stack of PRs and import your changes with your commit status intact, and get things merged on our end.

Thanks so much for your patience, and have a great weekend!

Thanks for responding and taking another look. I fixed some of the merge conflicts. Let me know if thats look good in your end. I have 1 test case test_cumulative_metric_with_non_adjustable_filter failing in Trino at the moment due to where clause expression since trino cannot handle implicit coversion and comparison of ts column to string. I have to spent some time here to see whats changed since it seems be added recently. Let me know if you have some suggestions on handling this.

I will appreciate if we can get these merge before holidays otherwise its seems like moving target and things break with more time passes.

sarbmeetka · 2023-12-17T03:51:46Z

All merge conflicts should be resolved and i have all test cases running successfully in local. The bigquery query should be fixed now as well.

tlento

@sarbmeetka - thanks, this looks good apart from the lint errors.

The test runner is not properly fetching the secrets stored in our test environment even though I'm the one who kicked off the tests. I suspect this is because your account does not have permission to access secrets in that actions environment. If that proves true I'll need to re-structure this into a separate PR under my account with commits authored by you, in which case I may as well run the linter while I'm at it. I'll get back to you in a bit.

One way or another we should be able to get this merged within the next day or two. Thanks again for all of your patience here!

tlento · 2023-12-18T18:38:41Z

metricflow/test/integration/query_output/test_cumulative_metrics.py

+    first_ds_expr = f"CAST('2020-03-15' AS {sql_client.sql_query_plan_renderer.expr_renderer.timestamp_data_type})"
+    second_ds_expr = f"CAST('2020-04-30' AS {sql_client.sql_query_plan_renderer.expr_renderer.timestamp_data_type})"
+    where_constraint = f"{{{{ TimeDimension('metric_time', 'day') }}}} = {first_ds_expr} or"
+    where_constraint += f" {{{{ TimeDimension('metric_time', 'day') }}}} = {second_ds_expr}"


Oh wow. Ok, Trino makes this interface issue worse, but addressing this is way outside the scope of this PR. I'll open an internal issue for us to consider here.

Yeah, but this was blocking test to succeed successfully. Wasn't sure on getting around this.

@sarbmeetka

This commit, authored by @sarbmeetka, includes all of the material changes relevant for adding support for Trino to MetricFlow. Specifically, this includes the SqlQueryPlanRenderer for Trino and integration in the CLI and unit test suite with the dbt-trino adapter. All other changes - particularly those to the contents of test queries - were made in order to bring our internal test SQL in line with Trino's ANSI-compliance requirements. The original PR (#810) commit stack has been consolidated into this one, and a separate commit with the auto-generated snapshot files for the Trino test cases. This has been done to make isolating fixes a bit simpler for CI testing prior to merge.

@sarbmeetka

This is a one-shot re-do of the snapshot generation for Trino test cases originally done throughout @sarbmeetka's PR (#810). As explained earlier, these were split out to make it easier to isolate the non-automated changes for any necessary updates in CI testing.

tlento

@sarbmeetka it turns out that it is, indeed, a problem with executing tests in our warehouse environment off of this PR. I have opened #944 in its place and put it up for review with the team. That review should be mechanical. I'm going to take the unusual step of adding an approving review here, and then closing the PR in favor of its replacement, which only really exists as a mechanism for us to be able to merge a tested version of this as soon as possible.

Thank you again for your contribution! We do not plan to run another deployment before the holidays, so look for this change to roll out with our first release of 2024!

tlento · 2023-12-19T03:19:53Z

Closing in favor of the merge-ready copy of this in #944

cla-bot bot added the cla:yes label Oct 13, 2023

tlento requested changes Oct 24, 2023

View reviewed changes

sarbmeetka marked this pull request as draft November 13, 2023 20:33

sarbmeetka closed this Nov 14, 2023

sarbmeetka reopened this Nov 14, 2023

sarbmeetka requested a review from tlento November 14, 2023 23:51

sarbmeetka marked this pull request as ready for review November 14, 2023 23:54

tlento reviewed Nov 17, 2023

View reviewed changes

tlento self-requested a review November 17, 2023 01:03

tlento reviewed Nov 21, 2023

View reviewed changes

sarbmeetka force-pushed the add-trino-support branch 2 times, most recently from 7449a26 to 2daf084 Compare November 22, 2023 23:10

sarbmeetka requested a review from tlento November 28, 2023 16:48

tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Dec 16, 2023

tlento reviewed Dec 16, 2023

View reviewed changes

tlento removed the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Dec 16, 2023

tlento reviewed Dec 16, 2023

View reviewed changes

metricflow/test/integration/test_cases/itest_order_limit.yaml Outdated Show resolved Hide resolved

sarbmeetka added 6 commits December 16, 2023 07:59

Add trino adapter support for MetricFlow

e0e95c1

lint

cfd11c3

Handle timestamp literals for range operator

e857500

Update test cases to handle group by for Trino in metrics calculation

16feb86

Handle dry run and approx percentile for Trino along with more explic…

72e9112

…it castings Rebase on b6361

Add changelog

bf83458

sarbmeetka added 10 commits December 16, 2023 08:06

Fix percentile function and rename sqltimedelta function in Trino client

73b71da

Handle extract for Trino

1738d09

fix lint issues

69bb34b

Fix additional test case snapshots related to extract function

b54e03b

Use positional arguments in group by to fix tests

949f3d1

Fix time delta expression for Trino and regenerate snapshot

bfd1414

Add trino to cd-sql-engine-tests

00dc026

Address function inline comments

f4df9fa

bump version for dependencies

c9d7229

Fix test cases for trino related to ts scalar expression

6034199

sarbmeetka force-pushed the add-trino-support branch from 3fe2014 to 6034199 Compare December 16, 2023 17:26

change grouping to positional param for order limit test

c009a84

Fix pending test case

4c9021a

sarbmeetka requested a review from tlento December 18, 2023 17:12

tlento added the Run Tests With Other SQL Engines Runs the test suite against the SQL engines in our target environment label Dec 18, 2023

tlento had a problem deploying to DW_INTEGRATION_TESTS December 18, 2023 18:35 — with GitHub Actions Failure

tlento reviewed Dec 18, 2023

View reviewed changes

tlento mentioned this pull request Dec 19, 2023

Add Trino Support #944

Merged

tlento approved these changes Dec 19, 2023

View reviewed changes

tlento closed this Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add trino support to MetricFlow #810

Add trino support to MetricFlow #810

sarbmeetka commented Oct 13, 2023

tlento commented Oct 13, 2023

tlento left a comment •

edited

Loading

tlento Oct 24, 2023

sarbmeetka commented Oct 24, 2023

tlento commented Oct 24, 2023

sarbmeetka commented Nov 13, 2023

sarbmeetka commented Nov 16, 2023

tlento left a comment

tlento Nov 17, 2023

tlento left a comment

tlento Nov 21, 2023

sarbmeetka Nov 22, 2023

tlento Nov 22, 2023

sarbmeetka Nov 22, 2023

sarbmeetka Nov 22, 2023

tlento Nov 21, 2023

sarbmeetka Nov 22, 2023

tlento Nov 21, 2023

tlento Nov 21, 2023

tlento Nov 21, 2023

sarbmeetka commented Dec 7, 2023

tlento left a comment

tlento Dec 16, 2023

tlento commented Dec 16, 2023

tlento left a comment

sarbmeetka commented Dec 16, 2023

sarbmeetka commented Dec 17, 2023

tlento left a comment

tlento Dec 18, 2023

sarbmeetka Dec 18, 2023

tlento left a comment

tlento commented Dec 19, 2023

		MF_TEST_ADAPTER_TYPE="trino"
		MF_SQL_ENGINE_URL="trino://trino@localhost:8080/"

-        """Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'" since Trino require timestamp literals to be wrapped in a timestamp() function."""
+        """Render an expression like "ds between timestamp '2020-01-01' AND timestamp '2020-01-02'".
+         This will cast the literals as needed for each engine Trino, and provide an alternative to incrementing
+         the date as we do in render_time_constraint. Using BETWEEN is more robust for cases involving potentially
+         mixed granularities.
+         """

		mf_sql_engine_url: ${{ secrets.MF_TRINO_URL }}
		mf_sql_engine_password: ${{ secrets.MF_TRINO_PWD }}

Add trino support to MetricFlow #810

Add trino support to MetricFlow #810

Conversation

sarbmeetka commented Oct 13, 2023

Description

tlento commented Oct 13, 2023

tlento left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarbmeetka commented Oct 24, 2023

tlento commented Oct 24, 2023

sarbmeetka commented Nov 13, 2023

sarbmeetka commented Nov 16, 2023

tlento left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlento left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarbmeetka commented Dec 7, 2023

tlento left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlento commented Dec 16, 2023

tlento left a comment

Choose a reason for hiding this comment

sarbmeetka commented Dec 16, 2023

sarbmeetka commented Dec 17, 2023

tlento left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tlento left a comment

Choose a reason for hiding this comment

tlento commented Dec 19, 2023

tlento left a comment •

edited

Loading