Querying dimensions without metrics #804
Conversation
Ok, let's go with this!
There are no blocking concerns but I do have a lot of small quality considerations that we should address one way or another in the very near future.
Whatever you feel is worth fixing prior to merge, please go ahead and update. I'm happy to take another look if you'd like, but I don't need to.
After merge I'll go through and file issues for whatever follow-ups we still need to make.
Thank you!
metricflow/query/query_parser.py (Outdated)
read_nodes=self._read_nodes,
node_output_resolver=self._node_output_resolver,
Checkpoint: this must be changed in a later commit one way or the other.
What do you mean??
for read_node in read_nodes:
    output_data_set = node_output_resolver.get_output_data_set(read_node)
This feels like potentially a lot of iterating through datasets to me since we'll call this function once for every partial time dimension spec, but I think it's fine for now.
Yeah, it would probably be better to create a dict of time dimension instances to their output datasets on init (or something like that) so we only need to do this loop once. Not sure how complicated that change will be; let me look into it and see if it's worth doing before merge.
FWIW, it turns out we are caching the output dataset, so `get_output_data_set()` is just a dictionary lookup after the first call for a given node!
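For illustration, a minimal sketch of that lazy-caching pattern; the class and its `_resolve` helper are hypothetical stand-ins, not MetricFlow's actual `DataflowPlanNodeOutputDataSetResolver` internals:

```python
# Hypothetical sketch of the lazy caching described above; names other than
# get_output_data_set() are illustrative, not MetricFlow's real internals.
from typing import Any, Dict


class CachingOutputDataSetResolver:
    def __init__(self) -> None:
        # Populated lazily: one entry per node after its first resolution.
        self._cache: Dict[Any, Any] = {}

    def get_output_data_set(self, node: Any) -> Any:
        # The first call for a given node does the real work; every later
        # call is a plain dictionary lookup.
        if node not in self._cache:
            self._cache[node] = self._resolve(node)
        return self._cache[node]

    def _resolve(self, node: Any) -> Any:
        # Stand-in for walking the dataflow plan to compute this node's
        # output data set.
        raise NotImplementedError
```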
read_nodes: Sequence[BaseOutput],
node_output_resolver: DataflowPlanNodeOutputDataSetResolver,
Ok. I think it might be better to keep this totally separate from the new function and branch at the callsite for now.
There are only two callsites, one of which is in this class, and that way we aren't tempted to keep adding responsibility to the node output resolver.
What do you think? Otherwise, it might be worth marking the new method with a leading _ for the time being; I'm not sure we'll ever really want to call it on its own if we keep this method signature.
Not totally sure I understand what you mean. Is it this? We don't want to use the `node_output_resolver` here, so we should pass in the output datasets already resolved.
I'm thinking maybe it would be better to do this all on init for this class: iterate through all the datasets and store a dict of all `time_dimension_instances` to a set of supported granularities. Then it will be super easy to grab that at query time.
^ implemented it this way!
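For reference, a hypothetical sketch of that init-time mapping; the class name and the `instance_set` access pattern are assumptions for illustration, not the merged implementation:

```python
# Hypothetical sketch of building the mapping once on init; the attribute
# access pattern (data_set.instance_set.time_dimension_instances) is an
# assumption for illustration, not the merged code.
from collections import defaultdict
from typing import Any, DefaultDict, Sequence, Set


class TimeDimensionGranularityIndex:
    def __init__(self, output_data_sets: Sequence[Any]) -> None:
        # One pass over all datasets at construction time...
        self._granularities_by_name: DefaultDict[str, Set[Any]] = defaultdict(set)
        for data_set in output_data_sets:
            for instance in data_set.instance_set.time_dimension_instances:
                self._granularities_by_name[instance.spec.element_name].add(
                    instance.spec.time_granularity
                )

    def supported_granularities(self, element_name: str) -> Set[Any]:
        # ...so each query-time check is just a dict lookup.
        return self._granularities_by_name[element_name]
```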
Note that Databricks is failing due to connection errors unrelated to this PR.
Can `mf query` support this feature? `mf query --where "name==Gerald" --explain`, `mf --version`. How do I add a specific dimension or entity to a query?
@kingduxia use the `--group-by` parameter. Please use the community Slack for future questions.
Completes #SL-949
Description
Enable querying for dimensions without metrics.
If a user submits a query with dimensions and not metrics, we'll assume they want the distinct values for those dimensions.
This PR adds a new `DataFlowPlan` for distinct values that will allow those queries. The plan is essentially a simple `ReadSqlSourceNode` -> optional `JoinToBaseOutputNode` -> `FilterElementsNode`.
For simplicity's sake, I added an attribute to `FilterElementsNode` called `distinct` to indicate if we should group by the filtered elements (therefore getting the distinct values). I updated the costing for that node to reflect an aggregation in that case.
A lot of the other changes here are just renaming or restructuring things to not assume that metrics / measures are included in the query, and removing errors that enforce that requirement.
Note: something is up with the Databricks credentials, so I still need to fix that & update those snapshots.