
Querying dimensions without metrics #804

Merged — 15 commits, Oct 12, 2023
Conversation

@courtneyholcomb (Contributor) commented Oct 11, 2023

Completes #SL-949

Description

Enable querying for dimensions without metrics.
If a user submits a query with dimensions but no metrics, we'll assume they want the distinct values for those dimensions.
This PR adds a new DataFlowPlan for distinct values that will allow those queries. The plan is essentially a simple ReadSqlSourceNode -> optional JoinToBaseOutputNode -> FilterElementsNode.
For simplicity's sake, I added an attribute to FilterElementsNode called distinct to indicate whether we should group by the filtered elements (thereby getting the distinct values). I updated the costing for that node to reflect an aggregation in that case.
A lot of the other changes here are just renaming or restructuring things to not assume that metrics / measures are included in the query, and removing errors that enforce that requirement.
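To make the plan shape concrete, here is a minimal, hypothetical sketch of how a FilterElementsNode with a `distinct` flag could render to SQL — the class and field names below are illustrative only, not MetricFlow's actual API:

```python
# Illustrative sketch only: a simplified stand-in for the FilterElementsNode
# described in this PR. Real MetricFlow nodes have a different interface.
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class FilterElementsNode:
    """Filters a source down to the requested elements.

    When `distinct` is True, the rendered SQL groups by those elements,
    which deduplicates them — the distinct-values behavior this PR adds.
    """

    parent_table: str
    include_specs: Sequence[str]
    distinct: bool = False

    def render_sql(self) -> str:
        cols = ", ".join(self.include_specs)
        sql = f"SELECT {cols} FROM {self.parent_table}"
        if self.distinct:
            sql += f" GROUP BY {cols}"
        return sql


# A dimensions-only query sets distinct=True to deduplicate dimension values:
node = FilterElementsNode("bookings_source", ["booking__is_instant"], distinct=True)
print(node.render_sql())
# SELECT booking__is_instant FROM bookings_source GROUP BY booking__is_instant
```

The point of reusing FilterElementsNode rather than adding a new node type is that only the GROUP BY (and its costing) changes; the element-filtering behavior is identical.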

Note: something is up with the Databricks credentials, so I still need to fix that & update those snapshots.

@cla-bot cla-bot bot added the cla:yes label Oct 11, 2023
@courtneyholcomb courtneyholcomb marked this pull request as ready for review October 11, 2023 01:42
@tlento tlento temporarily deployed to DW_INTEGRATION_TESTS October 11, 2023 21:16 — with GitHub Actions Inactive
@tlento tlento had a problem deploying to DW_INTEGRATION_TESTS October 11, 2023 21:16 — with GitHub Actions Failure
@tlento (Contributor) left a comment:


Ok, let's go with this!

There are no blocking concerns, but I do have a lot of small quality considerations that we should address one way or another in the very near future.

Whatever you feel is worth fixing prior to merge, please go ahead and update. I'm happy to take another look if you'd like, but I don't need to.

After merge I'll go through and file issues for whatever follow-ups we still need to make.

Thank you!

Resolved review threads on: metricflow/dataflow/builder/dataflow_plan_builder.py (×2), metricflow/dataflow/builder/source_node.py, metricflow/dataflow/dataflow_plan.py
Comment on lines 586 to 587
read_nodes=self._read_nodes,
node_output_resolver=self._node_output_resolver,
Contributor:

Checkpoint — this must be changed in a later commit one way or the other.

Contributor Author:

What do you mean??

Resolved review thread on: metricflow/time/time_granularity_solver.py
Comment on lines 187 to 188
for read_node in read_nodes:
output_data_set = node_output_resolver.get_output_data_set(read_node)
Contributor:

This feels like potentially a lot of iterating through datasets to me since we'll call this function once for every partial time dimension spec, but I think it's fine for now.

Contributor Author:

Yeah, it would probably be better to create a dict of time dimension instances to their output datasets on init (or something like that) so we only need to do this loop once. Not sure how complicated that change would be; let me look into it and see if it's worth doing before merge.

Contributor Author:

FWIW, it turns out we are caching the output dataset so get_output_data_set() is just a dictionary lookup after the first call for a given node!
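The caching behavior described here can be sketched as follows — a hypothetical, simplified resolver, not MetricFlow's actual implementation — where the expensive dataset computation runs only on a cache miss:

```python
# Hypothetical sketch of the memoization discussed above; names are
# illustrative, not MetricFlow's real classes.
from typing import Dict


class NodeOutputResolver:
    def __init__(self) -> None:
        self._cache: Dict[str, str] = {}
        self.compute_count = 0  # tracks how often the expensive path runs

    def _compute_output_data_set(self, node_id: str) -> str:
        # Stands in for the expensive work of resolving a node's output dataset.
        self.compute_count += 1
        return f"dataset_for_{node_id}"

    def get_output_data_set(self, node_id: str) -> str:
        # After the first call for a given node, this is just a dict lookup.
        if node_id not in self._cache:
            self._cache[node_id] = self._compute_output_data_set(node_id)
        return self._cache[node_id]


resolver = NodeOutputResolver()
resolver.get_output_data_set("read_bookings")
resolver.get_output_data_set("read_bookings")  # served from cache
assert resolver.compute_count == 1
```

This is why calling the function once per partial time dimension spec is cheaper than it first appears: only the first call per node pays the resolution cost.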

Review thread on: metricflow/test/query/test_query_parser.py
Comment on lines 134 to 135
read_nodes: Sequence[BaseOutput],
node_output_resolver: DataflowPlanNodeOutputDataSetResolver,
Contributor:

Ok. I think it might be better to keep this totally separate from the new function and branch at the callsite for now.

There are only two callsites, one of which is in this class, and that way we aren't tempted to keep adding responsibility to the node output resolver.

What do you think? Otherwise it might be worth marking the new method with a leading _ for the time being; I'm not sure we'll ever really want to call it on its own if we keep this method signature.

Contributor Author (@courtneyholcomb), Oct 12, 2023:

Not totally sure I understand what you mean. Is this it: we don't want to use the node_output_resolver here, so we should pass in the output datasets already resolved?

I'm thinking maybe it would be better to do this all on init for this class. Iterate through all the datasets and store a dict of all time_dimension_instances to a set of supported granularities. Then it will be super easy to grab that at query time.

Contributor Author:

^ implemented it this way!
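The approach settled on above — walk every dataset once at init and store which granularities each time dimension supports, so query time is a plain lookup — could look roughly like this sketch (all names here are illustrative, not the real time_granularity_solver code):

```python
# Hedged sketch of precomputing time-dimension granularities on init.
# Class and field names are hypothetical stand-ins for MetricFlow's internals.
from collections import defaultdict
from typing import Dict, Set, Tuple


class TimeGranularitySolver:
    def __init__(self, datasets: Dict[str, Tuple[Tuple[str, str], ...]]) -> None:
        # `datasets` maps a read node's name to its (time_dimension, granularity)
        # pairs. We iterate all of them exactly once, at construction time.
        self._supported: Dict[str, Set[str]] = defaultdict(set)
        for pairs in datasets.values():
            for dim, granularity in pairs:
                self._supported[dim].add(granularity)

    def supported_granularities(self, time_dimension: str) -> Set[str]:
        # Query-time cost is a single dict lookup, not a dataset scan.
        return self._supported.get(time_dimension, set())


solver = TimeGranularitySolver({
    "bookings_source": (("metric_time", "day"), ("metric_time", "month")),
    "visits_source": (("metric_time", "day"),),
})
assert solver.supported_granularities("metric_time") == {"day", "month"}
```

Trading a one-time init scan for O(1) lookups per query is the design choice being made here.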

@courtneyholcomb courtneyholcomb temporarily deployed to DW_INTEGRATION_TESTS October 12, 2023 20:52 — with GitHub Actions Inactive
@courtneyholcomb (Contributor Author):

Note that Databricks is failing due to connection errors unrelated to this PR.

@courtneyholcomb courtneyholcomb merged commit ab0c7e0 into main Oct 12, 2023
20 of 22 checks passed
@courtneyholcomb courtneyholcomb deleted the court/dims-wo-metrics branch October 12, 2023 21:06
sarbmeetka pushed a commit to sarbmeetka/metricflow that referenced this pull request Nov 13, 2023
@kingduxia commented Jan 3, 2024

Does mf query support this feature?

mf query --where "name==Gerald" --explain
⠙ Initiating query…
ERROR: Unable To Satisfy Query Error: Recipe not found for linkable specs: LinkableSpecSet(dimension_specs=(), time_dimension_specs=(), entity_specs=())

mf --version
mf, version 0.203.1

How do I add a specific dimension or entity to the query?

@courtneyholcomb (Contributor Author):

@kingduxia use the --group_by flag.

Please use the community slack for future questions.
