Querying dimensions without metrics #804
Conversation
Ok, let's go with this!
There are no blocking concerns but I do have a lot of small quality considerations that we should address one way or another in the very near future.
Whatever you feel is worth fixing prior to merge, please go ahead and update. I'm happy to take another look if you'd like, but I don't need to.
After merge I'll go through and file issues for whatever follow-ups we still need to make.
Thank you!
metricflow/query/query_parser.py (Outdated)
read_nodes=self._read_nodes,
node_output_resolver=self._node_output_resolver,
Checkpoint: this must be changed in a later commit one way or the other.
What do you mean??
for read_node in read_nodes:
    output_data_set = node_output_resolver.get_output_data_set(read_node)
This feels like potentially a lot of iterating through datasets to me since we'll call this function once for every partial time dimension spec, but I think it's fine for now.
Yeah, it would probably be better to create a dict of time dimension instances to their output datasets on init (or something like that) so we only need to do this loop once. Not sure how complicated that change will be; let me look into it and see if it's worth doing before merge.
FWIW, it turns out we are caching the output dataset, so `get_output_data_set()` is just a dictionary lookup after the first call for a given node!
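For illustration, a minimal sketch of that lazy-caching pattern; the class and its `_resolve` helper are hypothetical stand-ins, not MetricFlow's actual `DataflowPlanNodeOutputDataSetResolver` internals:

```python
# Hypothetical sketch of the lazy caching described above; names other than
# get_output_data_set() are illustrative, not MetricFlow's real internals.
from typing import Any, Dict


class CachingOutputDataSetResolver:
    def __init__(self) -> None:
        # Populated lazily: one entry per node after its first resolution.
        self._cache: Dict[Any, Any] = {}

    def get_output_data_set(self, node: Any) -> Any:
        # The first call for a given node does the real work; every later
        # call is a plain dictionary lookup.
        if node not in self._cache:
            self._cache[node] = self._resolve(node)
        return self._cache[node]

    def _resolve(self, node: Any) -> Any:
        # Stand-in for walking the dataflow plan to compute this node's
        # output data set.
        raise NotImplementedError
```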
read_nodes: Sequence[BaseOutput],
node_output_resolver: DataflowPlanNodeOutputDataSetResolver,
Ok. I think it might be better to keep this totally separate from the new function and branch at the callsite for now.
There are only two callsites, one of which is in this class, and that way we aren't tempted to keep adding responsibility to the node output resolver.
What do you think? Otherwise, it might be worth marking the new method with a leading _ for the time being; I'm not sure we'll ever really want to call it on its own if we keep this method signature.
Not totally sure I understand what you mean. Is it this? We don't want to use the `node_output_resolver` here, so we should pass in the output datasets already resolved.
I'm thinking maybe it would be better to do this all on init for this class: iterate through all the datasets and store a dict of all `time_dimension_instances` to a set of supported granularities. Then it will be super easy to grab that at query time.
^ implemented it this way!
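For reference, a hypothetical sketch of that init-time mapping; the class name and the `instance_set` access pattern are assumptions for illustration, not the merged implementation:

```python
# Hypothetical sketch of building the mapping once on init; the attribute
# access pattern (data_set.instance_set.time_dimension_instances) is an
# assumption for illustration, not the merged code.
from collections import defaultdict
from typing import Any, DefaultDict, Sequence, Set


class TimeDimensionGranularityIndex:
    def __init__(self, output_data_sets: Sequence[Any]) -> None:
        # One pass over all datasets at construction time...
        self._granularities_by_name: DefaultDict[str, Set[Any]] = defaultdict(set)
        for data_set in output_data_sets:
            for instance in data_set.instance_set.time_dimension_instances:
                self._granularities_by_name[instance.spec.element_name].add(
                    instance.spec.time_granularity
                )

    def supported_granularities(self, element_name: str) -> Set[Any]:
        # ...so each query-time check is just a dict lookup.
        return self._granularities_by_name[element_name]
```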
Note that Databricks is failing due to connection errors unrelated to this PR.
Can `mf query` support this feature? `mf query --where "name==Gerald" --explain`, `mf --version`. How do I add a specific dimension or entity to a query?
@kingduxia use the `--group-by` parameter. Please use the community Slack for future questions.
Completes #SL-949
Description
Enable querying for dimensions without metrics.
If a user submits a query with dimensions and not metrics, we'll assume they want the distinct values for those dimensions.
This PR adds a new `DataFlowPlan` for distinct values that will allow those queries. The plan is essentially a simple `ReadSqlSourceNode` -> optional `JoinToBaseOutputNode` -> `FilterElementsNode`.
For simplicity's sake, I added an attribute to `FilterElementsNode` called `distinct` to indicate if we should group by the filtered elements (therefore getting the distinct values). I updated the costing for that node to reflect an aggregation in that case.
A lot of the other changes here are just renaming or restructuring things to not assume that metrics / measures are included in the query, and removing errors that enforce that requirement.
Note: something is up with the Databricks credentials, so I still need to fix that & update those snapshots.