Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Saved Queries #794

Merged
merged 3 commits into from
Oct 24, 2023
Merged

Support Saved Queries #794

merged 3 commits into from
Oct 24, 2023

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Oct 5, 2023

Resolves #765

Description

Please see the linked issue for more details. As this PR depends on dbt-semantic-interfaces~=0.3.0a2 which includes the interface definition for saved queries, and it's currently not possible to specify that dependency due to a conflict with dbt-core~=1.6.0 as required by this package. Once dbt-core is updated, this PR will be updated so that CI checks can pass. Locally, a custom pyproject.yaml was used to test this feature.

@plypaul plypaul requested a review from tlento October 5, 2023 05:39
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a reasonable first cut.

How were you planning to address the parameterizable where filter? What we really want out of a saved query is something where the filter is more like:

saved_queries:
  - name: date_filter_example
    filter: metric_time BETWEEN <start_date> AND <end_date>

Then on invocation we want people to be able to do something like:

mf query --saved-query date_filter_example --params "start_date='2021-01-01',end_date='2021-01-05'"

@@ -46,7 +46,8 @@ def query_options(function: Callable) -> Callable:
)(function)
function = click.option(
"--metrics",
type=click_custom.SequenceParamType(min_length=1),
# Validity checks for this parameter was moved to the MetricFlowEngine.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for now, this is better:

Suggested change
# Validity checks for this parameter was moved to the MetricFlowEngine.
# Validation is handled in the MetricFlowEngine

Longer term I'd like to build out a formal API that manages all of this stuff and just make the CLI interface an example caller of that API. The MetricFlowEngine is a little too much of a mixture between internal classes and public-facing things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -84,6 +85,8 @@ class MetricFlowQueryType(Enum):
class MetricFlowQueryRequest:
"""Encapsulates the parameters for a metric query.

TODO: This has turned into a bag of parameters that make it difficult to use without a bunch of conditionals.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course it has, that's what always happens! Where's the lolsob emoji when I need it....

Comment on lines -133 to -139
assert_exactly_one_arg_set(metric_names=metric_names, metrics=metrics)
assert not (
group_by_names and group_by
), "Both group_by_names and group_by were set, but if a group by is specified you should only use one of these!"
assert not (
order_by_names and order_by
), "Both order_by_names and order_by were set, but if an order by is specified you should only use one of these!"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing these? Is this a redundant validation? If it isn't we should leave them in, because we're in this ugly transitional state where we accept both type inputs via different parameters instead of either merging to a Union or pushing that to an outer interface where everything can be normalized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's a redundant validation that's handled in MetricFlowQueryParser.

"""
saved_query = self._get_saved_query(saved_query_parameter)

# This logic could be encapsulated in the WhereFilter through a merge interface.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and even more so now that the object is going to hold a list of WhereFilters.

I haven't added it yet, but it might be something we should add in to the base protocol. If we don't think that's worth doing we can re-type everything internally to take an MFWhereFilter that just extends the WhereFilter protocol with merge or combine or whatever.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could add Mergeable to the base class, but we can visit that once Mergeable is in.

Comment on lines 206 to 209
where_conditions_with_parenthesis = tuple(f"({where_condition})" for where_condition in where_conditions)
combined_where_filter = PydanticWhereFilter(
where_sql_template=" AND ".join(where_conditions_with_parenthesis)
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eeeewwwwww......

I was thinking about eventually moving this to the renderers, since we won't always render the combined filter. The container class I created should help with this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The container class helped, but I didn't see a place to render it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking eventually we'd just use the container and pass it all the way through but maybe that's not the right way to do it. What you did in #809 seems reasonable for now.

handles operations related to the object-builder naming scheme.

Additional issues:
* The call parameter sets in DSI does not support date part.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also not support date part in saved queries, it's a partially supported feature built solely for use in interactive sessions through the Tableau connector. As we get more experience with how the date_part mechanics are used in practice we'll push more robust support down to MetricFlow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine by me - updated comment.

from metricflow.specs.query_param_implementations import DimensionOrEntityParameter, TimeDimensionParameter


def parse_object_builder_naming_scheme(group_by_item_name: str) -> GroupByParameter:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how much of this still applies and what's been improved. Agreed we should check in with @DevonFulcher, probably worth doing that before merge since you'll need to update some stuff due to me mucking about with WhereFilter interfaces in dbt-semantic-interfaces.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, sorry I am just catching up here. In general, I agree that query interface objects should share more code. Now that they all share an interface, this should be easier. I'm happy to chat about that more about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chatted with @DevonFulcher - using this for now until an improved setup is ready.

"""
try:
call_parameter_sets = PydanticWhereFilter(
where_sql_template="{{ " + group_by_item_name + " }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for group_by_item_name to contain malicious code? We should consider validating the string before calling this constructor. That is probably a concern of MFS though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mainly for the CLI, and verified with @DevonFulcher that it's handled on the MFS side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should all go away when we finish cleaning up the parameter parsing stuff, right?

@plypaul plypaul force-pushed the plypaul--57--saved-query4 branch from 402e286 to a800cb9 Compare October 12, 2023 21:43
@cla-bot cla-bot bot added the cla:yes label Oct 12, 2023
@plypaul plypaul changed the base branch from main to plypaul--59--update-dsi October 12, 2023 21:44
@plypaul plypaul force-pushed the plypaul--57--saved-query4 branch from a800cb9 to 56eaf0c Compare October 12, 2023 22:07
@plypaul plypaul marked this pull request as ready for review October 12, 2023 22:10
@plypaul plypaul force-pushed the plypaul--57--saved-query4 branch from 56eaf0c to 1361f70 Compare October 12, 2023 22:15
Base automatically changed from plypaul--59--update-dsi to main October 13, 2023 00:27
@plypaul
Copy link
Contributor Author

plypaul commented Oct 14, 2023

How were you planning to address the parameterizable where filter?

In this initial version, there is support for additional where filters when running a saved query, so users would have to pass in an appropriately constructed where filter to do something like - run the saved query for a different set of dates. We haven't discussed nor designed the parameterization interface, so that will have to be handled later.

@plypaul plypaul requested a review from tlento October 16, 2023 17:58
metrics:
- bookings
- instant_bookings
group_bys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plypaul should this be group_by?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a spec change? This is following #765

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a typo in the spec. We use group_by everywhere else, so it should be group_by for saved queries.

We used to call them group_bys but it was horribly confusing for everybody so we switched to group_by.

group_bys:
- TimeDimension('metric_time', 'day')
- Dimension('listing__capacity_latest')
where:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plypaul I think this should be filter to keep it consistent with metrics.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jstein77 we decided that where was what we'd use for saved queries because they're queries, and where matches the query interfaces for the CLI, JDBC and GraphQL.

filter is what we use for metrics and other non-query objects in the spec.

I'd prefer to pick one and use it everywhere, but alas that ship has sailed. Or we can go with filter here and commit to changing the other query interfaces.

One argument in favor of where - we may need to introduce a post-aggregation filter (a having input) for query time specification and that can be a natural distinction for people accustomed to SQL. That said, I don't like calling anything having because it's a goofy name for a filter expression.

Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think this is good to merge at this point, but it seems like we need a dbt-semantic-interfaces 0.3.1 bugfix to roll out for the group_by thing.

@@ -46,7 +46,8 @@ def query_options(function: Callable) -> Callable:
)(function)
function = click.option(
"--metrics",
type=click_custom.SequenceParamType(min_length=1),
# Validation is handled in the MetricFlowEngine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need this comment, especially since we're removing this requirement anyway.

"""
try:
call_parameter_sets = PydanticWhereFilter(
where_sql_template="{{ " + group_by_item_name + " }}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this should all go away when we finish cleaning up the parameter parsing stuff, right?

metrics:
- bookings
- instant_bookings
group_bys:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a typo in the spec. We use group_by everywhere else, so it should be group_by for saved queries.

We used to call them group_bys but it was horribly confusing for everybody so we switched to group_by.

group_bys:
- TimeDimension('metric_time', 'day')
- Dimension('listing__capacity_latest')
where:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jstein77 we decided that where was what we'd use for saved queries because they're queries, and where matches the query interfaces for the CLI, JDBC and GraphQL.

filter is what we use for metrics and other non-query objects in the spec.

I'd prefer to pick one and use it everywhere, but alas that ship has sailed. Or we can go with filter here and commit to changing the other query interfaces.

One argument in favor of where - we may need to introduce a post-aggregation filter (a having input) for query time specification and that can be a natural distinction for people accustomed to SQL. That said, I don't like calling anything having because it's a goofy name for a filter expression.

@plypaul plypaul force-pushed the plypaul--57--saved-query4 branch from 1361f70 to 9daf141 Compare October 24, 2023 05:41
@plypaul plypaul force-pushed the plypaul--57--saved-query4 branch from 9daf141 to 41679b9 Compare October 24, 2023 05:48
@plypaul plypaul merged commit e7ccd73 into main Oct 24, 2023
6 checks passed
@plypaul plypaul deleted the plypaul--57--saved-query4 branch October 24, 2023 05:56
@tlento
Copy link
Contributor

tlento commented Oct 25, 2023

This broke the engine tests. Also Redshift is no longer connecting, so I can't just fix them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[SL-915] [Feature] Support Saved Queries
4 participants