Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Mergeable to Simplify Merging Collections #790

Merged
merged 3 commits into from
Nov 16, 2023

Conversation

plypaul
Copy link
Contributor

@plypaul plypaul commented Sep 28, 2023

Description

This PR adds the utility class MergeBuilder that allows you to combine collections that implements Mergeable. This is useful as there are a few spec-related classes that are collections, and there are upcoming cases where they need to be merged. This replaces the collection of merge-related methods in the spec set classes.

This PR also includes a minor update to the signature of InstanceSpec to make it more flexible.

@cla-bot cla-bot bot added the cla:yes label Sep 28, 2023
@github-actions
Copy link

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@plypaul plypaul changed the title Add Mergeable / MergeBuilder. Add Mergeable / MergeBuilder to Simplify Merging Collections Sep 28, 2023
@plypaul plypaul marked this pull request as ready for review September 28, 2023 18:32
@plypaul plypaul requested a review from tlento September 28, 2023 18:32
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a vacuum where all implementation choices are reasonable and correct this all seems fine. However, we don't operate in such a vacuum, and there are a lot of "but WHY is it like this?" implementation choices in here. Every single one of them is preceded by a #noqa: D override on docstrings, and every commit message is completely devoid of the crucial "why" for each change.

Please write docstrings explaining the context behind these decisions, and then I'll be able to evaluate this better. I'm sure it's fine, but I also know that I'm going to screw up using this and when I go through the code (and commit history) I'll be quite irritated to find that nothing has been explained anywhere.

from abc import ABC, abstractmethod
from typing import Generic, Iterable, TypeVar

SelfTypeT = TypeVar("SelfTypeT", bound="Mergeable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from typing_extensions import Self is a thing. Let's use that instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't realize that was a thing - updated.

self._accumulator = initial_item

def add(self, other: MergeableT) -> None: # noqa: D
if self._accumulator is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this? Can the generic actually be Optional[MergeableThing]? I don't think it can, with a bound of "Mergeable" here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into some issues there, but this is no longer needed as this class was removed.


self._accumulator = self._accumulator.merge(other)

def add_all(self, *others: MergeableT) -> None: # noqa: D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I realize this is my not-Python developer side showing, but I prefer others: Sequence[...] to *others.

Given add_iterable is a thing, do we really need this method? If they can't be combined, some docstrings with explanations providing that context would be nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer relevant as this class was removed.

return builder.build_result

@property
def build_result(self) -> MergeableT: # noqa: D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds like an action rather than a property. Maybe merged_result or merged_output would be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer relevant as this class was removed.



def test_merge_iterable() -> None: # noqa: D
assert MergeBuilder.merge_iterable(NumberList(), [NumberList([1, 2, 3])]) == NumberList([1, 2, 3])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want a test that merges a non-empty thing as well, because return other_iterable.next() passes this test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test with a non-empty item, but not sure what you mean by return other_iterable.next()

entity_specs=self.entity_specs + other.entity_specs,
)

def dedupe(self) -> LinkableSpecSet: # noqa: D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we separating these? Is there ever a case where we want duplicates in this output?

If not, this feels like we're adding an old bug factory back to the codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Preferring to make dedupe() a separate method as it makes the Mergeable interface more flexible. I disagree on adding an old bug factory back to the codebase. If there is always de-duplication behavior, it could hide bugs in testing. e.g. the output of a merged result is checked in a test, and it's not expected for the sources of the to produce duplicate items. However, the de-duplication of the merged result hides that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first two words (if not) in that bug factory sentence are extremely important. If we never want duplicates to come out of a merge, then any case where we have them is a bug, and requiring developers to manually call a dedupe method can only lead to bugs, while encapsulating that behavior can only prevent them. If there is an upstream bug that the encapsulation hides that's a separate problem, and it's one that having developers blindly call dedupe() on their merge outputs (which is a thing we should expect to happen if our contributor base grows) will also mask.

As it stands right now every merge call either explicitly requests deduped results or implicitly resolves to them, which suggests that we shouldn't force developers to call it in order to get the behavior they need.

Ultimately, this isn't a big deal, but it's interesting to me that the only extant cases where we don't call dedupe() are still cases where we need deduplicated results.

@@ -594,6 +594,37 @@ def linkable_specs(self) -> LinkableSpecSet: # noqa: D
entity_specs=self.entity_specs,
)

@override
def merge(self, other: MetricFlowQuerySpec) -> MetricFlowQuerySpec: # noqa: D
# Can't merge if both properties are set.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, why can we merge when these don't match at all?

Can you please add a docstring explaining this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context on the original reason for this change was that queries for derived metrics could be thought of as a composition of queries on separate metrics during query resolution. However, this is no longer needed due to a different approach that was taken in a later revision.

Base automatically changed from plypaul--58.1--singular-spec-field-for-order-by to main October 5, 2023 20:15
@plypaul plypaul force-pushed the plypaul--58.2--add-mergable branch from 26d1c26 to 7352a38 Compare November 14, 2023 20:19
@plypaul plypaul changed the title Add Mergeable / MergeBuilder to Simplify Merging Collections Add Mergeable to Simplify Merging Collections Nov 14, 2023
@plypaul plypaul force-pushed the plypaul--58.2--add-mergable branch from 7352a38 to 2799e2f Compare November 14, 2023 20:23
Copy link
Contributor Author

@plypaul plypaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the commit messages and docstrings, and found a fix for type issues to allow removal of the MergeBuilder and just have the Mergeable interface.

@plypaul plypaul force-pushed the plypaul--58.2--add-mergable branch from 2799e2f to 809c5c6 Compare November 15, 2023 02:11
@plypaul plypaul requested a review from tlento November 15, 2023 05:57
Copy link
Contributor

@tlento tlento left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart from that last commit and the hanging If in the docstring.

You built the thing so you get to choose whether to opt for more flexibility here.

entity_specs=self.entity_specs + other.entity_specs,
)

def dedupe(self) -> LinkableSpecSet: # noqa: D
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first two words (if not) in that bug factory sentence are extremely important. If we never want duplicates to come out of a merge, then any case where we have them is a bug, and requiring developers to manually call a dedupe method can only lead to bugs, while encapsulating that behavior can only prevent them. If there is an upstream bug that the encapsulation hides that's a separate problem, and it's one that having developers blindly call dedupe() on their merge outputs (which is a thing we should expect to happen if our contributor base grows) will also mask.

As it stands right now every merge call either explicitly requests deduped results or implicitly resolves to them, which suggests that we shouldn't force developers to call it in order to get the behavior they need.

Ultimately, this isn't a big deal, but it's interesting to me that the only extant cases where we don't call dedupe() are still cases where we need deduplicated results.

mypy.ini Outdated
Comment on lines 59 to 63

# Added to resolve an issue with mypy in CI:
# ... python3.8/site-packages/referencing/jsonschema.py:426: error: Invalid "type: ignore" comment [syntax]
[mypy-referencing.jsonschema]
follow_imports = skip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was fixed by updating mypy, please revert this commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

def merge_iterable(cls: Type[MergeableT], items: Iterable[MergeableT]) -> MergeableT:
"""Merge all items into a single instance.

If
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If what? 😛

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Merging objects are frequently needed in MetricFlow as there are several
recursive operations where the output is the superset of the result of the
recursive calls.

Please see the docstring for Mergeable.
InstanceSpecSet already has a merge() call, so this updates the class to
imlement the Mergeable interface for consistency.
LinkableSpecSet already has a merge() call, so this updates the class to
imlement the Mergeable interface for consistency.
@plypaul plypaul force-pushed the plypaul--58.2--add-mergable branch from 809c5c6 to 086cee4 Compare November 16, 2023 03:53
@plypaul plypaul merged commit 50cd52d into main Nov 16, 2023
9 checks passed
@plypaul plypaul deleted the plypaul--58.2--add-mergable branch November 16, 2023 04:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants