Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revised EnsembleSummaryProvider interface #950

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

sigurdp
Copy link
Collaborator

@sigurdp sigurdp commented Feb 7, 2022

Refined EnsembleSummaryProvider interface to support finer control of which range of dates get returned when using lazy resampling and requesting data for multiple realizations.

Previously, when using lazy resampling, and requesting vector data for multiple realizations through EnsembleSummaryProvider.get_vectors_df(), each returned realization would contain the smalles range of dates possible that would still cover the raw data's dates.. This PR makes it possible to force the date range of all the returned realizations to be the same - either by truncating or extending the range of dates so that all realizations have an equal date range and contain the same dates.

The new behavior can be controlled through the common_date_span property of the ResamplingOptions options object that is now accepted by EnsembleSummaryProvider.get_vectors_df(). Please see documentation for ResamplingOptions for more details.

The EnsembleSummaryProvider.dates() function has also been extended with an additional argument, date_span, that controls wheter the union or intersection of dates over realizations is returned.

Note that this is a breaking change in the EnsembleSummaryProvider interface that affects the following member functions:

  • EnsembleSummaryProvider.dates()
  • EnsembleSummaryProvider.get_vectors_df()

As an example of code changes needed in the client code, consider the case where we are requesting lazy resampling of two vectors, for all realizations, with a MONTHLY resampling frequency. Currently this would be accomplished like this:

vecdf = myprovider.get_vectors_df(
    vector_names=["VECA", "VECB"],
    resampling_frequency=Frequency.MONTHLY,
    realizations=None,
)

With the revised interface, the client code would have to be changed to something like this:

vecdf = myprovider.get_vectors_df(
    vector_names=["VECA", "VECB"],
    resampling_options=ResamplingOptions(frequency=Frequency.MONTHLY, common_date_span=None),
    realizations=None,
)

Contributor checklist

  • 🤖 I have added tests, or extended existing tests, to cover any new features or bugs fixed in this PR.
  • 📖 I have considered adding a new entry in CHANGELOG.md, and added it if should be communicated there.

@sigurdp sigurdp marked this pull request as ready for review March 6, 2022 22:11
@sigurdp sigurdp requested review from asnyv and anders-kiaer March 15, 2022 08:25
@anders-kiaer anders-kiaer added the next release 🚢 To be included in next release label Mar 24, 2022
Copy link
Collaborator

@anders-kiaer anders-kiaer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ the tests! 🤖 Great code style as well - on a difficult domain technical issue. 👏

I'm not able to find much to comment on, but two questions:

  1. The pattern
    ResamplingOptions(resampling_frequency, date_span)
    if resampling_frequency
    else None
    appears to repeat itself somewhat. Should we consider adjusting ResamplingOptions to have type frequency: Optional[Frequency] such that a frequency = None can be passed directly, similar to common_date_span also being Optional (i.e. support None)? Or maybe this change is not relevant anymore after the second question:
  1. I'm not 100% sure that I remember correctly our conclusion from February, but what I think I remember is that expected logic behind the scenes (as end user), wrt. how date is handled (intersection, union or raw), is along the lines of:
    • When only individual realizations are plotted, we don't need to care about union/intersection of dates. Each realization can if we want be plotted with its own date span.
    • When any kind of statistics is calculated, we need to either take union or intersection across the dates (to get a common date axis). For vectors that are of type cumulative we can take the union (since these have a well defined extrapolation). For all other types of vectors we take the intersection (since these are unsafe to extrapolate - we could potentially also extrapolate rate vectors by assuming they are zero outside data span, but I'm not sure if we today have the metadata necessary to differentiate between rate vectors and e.g. pressure vectors? @asnyv?).

If this is the logic we want, resampling frequency is (still) a general property across all vectors accessed, but the date span (union/intersection) depends on vector type. 🤔

Do I recall incorrectly, or has something else popped up during development that forces us to reconsider this logic? Or maybe something along these lines is already implemented (and I'm just reading the code incorrectly). 😆 🙂

@@ -60,7 +64,14 @@ def create_history_vectors_df(
return pd.DataFrame()

historical_vector_names = list(historical_vector_and_vector_name_dict.keys())

# NOTE: DateSpane = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# NOTE: DateSpane = None
# NOTE: DateSpan = None

resampling_options if self._provider.supports_resampling() else None
)

# Make date span union if calculating data relative to date
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we expand on this comment? Not immediately clear to me why data relative to date → we should take date unione.

@asnyv
Copy link
Collaborator

asnyv commented Mar 25, 2022

When any kind of statistics is calculated, we need to either take union or intersection across the dates (to get a common date axis). For vectors that are of type cumulative we can take the union (since these have a well defined extrapolation). For all other types of vectors we take the intersection (since these are unsafe to extrapolate - we could potentially also extrapolate rate vectors by assuming they are zero outside data span, but I'm not sure if we today have the metadata necessary to differentiate between rate vectors and e.g. pressure vectors? @asnyv?).

@anders-kiaer We don't have the metadata, but I think maybe a varying date span between statistics of different vectors could be a bit confusing for users, and that we should consider to go for the union for all vectors?

@anders-kiaer anders-kiaer removed the next release 🚢 To be included in next release label Mar 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: On Hold
Development

Successfully merging this pull request may close these issues.

4 participants