-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend STLDecomposer to Support Multiseries #4253
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #4253 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 355 355
Lines 39155 39458 +303
=======================================
+ Hits 39035 39338 +303
Misses 120 120
☔ View full report in Codecov by Sentry. |
* Add unstacking function * Add stacking function * Add tests for both functions
* Squashed changes * Ignored index * Disabled column checking * Reverted deleted code * Updated pyproject.toml * Replaced version check code
1faf23d
to
4a8cc0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some smaller nits. Only reviewed implemetation so far but will review tests later.
evalml/pipelines/components/transformers/preprocessing/decomposer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making awesome progress! I left a few more comments, mostly just condensing code
evalml/pipelines/components/transformers/preprocessing/decomposer.py
Outdated
Show resolved
Hide resolved
series_y = y[id] | ||
|
||
# Determine the period of the seasonal component | ||
if id not in self.periods or self.period is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A very small use case, but users should be able to pass in self.periods
when initializing the decomposer, rather than necessarily defining it here (our detection is decent, but if a user knows what the periods should be, they should be able to pass that in for better results)
And unless I'm missing something, it doesn't look like we actually use self.period
anywhere any more, because even in the single series case, we're still indexing into self.periods
. Am I correct? If so, I think we should just swap out one for the other entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I should have taken self.period
out. I think I kept it to keep the functionality of set_period
the same, but is the function used in the PolynomialDecomposer
at all because I see that it is tested on both decomposers in test_decomposer_set_period
. Also would the periods
parameter be same for the single series case or would it be fine for period to be an integer for single series and then take in a dictionary for multiseries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set_period
is tested on both decomposers because it's implemented in the parent class, since we have to test its implementation on one subclass, why not test it on both? Feel free to bastardize whatever preexisting functions you need to - I could totally see the case for eliminating set_period
entirely and just calling _determine_periodicity
directly, or having set_period
calculate the periods for all series and save them in self.periods
instead of self.period
. It's up to you, IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to allow users to still pass in a period for both multivariate and univariate cases, I was thinking of keeping both self.period
and self.periods
as parameters, but that probably isn't the most efficient implementation so I'm open to other suggestions 😅
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_stl_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_stl_decomposer.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few final comments, but nothing blocking. Awesome work!
if self.period is None and len(y.columns) == 1: | ||
self.period = period | ||
self.update_parameters({"period": self.period}) | ||
elif self.period is not None and len(y.columns) == 1: | ||
period = self.period |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, it's ok if we call self.update_parameters
an extra time - we can collapse these into a single if len(y.columns)==1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since everything is accessing the period through self.periods
I'm thinking we might not even need to update self.period
at all, instead just use it to set period
if its given by a user
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/pipelines/components/transformers/preprocessing/stl_decomposer.py
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_decomposer.py
Outdated
Show resolved
Hide resolved
evalml/tests/component_tests/decomposer_tests/test_stl_decomposer.py
Outdated
Show resolved
Hide resolved
y = y.to_frame() | ||
series_results = {} | ||
# Iterate through each series id | ||
for id in y.columns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so late in the process to realize this, but do we actually need to be calling _decompose_target
in a loop? Since the decomposer handles multiseries generally, it should be able to handle multiseries here too, we can pass self.periods
through instead of period=period
, and switch around the logic to return the data in the format we need? That will prevent us from calling Decomposer.fit()
too many times, which can be slow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified this so that get_trend_dataframe
returns a dictionary for multiseries, but for single series it is still returning a list of dataframes. In the PolynomialDecomposer
, get_trend_dataframe
returns a list of dataframes. I think this is because there is some multivariate implementation written here. Since they're both using plot_decomposition
for single series, I wanted to keep the indexing the same so I left it as a list for STLDecomposer
for now (even though it can be modified later to just return a dataframe). This is pretty inconsistent so I wanted to see what others think about it and if I should consider changing the return type of get_trend_dataframe
for the multiseries and/or single series STLDecomposer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated get_trend_dataframe()
in the STLDecomposer
to return list(pd.DataFrame)
for single series and dict(list(pd.DataFrame))
for multiseries, but it should be updated in the future to no longer be in a list #4294
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM once the test case Becca mentioned is covered!
Resolves #4244