Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] ENH: Resample additional arrays apart from X and y #463

Open
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

glemaitre
Copy link
Member

@glemaitre glemaitre commented Aug 27, 2018

Implement the last point of #462 and should be merged after it.
Partially addressing #460

@glemaitre glemaitre changed the title EHN: resample additional arrays apart from X and y [WIP] EHN: resample additional arrays apart from X and y Aug 27, 2018
@pep8speaks
Copy link

pep8speaks commented Aug 27, 2018

Hello @glemaitre! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 29, 2018 at 09:59 Hours UTC

@glemaitre
Copy link
Member Author

Sampling extra arrays could be easy in the case of prototype selection.
One can just peak the weight of the selected samples.

However, what would be a good and meaningful default when new sample weight need to be created.

Right now, I created an *arrays sequence but we might interested to only limit to sample_weight since the creation of new instances would make sense only if we know what are we creating. Up-sampling in arrays that we don't know anything about it could be weird.

@glemaitre
Copy link
Member Author

Ups I forgot to ping @jnothman in my last comment

@codecov
Copy link

codecov bot commented Aug 29, 2018

Codecov Report

Merging #463 into master will decrease coverage by 0.35%.
The diff coverage is 94.97%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #463      +/-   ##
==========================================
- Coverage   98.92%   98.57%   -0.36%     
==========================================
  Files          85       75      -10     
  Lines        5324     4633     -691     
==========================================
- Hits         5267     4567     -700     
- Misses         57       66       +9
Impacted Files Coverage Δ
imblearn/pipeline.py 97.07% <ø> (+2.1%) ⬆️
imblearn/utils/estimator_checks.py 96.62% <100%> (-0.46%) ⬇️
imblearn/ensemble/_balance_cascade.py 100% <100%> (ø) ⬆️
...ling/_prototype_selection/_random_under_sampler.py 100% <100%> (ø) ⬆️
...nder_sampling/_prototype_selection/_tomek_links.py 100% <100%> (ø) ⬆️
...rototype_selection/_condensed_nearest_neighbour.py 100% <100%> (ø) ⬆️
imblearn/over_sampling/_random_over_sampler.py 100% <100%> (ø) ⬆️
imblearn/combine/_smote_tomek.py 100% <100%> (ø) ⬆️
...rototype_selection/_neighbourhood_cleaning_rule.py 100% <100%> (ø) ⬆️
imblearn/combine/_smote_enn.py 100% <100%> (ø) ⬆️
... and 73 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ffdde80...7a8fad0. Read the comment docs.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I don't think returning non-X,y will work with the current handling of Pipeline.fit's kwargs. We really need sample prop routing to handle that case. Currently, the handling would be unambiguous if one of:

  • the resampler is the last step, in which case we return any additional sample props like weights
  • the resampler is the second-last step, and there is no fit_param called last_step_name__sample_weight, in which case we pass all sample props into the last step's fit, I think.

Otherwise, it's unclear where to pass the returned sample_weight given that **fit_params intends to prescribe this at the time Pipeline.fit is called.

The corresponding label of `X_resampled`.

sample_weight_resampled : ndarray, shape (n_samples_new,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather have a dict of non-X,y returned. (Optionally? In scikit-learn I would rather this be mandatory so we don't need to handle both cases.)

``sample_weight`` was not ``None``.

idx_resampled : ndarray, shape (n_samples_new,)
Indices of the selected features. This output is optional and only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the selected samples?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Resampled sample weights. This output is returned only if
``sample_weight`` was not ``None``.

idx_resampled : ndarray, shape (n_samples_new,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain why this should be returned from fit_resample, rather than stored as an attribute?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it was some original design (before it was in scikit-learn). But actually it would be better to keep it as an attribute with the single fit_resample.

@glemaitre
Copy link
Member Author

Otherwise, it's unclear where to pass the returned sample_weight given that **fit_params intends to prescribe this at the time Pipeline.fit is called.

If I understand well and from what I could see, Pipeline does not support sample_weight right now. But in the meanwhile, do you recommend to add a fit_resample(X, y, **sample_props) signature and return a dict sample_props.

@jnothman
Copy link
Member

jnothman commented Sep 3, 2018

You're right Pipeline does not really support sample_weight now... I think supporting returning it from Pipeline.fit_resample makes sense.

@glemaitre glemaitre force-pushed the master branch 2 times, most recently from bbf2b12 to 513203c Compare September 7, 2018 13:26
@chkoar chkoar changed the title [WIP] EHN: resample additional arrays apart from X and y [WIP] ENH: Resample additional arrays apart from X and y Jun 12, 2019
@glemaitre glemaitre force-pushed the master branch 2 times, most recently from f1bc189 to 8f87307 Compare June 28, 2019 12:32
@chkoar
Copy link
Member

chkoar commented Jul 29, 2020

@glemaitre #460 is closed but #457 is still open and probably relevant. Could we close this PR in favor of new one in the future? It is two years old.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants