Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call _validate_steps in test_validate_steps #1307

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

danilobellini
Copy link

This is a required step as Pipeline._validate_steps method is not called by scikit-learn during construction, but it's expected by tests. I tried running it with scikit-learn v1.2.2 and scikit-learn v1.3.0. I found that in commit 0110921 for v1.1.0 that validation was removed upstream from Pipeline.__init__.

I'm currently the maintainer of the yellowbrick's AUR package and I'm having some trouble to update the package to v1.5.0 because of tests. The mentioned test is one of the failing tests, I brought it here in this pull request because it was simple to fix.

I've called black and pytest for the resulting file.

This is a required step as that method is not called by scikit-learn
during pipeline construction
@lwgray
Copy link
Contributor

lwgray commented Jul 20, 2023

@bbengfort Will we need to upgrade to scikit v1.3.0. I worry about calling ._validate_steps()

The _validate_steps method in a scikit-learn Pipeline is a private method used to check whether the steps of the pipeline are defined correctly. In the pipeline, the steps should be structured such that all steps up to the final one should be transformers (i.e., they should have a fit and transform method), and the final step should be an estimator (i.e., it should have a fit method).

Calling _validate_steps() explicitly in your test cases will make sure that this validation is performed at the moment you define the pipeline, rather than later when you try to fit or transform data with the pipeline.

In @danilobellini 's code, adding _validate_steps() after the Pipeline or VisualPipeline instantiation will cause the validation to happen immediately. This means that if there's a problem with the steps (e.g., a non-transformer object in an intermediate step, or a non-estimator object as the final step), a TypeError will be raised immediately, rather than later on when you try to use the pipeline.

This could make the tests clearer and more direct, as he is specifically testing the validation of the pipeline steps, and it's useful to have that validation happen as explicitly and immediately as possible. However, I'm aware that _validate_steps is a private method (indicated by the leading underscore), which means that it's not part of the public API of the Pipeline class and could potentially change in future versions of scikit-learn. Using private methods can sometimes lead to less stable code, as they're not guaranteed to stay the same in the way that public methods are.

@lwgray lwgray requested a review from bbengfort July 29, 2023 12:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants