
Pipelines can't be nested as any other learner #199

Open
peguerosdc opened this issue Jun 9, 2022 · 1 comment
Labels
bug Something isn't working

Comments

peguerosdc commented Jun 9, 2022

Code sample

This script is complete, it should run "as is"

from fklearn.training.regression import linear_regression_learner
from fklearn.training.pipeline import build_pipeline
from fklearn.training.imputation import imputer

# This example will try to create a linear regressor...
def load_sample_data():
    import pandas as pd
    from sklearn import datasets
    x, y = datasets.load_diabetes(as_frame=True, return_X_y=True)
    return pd.concat([x, y], axis=1)

# ...but pipelines can't be used as individual learners
nested_learner = build_pipeline(
    build_pipeline(
        imputer(columns_to_impute=["age"], impute_strategy='median'),
    ),
    linear_regression_learner(features=["age"], target=["target"])
)
fn, df_result, logs = nested_learner(load_sample_data())
'''
Output:
ValueError: Predict function of learner pipeline contains variable length arguments: kwargs
Make sure no predict function uses arguments like *args or **kwargs.
'''
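For context, the error message suggests the failure comes from signature introspection on each step's predict function: a pipeline's own predict function accepts `**kwargs`, which the nesting check rejects. A minimal sketch of such a check (the helper name `has_var_keyword` is hypothetical, not fklearn's actual code):

```python
import inspect

def has_var_keyword(fn):
    """Return True if fn accepts **kwargs (variable keyword arguments)."""
    return any(
        p.kind == inspect.Parameter.VAR_KEYWORD
        for p in inspect.signature(fn).parameters.values()
    )

def plain_predict(df, prediction_column="prediction"):
    # An ordinary learner's predict function has a fixed signature.
    return df

def pipeline_predict(df, **kwargs):
    # A pipeline's predict function forwards kwargs to its steps,
    # which is exactly what trips the check when pipelines are nested.
    return df

print(has_var_keyword(plain_predict))     # False
print(has_var_keyword(pipeline_predict))  # True
```

This explains why an inner pipeline is rejected where a plain learner is accepted: the outer pipeline sees `**kwargs` in the inner pipeline's predict function and raises.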

Problem description

This is a problem because, per the docs, "pipelines (should) behave exactly as individual learner functions". That is, pipelines should respect the L in SOLID (the Liskov substitution principle), but that is not happening.

The main benefit of supporting nested pipelines is more maintainable code: packing complex operations into one big step like

def custom_learner(feature: str, target: str):
    return build_pipeline(
        imputer(columns_to_impute=[feature], impute_strategy='median'),
        linear_regression_learner(features=[feature], target=[target])
    )

is cleaner and more readable.

Expected behavior

The "Code sample" should produce the same output as if nested_learner was defined as:

nested_learner = build_pipeline(
    imputer(columns_to_impute=["age"], impute_strategy='median'),
    linear_regression_learner(features=["age"], target=["target"])
)

That is, if a pipeline is a type of learner, it should be possible to put it in place of any other learner.

Possible solutions

A workaround is proposed in #145, but it only works if you already have a DataFrame (which is not the case in this scenario). I'd like to hear if someone has investigated this or knows what we need to change to support it; in the meantime, I propose adding a note to the docs to prevent someone else from trying to do this.

@peguerosdc peguerosdc added the bug Something isn't working label Jun 9, 2022
jmoralez commented Jul 8, 2022

Hi. I think this would also be useful when you want to split data preprocessing from the model. From what I understand, the problem is that the pipeline's predict_fn takes **kwargs, determines which step each kwarg belongs to, and forwards it to the corresponding function. So with two nested pipelines, it becomes hard to determine which pipeline a given kwarg belongs to.

One possible solution I see to this is taking the approach that sklearn takes when passing parameters to the estimators in a pipeline:

Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p. source

So I think it would be possible to have nested pipelines and then call the predict_fn with kwargs using this kind of prefixing, i.e. if we have the following:

pipe1 = build_pipeline(f(a=1), g(a=2))
pipe2 = build_pipeline(f(a=2), g(a=3))
pipe3 = build_pipeline(pipe1, pipe2)
predict_fn, data, logs = pipe3(df)
# the following would set the argument a to 3 in the f function of the first pipeline
predict_fn(df2, pipeline0__f__a=3)
# the following would set the argument a to 2 in the g function of the second pipeline
predict_fn(df2, pipeline1__g__a=2)

We would of course have the possibility of setting custom names for each pipeline and function, so the argument could look like preprocessing__scaler__column='something'.
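The prefix routing described above could be parsed with a simple split on the `__` separator, grouping kwargs by step and then by function. A minimal sketch (the function name `route_kwargs` is hypothetical, not part of fklearn):

```python
def route_kwargs(kwargs):
    """Group sklearn-style prefixed kwargs (step__fn__param) by step and function."""
    routed = {}
    for key, value in kwargs.items():
        # Split on the first two "__" separators only, so parameter
        # names containing underscores are preserved intact.
        step, fn, param = key.split("__", 2)
        routed.setdefault(step, {}).setdefault(fn, {})[param] = value
    return routed

print(route_kwargs({"pipeline0__f__a": 3, "pipeline1__g__a": 2}))
# {'pipeline0': {'f': {'a': 3}}, 'pipeline1': {'g': {'a': 2}}}
```

With this grouping, the outer pipeline would only need to strip its own prefix and pass the remaining nested dict down to the matching inner pipeline, which resolves the ambiguity of plain **kwargs.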

WDYT?
