
Pipelines can't be nested as any other learner #199

Open
peguerosdc opened this issue Jun 9, 2022 · 1 comment
Labels
bug Something isn't working

Comments

peguerosdc commented Jun 9, 2022

Code sample

This script is complete, it should run "as is"

from fklearn.training.regression import linear_regression_learner
from fklearn.training.pipeline import build_pipeline
from fklearn.training.imputation import imputer

# This example will try to create a linear regressor...
def load_sample_data():
    import pandas as pd
    from sklearn import datasets
    x, y = datasets.load_diabetes(as_frame=True, return_X_y=True)
    return pd.concat([x, y], axis=1)

# ...but pipelines can't be used as individual learners
nested_learner = build_pipeline(
    build_pipeline(
        imputer(columns_to_impute=["age"], impute_strategy='median'),
    ),
    linear_regression_learner(features=["age"], target=["target"])
)
fn, df_result, logs = nested_learner(load_sample_data())
'''
Output:
ValueError: Predict function of learner pipeline contains variable length arguments: kwargs
Make sure no predict function uses arguments like *args or **kwargs.
'''
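For context, the error message suggests the failure comes from signature introspection on each step's predict function: a pipeline's own predict function accepts `**kwargs`, which the nesting check rejects. A minimal sketch of such a check (the helper name `has_var_keyword` is hypothetical, not fklearn's actual code):

```python
import inspect

def has_var_keyword(fn):
    """Return True if fn accepts **kwargs (variable keyword arguments)."""
    return any(
        p.kind == inspect.Parameter.VAR_KEYWORD
        for p in inspect.signature(fn).parameters.values()
    )

def plain_predict(df, prediction_column="prediction"):
    # An ordinary learner's predict function has a fixed signature.
    return df

def pipeline_predict(df, **kwargs):
    # A pipeline's predict function forwards kwargs to its steps,
    # which is exactly what trips the check when pipelines are nested.
    return df

print(has_var_keyword(plain_predict))     # False
print(has_var_keyword(pipeline_predict))  # True
```

This explains why an inner pipeline is rejected where a plain learner is accepted: the outer pipeline sees `**kwargs` in the inner pipeline's predict function and raises.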

Problem description

This is a problem because, per the docs, "pipelines (should) behave exactly as individual learner functions". That is, pipelines should respect the L in SOLID (the Liskov substitution principle), but that is not happening.

The main benefit of supporting nested pipelines is more maintainable code: packing complex operations into one big step like

def custom_learner(feature: str, target: str):
    return build_pipeline(
        imputer(columns_to_impute=[feature], impute_strategy='median'),
        linear_regression_learner(features=[feature], target=[target])
    )

is cleaner and more readable.

Expected behavior

The "Code sample" should produce the same output as if nested_learner was defined as:

nested_learner = build_pipeline(
    imputer(columns_to_impute=["age"], impute_strategy='median'),
    linear_regression_learner(features=["age"], target=["target"])
)

That is, if a pipeline is a type of learner, it should be possible to put it in place of any other learner.

Possible solutions

A workaround is proposed in #145, but it only works if you already have a DataFrame (which is not the case in this scenario). I'd like to hear if someone has investigated this or knows what we need to change to support it; in the meantime, I propose adding a note to the docs to prevent someone else from trying to do this.

@peguerosdc peguerosdc added the bug Something isn't working label Jun 9, 2022
jmoralez commented Jul 8, 2022

Hi. I think this would also be useful when you want to split data preprocessing from the model. From what I understand, the problem is that the pipeline's predict_fn takes **kwargs, determines which step each kwarg belongs to, and forwards it to the corresponding function. So with two nested pipelines, it becomes hard to determine which pipeline a given kwarg belongs to.

One possible solution I see to this is taking the approach that sklearn takes when passing parameters to the estimators in a pipeline:

Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p. source

So I think it would be possible to have nested pipelines and then call the predict_fn with kwargs using this kind of prefixing, i.e. if we have the following:

pipe1 = build_pipeline(f(a=1), g(a=2))
pipe2 = build_pipeline(f(a=2), g(a=3))
pipe3 = build_pipeline(pipe1, pipe2)
predict_fn, data, logs = pipe3(df)
# the following would set the argument a to 3 in the f function of the first pipeline
predict_fn(df2, pipeline0__f__a=3)
# the following would set the argument a to 2 in the g function of the second pipeline
predict_fn(df2, pipeline1__g__a=2)

We would of course have the possibility of setting custom names for each pipeline and function, so the argument could look like preprocessing__scaler__column='something'.
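The prefix routing described above could be parsed with a simple split on the `__` separator, grouping kwargs by step and then by function. A minimal sketch (the function name `route_kwargs` is hypothetical, not part of fklearn):

```python
def route_kwargs(kwargs):
    """Group sklearn-style prefixed kwargs (step__fn__param) by step and function."""
    routed = {}
    for key, value in kwargs.items():
        # Split on the first two "__" separators only, so parameter
        # names containing underscores are preserved intact.
        step, fn, param = key.split("__", 2)
        routed.setdefault(step, {}).setdefault(fn, {})[param] = value
    return routed

print(route_kwargs({"pipeline0__f__a": 3, "pipeline1__g__a": 2}))
# {'pipeline0': {'f': {'a': 3}}, 'pipeline1': {'g': {'a': 2}}}
```

With this grouping, the outer pipeline would only need to strip its own prefix and pass the remaining nested dict down to the matching inner pipeline, which resolves the ambiguity of plain **kwargs.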

WDYT?
