Predictor Contribution #96

Review comment: Instructions on how to create a custom predictor. We link to relevant files to read over to get an idea of how to do it.

@@ -0,0 +1,17 @@
# Custom Predictors

This directory contains custom predictors that can be used with the ELUC use case. Since percent change is measurable, we look for predictors that can predict ELUC.

## Create a Custom Predictor

An example custom predictor can be found in the [template](predictors/custom/template) folder. To create a custom predictor, two steps must be completed:

Review comment: The link seems wrong. We're already in …
1. You need to implement the `Predictor` interface. This is defined in [predictor.py](predictors/predictor.py). It is a simple abstract class that requires a `predict` method that takes in a dataframe of context and actions and returns a dataframe of outcomes.

Review comment: The link seems wrong. We're already in …
2. You need to implement, either in the same class or in a dedicated serializer class, a `load` method that takes in a path to a model on disk and returns an instance of the `Predictor`. (See [serializer.py](persistence/persistors/serializers/serializer.py) for the serialization interface and [neural_network_serializer.py](persistence/persistors/serializers/neural_network_serializer.py) for an example of how to implement serialization.) A minimal sketch follows the review comments below.

Review comment: Check these other links too.

Review comment: Fixed all links to be relative rather than absolute.
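
Putting the two steps together, a minimal custom predictor might look like the following sketch. It condenses the template predictor included later in this PR; the class name `MyPredictor` and the constant dummy value are illustrative, not part of the codebase.

```python
import pandas as pd

from data import constants
from predictors.predictor import Predictor


class MyPredictor(Predictor):
    """Illustrative predictor that returns a constant ELUC value."""
    def __init__(self):
        super().__init__(context=constants.CAO_MAPPING["context"],
                         actions=constants.CAO_MAPPING["actions"],
                         outcomes=constants.CAO_MAPPING["outcomes"])

    def predict(self, context_actions_df: pd.DataFrame) -> pd.DataFrame:
        # One outcome row per input row, indexed like the input.
        return pd.DataFrame({"ELUC": 0.0}, index=context_actions_df.index)

    @classmethod
    def load(cls, path: str) -> "MyPredictor":
        # A real predictor would restore its weights from `path`.
        return cls()
```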
Finally, you must add your custom predictor to the [config](predictors/evaluation/config.json) file in order to evaluate it.
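
For example, the template predictor's entry in that config (reproduced verbatim from the full config shown later in this PR) is:

```json
{
    "type": "local",
    "name": "TemplatePredictor",
    "classpath": "predictors/custom/template/template_predictor.py",
    "filepath": "predictors/custom/template/model.pt"
}
```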

### Load from HuggingFace
To load a custom model saved on HuggingFace, see the [HuggingFacePersistor](persistence/persistors/hf_persistor.py) class. It takes a `FileSerializer`, downloads a HuggingFace model to disk, then loads it. An example of how to evaluate a model from HuggingFace can be found in the [config](predictors/evaluation/config.json).
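
As a rough sketch of what that looks like, following the calls the Evaluator below makes: the serializer import path is an assumption based on the classpath in the example config, and the repo URL and local directory are the example values used there.

```python
from persistence.persistors.hf_persistor import HuggingFacePersistor
# Assumed import path, derived from the classpath in the example config.
from persistence.serializers.neural_network_serializer import NeuralNetSerializer

persistor = HuggingFacePersistor(NeuralNetSerializer())
# Downloads the model from HuggingFace to local_dir, then loads it from disk.
predictor = persistor.from_pretrained("danyoung/eluc-global-nn",
                                      local_dir="predictors/trained_models/danyoung--eluc-global-nn")
```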

@@ -0,0 +1,33 @@
""" | ||
See here for how to impelement a predictor: | ||
""" | ||
import pandas as pd | ||
|
||
from data import constants | ||
from predictors.predictor import Predictor | ||
|
||
class TemplatePredictor(Predictor): | ||
""" | ||
A template predictor returning dummy values for ELUC. | ||
The class that gets passed into the Evaluator should call the load method which should return a Predictor. | ||
The Predictor just needs to impelement predict. | ||
""" | ||
def __init__(self): | ||
super().__init__(context=constants.CAO_MAPPING["context"], | ||
actions=constants.CAO_MAPPING["actions"], | ||
outcomes=constants.CAO_MAPPING["outcomes"]) | ||
|
||
def fit(self, X_train, y_train): | ||
pass | ||
|
||
def predict(self, context_actions_df: pd.DataFrame) -> pd.DataFrame: | ||
dummy_eluc = list(range(len(context_actions_df))) | ||
return pd.DataFrame({"ELUC": dummy_eluc}, index=context_actions_df.index) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Dummy predictor just to show people how to create one. Just returns random numbers. |
||
|
||
@classmethod | ||
def load(cls, path: str) -> "TemplatePredictor": | ||
""" | ||
Dummy load function that just returns a new instance of the class. | ||
""" | ||
print("Loading model from", path) | ||
return cls() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Dummy predictor implements the load function rather than being in its own serializer to show that this is possible. All you need is a load and predict function but the serializer makes things nicer in our official predictors. |
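
Hypothetical usage of the template: the import path is an assumption based on the classpath in the config below, and the model path is the placeholder from that config (the dummy `load` never actually reads it).

```python
import pandas as pd

# Assumed import path, derived from the classpath in the example config.
from predictors.custom.template.template_predictor import TemplatePredictor

predictor = TemplatePredictor.load("predictors/custom/template/model.pt")
# Any dataframe works here, since the template's predict only uses the index.
dummy_df = pd.DataFrame({"example_col": [0.1, 0.2, 0.3]})
print(predictor.predict(dummy_df))  # one dummy "ELUC" value per input row
```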

@@ -0,0 +1,31 @@
{
    "models": [
        {
            "type": "local",
            "name": "TemplatePredictor",
            "classpath": "predictors/custom/template/template_predictor.py",
            "filepath": "predictors/custom/template/model.pt"
        },
        {
            "type": "hf",
            "name": "NeuralNetSerializer",
            "classpath": "persistence/serializers/neural_network_serializer.py",
            "url": "danyoung/eluc-global-nn",
            "filepath": "predictors/trained_models/danyoung--eluc-global-nn"
        },
        {
            "type": "hf",
            "name": "SKLearnSerializer",
            "url": "danyoung/eluc-global-linreg",
            "classpath": "persistence/serializers/sklearn_serializer.py",
            "filepath": "predictors/trained_models/danyoung--eluc-global-linreg"
        },
        {
            "type": "hf",
            "name": "SKLearnSerializer",
            "url": "danyoung/eluc-global-rf",
            "classpath": "persistence/serializers/sklearn_serializer.py",
            "filepath": "predictors/trained_models/danyoung--eluc-global-rf"
        }
    ]
}

Review comment: We can load with local or HuggingFace from the config.

@@ -0,0 +1,92 @@
""" | ||
Class to evaluate predictors given a config on a dataset. | ||
Also a script to demo how it works. | ||
""" | ||
import importlib | ||
import json | ||
from pathlib import Path | ||
|
||
import pandas as pd | ||
|
||
import data.constants as constants | ||
from data.eluc_data import ELUCData | ||
from persistence.persistors.hf_persistor import HuggingFacePersistor | ||
from predictors.predictor import Predictor | ||
from predictors.evaluation.validator import Validator | ||
|
||
class Evaluator(): | ||
""" | ||
Evaluator class to evaluate predictors on a dataset. | ||
Uses a config to dynamically load predictors. | ||
The config must point to the classpath of a serializer that can call .load() to return a Predictor object. | ||
Alternatively, it may use a HuggingFace url to download a model to a given path, THEN load with the serializer. | ||
""" | ||
def __init__(self, config: dict): | ||
""" | ||
Initializes the Evaluator with the custom classes it has to load. | ||
""" | ||
self.predictors = self.dynamically_load_models(config) | ||
# We don't pass change into the outcomes column. | ||
self.validator = Validator(constants.CAO_MAPPING["context"], constants.CAO_MAPPING["actions"], ["ELUC"]) | ||
|
||
def dynamically_load_models(self, config: dict) -> list[Predictor]: | ||
""" | ||
Uses importlib to dynamically load models from a config. | ||
Config must have a list of models with the following: | ||
- type: "hf" or "local" to determine if it is a HuggingFace model or local model. | ||
- name: name of the serializer class to load. | ||
- classpath: path to the class that calls .load() | ||
- filepath: path to the model on disk or where to save the HuggingFace model. | ||
- (optional) url: url to download the model from HuggingFace. | ||
Returns a dict with keys being the filepath and values being the Predictor object. | ||
""" | ||
predictors = {} | ||
for model in config["models"]: | ||
# We dynamically instantiate model_instance as some sort of class that can handle .load() and returns | ||
# a Predictor object. | ||
spec = importlib.util.spec_from_file_location(model["name"], model["classpath"]) | ||
module = importlib.util.module_from_spec(spec) | ||
spec.loader.exec_module(module) | ||
model_instance = getattr(module, model["name"]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Dynamic load function uses importlib to load arbitrary classes. Did not realize python could do this! |
||
|
||
# Once we have our model_instance we can load the model from disk or from HuggingFace. | ||
if model["type"] == "hf": | ||
persistor = HuggingFacePersistor(model_instance()) | ||
predictor = persistor.from_pretrained(model["url"], local_dir=model["filepath"]) | ||
elif model["type"] == "local": | ||
predictor = model_instance().load(Path(model["filepath"])) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can load from huggingface or locally |
||
else: | ||
raise ValueError("Model type must be either 'hf' or 'local'") | ||
predictors[model["filepath"]] = predictor | ||
return predictors | ||
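
The dynamic loading above boils down to the standard importlib recipe. A standalone sketch, where "path/to/my_module.py" and "MyClass" are hypothetical placeholders:

```python
import importlib.util

# Build a module spec from an arbitrary file path, execute the file as a
# module, and pull a class out of the resulting module by name.
spec = importlib.util.spec_from_file_location("my_module", "path/to/my_module.py")
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
MyClass = getattr(module, "MyClass")
instance = MyClass()
```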

    def evaluate(self, test_df: pd.DataFrame):
        """
        Evaluates our list of predictors on a given test dataframe.
        The dataframe is expected to be raw data.
        """
        y_true = test_df["ELUC"]
        test_df = self.validator.validate_input(test_df)
        results = {}
        for predictor_path, predictor in self.predictors.items():
            outcome_df = predictor.predict(test_df)
            assert self.validator.validate_output(test_df, outcome_df)
            y_pred = outcome_df["ELUC"]
            mae = (y_true - y_pred).abs().mean()
            results[predictor_path] = mae
        return results


def run_evaluation():
    """
    A demo script to show how the Evaluator class works.
    """
    print("Evaluating models in config.json...")
    with open(Path("predictors/evaluation/config.json"), "r", encoding="utf-8") as config_file:
        config = json.load(config_file)
    evaluator = Evaluator(config)
    dataset = ELUCData.from_hf()
    results = evaluator.evaluate(dataset.test_df)
    print("Results:")
    print(results)


if __name__ == "__main__":
    run_evaluation()

@@ -0,0 +1,42 @@
""" | ||
Validation of input and output dataframes for predictor evaluation. | ||
""" | ||
import pandas as pd | ||
|
||
class Validator(): | ||
""" | ||
Validates input and output dataframes for predictor evaluation. | ||
Context, actions, outcomes do not necessarily have to match the project's CAO_MAPPING. For example, if we are | ||
just evaluating ELUC we can just pass the single column as outcomes. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. But we still need to have at least the context and actions in the input of the predictor, right? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Have you looked at https://github.com/cognizant-ai-labs/covid-xprize/blob/master/covid_xprize/validation/predictor_validation.py for inspiration? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Currently we leave it open to the Validator which context/actions to make sure are in the inputs. When we create the Validator in the Scoring script we pass in our correct CA and just ELUC as O. |
||
""" | ||

    def __init__(self, context: list[str], actions: list[str], outcomes: list[str]):
        self.context = context
        self.actions = actions
        self.outcomes = outcomes

    def validate_input(self, context_actions_df: pd.DataFrame) -> pd.DataFrame:
        """
        Verifies all the context and actions columns are in context_actions_df.
        Then removes outcomes from context_actions_df and returns a deep copy of it.
        """
        if not set(self.context + self.actions) <= set(context_actions_df.columns):
            not_seen = set(self.context + self.actions) - set(context_actions_df.columns)
            raise ValueError(f"Columns {not_seen} not found in input dataframe.")

        seen_outcomes = [col for col in self.outcomes if col in context_actions_df.columns]
        return context_actions_df.drop(columns=seen_outcomes).copy()

Review comment: We validate that the context and actions columns are in the input, then remove outcomes from the input.

    def validate_output(self, context_actions_df: pd.DataFrame, outcomes_df: pd.DataFrame):
        """
        Makes sure the index of context_actions_df and outcomes_df match so we can compute metrics like MAE.
        Also checks that all outcomes are present in the outcomes_df.
        """
        if not context_actions_df.index.equals(outcomes_df.index):
            raise ValueError("Index of context_actions_df and outcomes_df do not match.")

        if not set(self.outcomes) == set(outcomes_df.columns):
            not_seen = set(self.outcomes) - set(outcomes_df.columns)
            raise ValueError(f"Outcomes {not_seen} not found in output dataframe.")

        return True

Review comment: We make sure indices are the same in the output so we can compare, and make sure the column names match up.

Review comment: Had to rename this folder because it was getting confused with the real sklearn library. It took me an hour to debug this!!