Data Preparation

ahmedmohammed107 edited this page May 8, 2024 · 9 revisions

Data Preparation Base Class

Methods:

  1. init: Simply stores self.config = config. The config must contain "data_prep_registered_id" and "save_loc", along with all configs related to the module's logic. We can also call save_config from init.
  2. load_config: Loads the config from save_loc, where the processed data and the config file are stored.
  3. save_config: Saves the config in save_loc. If save_loc already exists, it loads the saved config and asserts that the config with which the prep object was instantiated matches the saved one; if not, it raises an error.
  4. save_data: Abstract method. Whenever called, it should first call save_config before saving any data. But how can we force users to follow this standard? One idea is to call save_config from init (given that users will never override init).
  5. run: Abstract method. Will change significantly from one project to another.
  6. save_module: Abstract method. Saves the module state (e.g. a scikit-learn scaler's state).
  7. load_module: Abstract method. Loads the module state.
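The methods above could be sketched as follows. This is a minimal illustration, not a final implementation: the class and helper names (`DataPrepBase`, `_config_path`) are placeholders, and it assumes the config is JSON-serializable and saved as `config.json` under `save_loc`. It also demonstrates the idea of calling save_config from init to enforce the standard automatically.

```python
import json
import os
from abc import ABC, abstractmethod


class DataPrepBase(ABC):
    """Hypothetical sketch of the data preparation base class."""

    def __init__(self, config: dict):
        # The config must contain the two required keys.
        assert "data_prep_registered_id" in config
        assert "save_loc" in config
        self.config = config
        self.save_loc = config["save_loc"]
        # Calling save_config here enforces the "save_config before
        # save_data" standard without relying on subclasses to do it.
        self.save_config()

    def _config_path(self) -> str:
        return os.path.join(self.save_loc, "config.json")

    def load_config(self) -> dict:
        # Loads the config saved alongside the processed data.
        with open(self._config_path()) as f:
            return json.load(f)

    def save_config(self) -> None:
        # If save_loc already holds a config, assert it matches the
        # config this object was instantiated with.
        if os.path.exists(self._config_path()):
            if self.load_config() != self.config:
                raise ValueError(
                    "Instantiated config differs from the config "
                    f"already saved in {self.save_loc}"
                )
        else:
            os.makedirs(self.save_loc, exist_ok=True)
            with open(self._config_path(), "w") as f:
                json.dump(self.config, f)

    @abstractmethod
    def save_data(self): ...

    @abstractmethod
    def run(self, df): ...

    @abstractmethod
    def save_module(self): ...

    @abstractmethod
    def load_module(self): ...
```

Because save_config runs inside init, instantiating a module twice with different configs against the same save_loc fails fast, which is exactly the reproducibility guarantee described above.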

Example: MinMaxScaler

  1. init ===> config = {'data_prep_registered_id': 'MinMaxScaler_v0', 'save_loc': '/path/to/save/loc/', 'feature_range': [0, 1]}
  2. load_config ===> No arguments for this method as it already knows self.save_loc in which config.json is stored
  3. save_config ===> No args
  4. save_data ===> No args
  5. run ===> Takes df: pd.DataFrame as input
  6. save_module ===> No args
  7. load_module ===> No args
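A sketch of how the MinMaxScaler example could look. To keep the snippet self-contained it computes the min-max scaling directly with pandas instead of wrapping scikit-learn's MinMaxScaler, and it does not inherit from the base class (in practice it would); the class name `MinMaxScalerPrep` and the `module.json` state file are assumptions.

```python
import json
import os

import pandas as pd


class MinMaxScalerPrep:
    """Hypothetical prep module scaling each column into feature_range."""

    def __init__(self, config: dict):
        self.config = config
        self.save_loc = config["save_loc"]
        self.feature_range = tuple(config["feature_range"])
        self._min = None
        self._max = None

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        # Fit the column-wise min/max, then scale into feature_range.
        self._min, self._max = df.min(), df.max()
        lo, hi = self.feature_range
        scaled = (df - self._min) / (self._max - self._min)
        return scaled * (hi - lo) + lo

    def save_module(self) -> None:
        # Persist the fitted state, analogous to saving a
        # scikit-learn scaler's state.
        state = {"min": self._min.to_dict(), "max": self._max.to_dict()}
        with open(os.path.join(self.save_loc, "module.json"), "w") as f:
            json.dump(state, f)

    def load_module(self) -> None:
        with open(os.path.join(self.save_loc, "module.json")) as f:
            state = json.load(f)
        self._min = pd.Series(state["min"])
        self._max = pd.Series(state["max"])


# The example config from above:
config = {
    "data_prep_registered_id": "MinMaxScaler_v0",
    "save_loc": "/path/to/save/loc/",
    "feature_range": [0, 1],
}
```

Note how save_module/load_module take no arguments: the module already knows its save_loc from the config, matching the signatures listed above.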

Concern

While ensuring that we save the configs with which the module was instantiated is beneficial, it is not enough when several prep modules are called sequentially. We need a way to record the sequence in which these modules were called, so that we can identify the entire preprocessing pipeline that led to a given saved data object.
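One possible direction for this concern, sketched only as an assumption (the helper name and the `pipeline.json` manifest are not part of the design above): each module appends its registered id to an ordered manifest in the shared save location, so the full pipeline can be reconstructed later.

```python
import json
import os


def append_to_pipeline_manifest(save_loc: str, registered_id: str) -> None:
    """Hypothetical helper: record the order in which prep modules ran
    by appending each module's registered id to pipeline.json."""
    path = os.path.join(save_loc, "pipeline.json")
    steps = []
    if os.path.exists(path):
        with open(path) as f:
            steps = json.load(f)
    steps.append(registered_id)
    with open(path, "w") as f:
        json.dump(steps, f)
```

If each module's run (or save_data) called this helper, the manifest would read, e.g., ["Imputer_v0", "MinMaxScaler_v0"], identifying the exact preprocessing sequence behind the saved data.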
