Data Preparation
ahmedmohammed107 edited this page May 8, 2024 · 9 revisions
Methods:
- init: All it does is self.config = config. The config must contain "data_prep_registered_id" and "save_loc", along with all configs related to the module's logic. We can also call save_config from init.
- load_config: Loads the config saved in save_loc, the directory where the processed data and the config file are stored.
- save_config: Saves the config in save_loc. If save_loc already exists, it loads the saved config and asserts that the config with which the prep object was instantiated matches the saved one. If not, an error is raised!
- save_data: Abstract method. Whenever called, it should first call save_config before saving any data. But how can we force users to follow this standard? One idea is to call save_config from init (given that users will never override init).
- run: Abstract method. Will change significantly from one project to another.
- save_module: Abstract method. Saves the module's state (e.g. a scikit-learn scaler's fitted state).
- load_module: Abstract method. Loads the module's state.
- init ===> config = {'data_prep_registered_id': 'MinMaxScaler_v0', 'save_loc': '/path/to/save/loc/', 'feature_range': [0, 1]}
- load_config ===> No arguments; the method already knows self.save_loc, where config.json is stored
- save_config ===> No args
- save_data ===> No args
- run ===> Takes df: pd.DataFrame as input
- save_module ===> No args
- load_module ===> No args
While making sure that we save the configs with which the module was instantiated is beneficial, it is not enough when several prep modules are called sequentially. We also need a way to record the order in which these modules were called, so that we can identify the entire preprocessing pipeline that led to a given saved data object.