Data Preparation

ahmedmohammed107 edited this page May 8, 2024 · 9 revisions

Data Preparation Base Class

Methods:

  1. init: Simply stores self.config = config. The config must contain "data_prep_registered_id" and "save_loc", along with all configs related to the module's logic. We can also call save_config from init.
  2. load_config: Loads the config from save_loc, where the processed data and the config file are stored.
  3. save_config: Saves the config in save_loc. If save_loc already exists, it loads the saved config and asserts that the config with which the prep object was instantiated matches the saved one; if not, it raises an error.
  4. save_data: Abstract method. Whenever called, it should first call save_config before saving any data. But how can we force users to follow this standard? One idea is to call save_config from init (given that users will never override init).
  5. run: Abstract method. Will change significantly from one project to another.
  6. save_module: Abstract method. Saves the module state (e.g. a scikit-learn scaler's state).
  7. load_module: Abstract method. Loads the module state.
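The methods above could be sketched as follows. This is a minimal illustration, not a final implementation: the class and helper names (`DataPrepBase`, `_config_path`) are placeholders, and it assumes the config is JSON-serializable and saved as `config.json` under `save_loc`. It also demonstrates the idea of calling save_config from init to enforce the standard automatically.

```python
import json
import os
from abc import ABC, abstractmethod


class DataPrepBase(ABC):
    """Hypothetical sketch of the data preparation base class."""

    def __init__(self, config: dict):
        # The config must contain the two required keys.
        assert "data_prep_registered_id" in config
        assert "save_loc" in config
        self.config = config
        self.save_loc = config["save_loc"]
        # Calling save_config here enforces the "save_config before
        # save_data" standard without relying on subclasses to do it.
        self.save_config()

    def _config_path(self) -> str:
        return os.path.join(self.save_loc, "config.json")

    def load_config(self) -> dict:
        # Loads the config saved alongside the processed data.
        with open(self._config_path()) as f:
            return json.load(f)

    def save_config(self) -> None:
        # If save_loc already holds a config, assert it matches the
        # config this object was instantiated with.
        if os.path.exists(self._config_path()):
            if self.load_config() != self.config:
                raise ValueError(
                    "Instantiated config differs from the config "
                    f"already saved in {self.save_loc}"
                )
        else:
            os.makedirs(self.save_loc, exist_ok=True)
            with open(self._config_path(), "w") as f:
                json.dump(self.config, f)

    @abstractmethod
    def save_data(self): ...

    @abstractmethod
    def run(self, df): ...

    @abstractmethod
    def save_module(self): ...

    @abstractmethod
    def load_module(self): ...
```

Because save_config runs inside init, instantiating a module twice with different configs against the same save_loc fails fast, which is exactly the reproducibility guarantee described above.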

Example: MinMaxScaler

  1. init ===> config = {'data_prep_registered_id': 'MinMaxScaler_v0', 'save_loc': '/path/to/save/loc/', 'feature_range': [0, 1]}
  2. load_config ===> No arguments for this method as it already knows self.save_loc in which config.json is stored
  3. save_config ===> No args
  4. save_data ===> No args
  5. run ===> Takes df: pd.DataFrame as input
  6. save_module ===> No args
  7. load_module ===> No args
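A sketch of how the MinMaxScaler example could look. To keep the snippet self-contained it computes the min-max scaling directly with pandas instead of wrapping scikit-learn's MinMaxScaler, and it does not inherit from the base class (in practice it would); the class name `MinMaxScalerPrep` and the `module.json` state file are assumptions.

```python
import json
import os

import pandas as pd


class MinMaxScalerPrep:
    """Hypothetical prep module scaling each column into feature_range."""

    def __init__(self, config: dict):
        self.config = config
        self.save_loc = config["save_loc"]
        self.feature_range = tuple(config["feature_range"])
        self._min = None
        self._max = None

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        # Fit the column-wise min/max, then scale into feature_range.
        self._min, self._max = df.min(), df.max()
        lo, hi = self.feature_range
        scaled = (df - self._min) / (self._max - self._min)
        return scaled * (hi - lo) + lo

    def save_module(self) -> None:
        # Persist the fitted state, analogous to saving a
        # scikit-learn scaler's state.
        state = {"min": self._min.to_dict(), "max": self._max.to_dict()}
        with open(os.path.join(self.save_loc, "module.json"), "w") as f:
            json.dump(state, f)

    def load_module(self) -> None:
        with open(os.path.join(self.save_loc, "module.json")) as f:
            state = json.load(f)
        self._min = pd.Series(state["min"])
        self._max = pd.Series(state["max"])


# The example config from above:
config = {
    "data_prep_registered_id": "MinMaxScaler_v0",
    "save_loc": "/path/to/save/loc/",
    "feature_range": [0, 1],
}
```

Note how save_module/load_module take no arguments: the module already knows its save_loc from the config, matching the signatures listed above.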

Concern

While ensuring that we save the configs with which the module was instantiated is beneficial, it is not enough when several prep modules are called sequentially. We need a way to record the sequence in which these modules were called, so that we can identify the entire preprocessing pipeline that led to a given saved data object.
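One possible direction for this concern, sketched only as an assumption (the helper name and the `pipeline.json` manifest are not part of the design above): each module appends its registered id to an ordered manifest in the shared save location, so the full pipeline can be reconstructed later.

```python
import json
import os


def append_to_pipeline_manifest(save_loc: str, registered_id: str) -> None:
    """Hypothetical helper: record the order in which prep modules ran
    by appending each module's registered id to pipeline.json."""
    path = os.path.join(save_loc, "pipeline.json")
    steps = []
    if os.path.exists(path):
        with open(path) as f:
            steps = json.load(f)
    steps.append(registered_id)
    with open(path, "w") as f:
        json.dump(steps, f)
```

If each module's run (or save_data) called this helper, the manifest would read, e.g., ["Imputer_v0", "MinMaxScaler_v0"], identifying the exact preprocessing sequence behind the saved data.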
