Welcome to the Future of Data Preprocessing!
Diving into the world of machine learning and data science, we often find ourselves tangled in the preprocessing jungle. Worry no more! Introducing a state-of-the-art data preprocessing model based on TensorFlow Keras and the innovative use of Keras preprocessing layers.
Say goodbye to tedious data preparation tasks and hello to streamlined, efficient, and scalable data pipelines. Whether you're a seasoned data scientist or just starting out, this tool is designed to supercharge your ML workflows, making them more robust and faster than ever!
"},{"location":"#key-features","title":"\ud83d\udd11 Key Features:","text":"\ud83d\udee0 Automated Feature Engineering: Automatically detects and applies the most suitable preprocessing steps for each feature type in your dataset, ensuring optimal data preparation with minimal manual intervention.
\ud83c\udfa8 Customizable Preprocessing Pipelines: Tailor your preprocessing steps with ease. Choose from a comprehensive range of options for numeric, categorical, text data, and even complex feature crosses, allowing for precise and effective data handling.
\ud83d\udcca Scalability and Efficiency: Engineered for high performance, this tool handles large datasets effortlessly, leveraging TensorFlow's robust computational capabilities.
\ud83e\udde0 Enhanced with Transformer Blocks: Incorporate transformer blocks into your preprocessing model to boost feature interaction analysis and uncover complex patterns, enhancing predictive model accuracy.
\u2699\ufe0f Easy Integration: Designed to seamlessly integrate as the first layers in your TensorFlow Keras models, facilitating a smooth transition from raw data to trained model, accelerating your workflow significantly.
We use Poetry for handling dependencies, so you will need to install it first. Then you can install the project dependencies by running:
poetry install
or to enter a dedicated env directly:
poetry shell
Then you can simply configure your preprocessor:
"},{"location":"#building-preprocessor","title":"\ud83d\udee0\ufe0f Building Preprocessor:","text":"The simplest application of ths preprocessing model is as follows:
from kdp import PreprocessingModel
from kdp import FeatureType

# DEFINING FEATURES PROCESSORS
features_specs = {
    # ======= NUMERICAL Features =========================
    "feat1": FeatureType.FLOAT_NORMALIZED,
    "feat2": FeatureType.FLOAT_RESCALED,
    # ======= CATEGORICAL Features ========================
    "feat3": FeatureType.STRING_CATEGORICAL,
    "feat4": FeatureType.INTEGER_CATEGORICAL,
    # ======= TEXT Features ========================
    "feat5": FeatureType.TEXT,
}

# INSTANTIATE THE PREPROCESSING MODEL with your data
ppr = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_specs,
)
# construct the preprocessing pipelines
ppr.build_preprocessor()
This will output:
{
  'model': <Functional name=preprocessor, built=True>,
  'inputs': {
    'feat1': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=feat1>,
    'feat2': <KerasTensor shape=(None, 1), dtype=float32, sparse=None, name=feat2>,
    'feat3': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=feat3>,
    'feat4': <KerasTensor shape=(None, 1), dtype=int32, sparse=None, name=feat4>,
    'feat5': <KerasTensor shape=(None, 1), dtype=string, sparse=None, name=feat5>
  },
  'signature': {
    'feat1': TensorSpec(shape=(None, 1), dtype=tf.float32, name='feat1'),
    'feat2': TensorSpec(shape=(None, 1), dtype=tf.float32, name='feat2'),
    'feat3': TensorSpec(shape=(None, 1), dtype=tf.string, name='feat3'),
    'feat4': TensorSpec(shape=(None, 1), dtype=tf.int32, name='feat4'),
    'feat5': TensorSpec(shape=(None, 1), dtype=tf.string, name='feat5')
  },
  'output_dims': 45
}
This will result in the following preprocessing steps:
Success
You can define the preprocessing model with the features_specs
dictionary, where the keys are the feature names and the values are the feature types. The model will automatically apply the appropriate preprocessing steps based on the feature type.
You have access to several layers of customization per feature type, such as normalization, rescaling, or even definition of custom preprocessing steps.
See \ud83d\udc40 Defining Features for more details.
Info
You can use the preprocessing model independently to preprocess your data or integrate it into your Keras model as the first layer, see \ud83d\udc40 Integrations
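For illustration, here is a minimal sketch of calling the built preprocessor directly on one batch of raw data. This assumes `ppr.model` accepts a dict of batched tensors keyed by feature name, as the `inputs`/`signature` output above suggests:

```python
import tensorflow as tf

# One batch of two raw examples; shapes and dtypes follow the signature above.
raw_batch = {
    "feat1": tf.constant([[0.1], [0.7]], dtype=tf.float32),
    "feat2": tf.constant([[10.0], [250.0]], dtype=tf.float32),
    "feat3": tf.constant([["red"], ["blue"]]),
    "feat4": tf.constant([[1], [3]], dtype=tf.int32),
    "feat5": tf.constant([["short text"], ["another example"]]),
}

# The preprocessor is a regular Keras model, so it can be called directly.
preprocessed = ppr.model(raw_batch)
print(preprocessed.shape)  # (2, output_dims), e.g. (2, 45) for the spec above
```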
"},{"location":"#advanced-configuration-options","title":"Advanced Configuration Options","text":""},{"location":"#transformer-blocks-configuration","title":"Transformer Blocks Configuration","text":"Enhance your preprocessing model with transformer blocks to capture complex patterns and interactions between features, see \ud83d\udc40 Transformer Blocks. You can configure the transformer blocks as follows:
transfo_placement: apply the transformer blocks either only to categorical features (CATEGORICAL) or to all features (ALL_FEATURES).
Example configuration:
transfo_config = {
    'transfo_nr_blocks': 3,
    'transfo_nr_heads': 4,
    'transfo_ff_units': 64,
    'transfo_dropout_rate': 0.1,
    'transfo_placement': 'ALL_FEATURES'
}

ppr = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_specs,
    **transfo_config
)
"},{"location":"#custom-preprocessors","title":"Custom Preprocessors","text":"Tailor your preprocessing steps with custom preprocessors for each feature type. Define specific preprocessing logic that fits your data characteristics or domain-specific requirements, see \ud83d\udc40 Custom Preprocessors.
Example of adding a custom preprocessor:
from kdp.custom_preprocessors import MyCustomScaler

features_specs = {
    "feat1": {
        'feature_type': FeatureType.FLOAT_NORMALIZED,
        'preprocessors': [MyCustomScaler()]
    }
}

ppr = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_specs
)
"},{"location":"#feature-crosses","title":"Feature Crosses","text":"Create complex feature interactions by crossing features. This method combines features into a single feature, which can be particularly useful for models that benefit from understanding interactions between specific features, see \ud83d\udc40 Feature Crosses.
Example of defining feature crosses:
feature_crosses = [
    ("feat1", "feat2", 10),
    ("feat3", "feat4", 5)
]

ppr = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_specs,
    feature_crosses=feature_crosses
)
These advanced configurations allow for greater flexibility and power in your preprocessing pipelines, enabling more sophisticated data transformations and feature engineering.
"},{"location":"contributing/","title":"\ud83d\udcbb Contributing: Join the Preprocessing Revolution! \ud83d\udee0\ufe0f","text":"Eager to contribute? Great! We're excited to welcome new contributors to our project. Here's how you can get involved:
"},{"location":"contributing/#new-ideas-features-requests","title":"\ud83d\udca1 New Ideas / Features Requests","text":"If you wan't to request a new feature or you have detected an issue, please use the following link: ISSUES
"},{"location":"contributing/#getting-started","title":"\ud83d\ude80 Getting Started:","text":"Fork the Repository: Visit our GitHub page, fork the repository, and clone it to your local machine.
Set Up Your Environment: Make sure you have TensorFlow, Loguru, and all necessary dependencies installed.
Make sure you have installed the pre-commit hook locally
??? installation-guide Before using the pre-commit hook, you need to install it in your Python environment.
```bash
conda install -c conda-forge pre-commit
```

Then go to the root folder of this repository, activate your venv, and use the following command:

```bash
pre-commit install
```
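Once installed, you can optionally run all hooks against the entire codebase to verify your setup:

```bash
pre-commit run --all-files
```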
Create a new branch to package your code
Use a standardized commit message:
{LABEL}(KDP): {message}
This is very important for the automatic releases (semantic release) and to have clean history on the master branch.
??? Labels-types
| Label    | Usage |
| -------- | ----- |
| break    | `break` is used to identify changes related to old compatibility or functionality that breaks the current usage (major) |
| feat     | `feat` is used to identify changes related to new backward-compatible abilities or functionality (minor) |
| init     | `init` is used to identify the start of the project (minor) |
| enh      | `enh` is used to identify changes related to improvements of existing abilities or functionality (patch) |
| build    | `build` (also known as `chore`) is used to identify **development** changes related to the build system (involving scripts, configurations, or tools) and package dependencies (patch) |
| ci       | `ci` is used to identify **development** changes related to the continuous integration and deployment system - involving scripts, configurations, or tools (minor) |
| docs     | `docs` is used to identify documentation changes related to the project; whether intended externally for the end-users or internally for the developers (patch) |
| perf     | `perf` is used to identify changes related to backward-compatible **performance improvements** (patch) |
| refactor | `refactor` is used to identify changes related to modifying the codebase, which neither adds a feature nor fixes a bug - such as removing redundant code, simplifying the code, renaming variables, etc.<br />i.e. handy for your wip ; ) (patch) |
| style    | `style` is used to identify **development** changes related to styling the codebase, regardless of the meaning - such as indentations, semi-colons, quotes, trailing commas, and so on (patch) |
| test     | `test` is used to identify **development** changes related to tests - such as refactoring existing tests or adding new tests (minor) |
| fix      | `fix` is used to identify changes related to backward-compatible bug fixes (patch) |
| ops      | `ops` is used to identify changes related to deployment files like `values.yml`, `gateway.yml`, or `Jenkinsfile` in the **ops** directory (minor) |
| hotfix   | `hotfix` is used to identify **production** changes related to backward-compatible bug fixes (patch) |
| revert   | `revert` is used to identify reverted changes (patch) |
| maint    | `maint` is used to identify **maintenance** changes related to the project (patch) |
Merge requests drive the semantic-release storytelling, so use them wisely! The automatically generated changelog report is based on your commits merged into the main branch and should cover all the things you did for the project. As an example, with the
`feat` label, think about what part of the feature you are working on in your messages, e.g.:
- `initializing base pre-processing code`
- `init repo structure`
- `adding pre-processing unit-tests`
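Putting the label and message together, a full commit could look like this (illustrative example):

```bash
git commit -m "feat(KDP): adding pre-processing unit-tests"
```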
The name of your MR should follow the same exact convention as your commits (we have a dedicated check for this in the CI):
`{LABEL}(KDP): {message}`
Keep Merge Requests small (< 400 lines) and submit them more often, rather than one huge MR for the whole project; this makes reviews quicker and simpler!
Ask for a Code Review!
Once your MR is approved, resolve all open conversations and pass all the CI checks before merging.
All the Tests for your code should pass -> REMEMBER NO TESTS = NO MERGE \ud83d\udea8
Customize the preprocessing pipeline by setting up a dictionary that maps feature names to their respective types, tailored to your specific requirements.
"},{"location":"features/#numeric-features","title":"\ud83d\udcaf Numeric Features","text":"Explore various methods to define numerical features tailored to your needs:
\u2139\ufe0f Simple Declaration / \ud83d\udd27 Using FeatureType / \ud83d\udcaa Custom NumericalFeature

features_specs = {
    "feat1": "float",
    "feat2": "FLOAT",
    "feat3": "FLOAT_NORMALIZED",
    "feat4": "FLOAT_RESCALED",
    ...
}
Utilize predefined preprocessing configurations with FeatureType
.
from kdp.features import FeatureType

features_specs = {
    "feat1": FeatureType.FLOAT_NORMALIZED,
    "feat2": FeatureType.FLOAT_RESCALED,
    ...
}
Available FeatureType
options include (as used throughout this documentation): FLOAT, FLOAT_NORMALIZED, FLOAT_RESCALED, FLOAT_DISCRETIZED.
Customize preprocessing by passing specific parameters to NumericalFeature
.
from kdp.features import NumericalFeature

features_specs = {
    "feat3": NumericalFeature(
        name="feat3",
        feature_type=FeatureType.FLOAT_DISCRETIZED,
        bin_boundaries=[(1, 10)],
    ),
    "feat4": NumericalFeature(
        name="feat4",
        feature_type=FeatureType.FLOAT,
    ),
    ...
}
Here's how the numeric preprocessing pipeline looks:
"},{"location":"features/#categorical-features","title":"\ud83d\udc08\u200d\u2b1b Categorical Features","text":"Define categorical features flexibly:
\u2139\ufe0f Simple Declaration / \ud83d\udd27 Using FeatureType / \ud83d\udcaa Custom CategoricalFeature

features_specs = {
    "feat1": "INTEGER_CATEGORICAL",
    "feat2": "STRING_CATEGORICAL",
    "feat3": "string_categorical",
    ...
}
Leverage default configurations with FeatureType
.
from kdp.features import FeatureType

features_specs = {
    "feat1": FeatureType.INTEGER_CATEGORICAL,
    "feat2": FeatureType.STRING_CATEGORICAL,
    ...
}
Available FeatureType
options include (as used throughout this documentation): INTEGER_CATEGORICAL, STRING_CATEGORICAL.
Tailor feature processing by specifying properties in CategoricalFeature
.
from kdp.features import CategoricalFeature

features_specs = {
    "feat1": CategoricalFeature(
        name="feat1",
        feature_type=FeatureType.INTEGER_CATEGORICAL,
        embedding_size=100,
    ),
    "feat2": CategoricalFeature(
        name="feat2",
        feature_type=FeatureType.STRING_CATEGORICAL,
    ),
    ...
}
See how the categorical preprocessing pipeline appears:
"},{"location":"features/#text-features","title":"\ud83d\udcdd Text Features","text":"Customize text features in multiple ways to fit your project's demands:
\u2139\ufe0f Simple Declaration / \ud83d\udd27 Using FeatureType / \ud83d\udcaa Custom TextFeature

features_specs = {
    "feat1": "text",
    "feat2": "TEXT",
    ...
}
Use FeatureType
for automatic default preprocessing setups.
from kdp.features import FeatureType

features_specs = {
    "feat1": FeatureType.TEXT,
    "feat2": FeatureType.TEXT,
    ...
}
Available FeatureType
options include: TEXT.
Customize text preprocessing by passing specific arguments to TextFeature
.
from kdp.features import TextFeature

features_specs = {
    "feat1": TextFeature(
        name="feat1",
        feature_type=FeatureType.TEXT,
        max_tokens=100,
        stop_words=["stop", "next"],
    ),
    "feat2": TextFeature(
        name="feat2",
        feature_type=FeatureType.TEXT,
    ),
    ...
}
Here's how the text feature preprocessing pipeline looks:
"},{"location":"features/#cross-features","title":"\u274c Cross Features","text":"To implement cross features, specify a list of feature tuples in the PreprocessingModel
like so:
from kdp.processor import PreprocessingModel

ppr = PreprocessingModel(
    path_data="data/data.csv",
    features_specs={
        "feat6": FeatureType.STRING_CATEGORICAL,
        "feat7": FeatureType.INTEGER_CATEGORICAL,
    },
    feature_crosses=[("feat6", "feat7", 5)],
)
Example cross feature between INTEGER_CATEGORICAL and STRING_CATEGORICAL:
"},{"location":"features/#date-features","title":"\ud83d\udcc6 Date Features","text":"You can even process string encoded date features (format: 'YYYY-MM-DD'):
\ud83d\udd27 Using FeatureType / \ud83d\udcaa Custom DateFeature
Use FeatureType
for automatic default preprocessing setups.
from kdp.processor import PreprocessingModel

ppr = PreprocessingModel(
    path_data="data/data.csv",
    features_specs={
        "feat1": FeatureType.FLOAT,
        "feat2": FeatureType.DATE,
    },
)
Customize date preprocessing by passing specific arguments to `DateFeature`.
from kdp.features import DateFeature

features_specs = {
    "feat1": DateFeature(
        name="feat1",
        feature_type=FeatureType.DATE,
    ),
    "feat2": DateFeature(
        name="feat2",
        feature_type=FeatureType.DATE,
        # additional option to add season layer:
        add_season=True,  # adds one-hot season indicator (summer, winter, etc.); defaults to False
    ),
    ...
}
Example date and numeric processing pipeline:
"},{"location":"features/#custom-preprocessing-steps","title":"\ud83d\ude80 Custom Preprocessing Steps","text":"If you require even more customization, you can define custom preprocessing steps using the Feature
class, using preprocessors
attribute.
The preprocessors
attribute accepts a list of methods defined in PreprocessorLayerFactory
.
import tensorflow as tf
from kdp.features import Feature

features_specs = {
    "feat1": FeatureType.FLOAT_NORMALIZED,
    "feat2": Feature(
        name="custom_feature_pipeline",
        feature_type=FeatureType.FLOAT_NORMALIZED,
        preprocessors=[
            tf.keras.layers.Rescaling,
            tf.keras.layers.Normalization,
        ],
        # layers' required kwargs
        scale=1,
    )
}
Here's how the custom preprocessing pipeline looks:
The full list of available layers can be found in the Preprocessing Layers Factory.
"},{"location":"integrations/","title":"\ud83d\udd17 Integrating Preprocessing Model with other Keras Model:","text":"You can then easily ingetrate this model into your keras model as the first layer:
"},{"location":"integrations/#example-1-using-the-preprocessing-model-as-the-first-layer-of-a-sequential-model","title":"Example 1: Using the Preprocessing Model as the first layer of a Sequential Model","text":"class FunctionalModelWithPreprocessing(tf.keras.Model):\n def __init__(self, preprocessing_model: tf.keras.Model) -> None:\n \"\"\"Initialize the user model.\n\n Args:\n preprocessing_model (tf.keras.Model): The preprocessing model.\n \"\"\"\n super().__init__()\n self.preprocessing_model = preprocessing_model\n\n # Dynamically create inputs based on the preprocessing model's input shape\n inputs = {\n name: tf.keras.Input(shape=shape[1:], name=name)\n for name, shape in self.preprocessing_model.input_shape.items()\n }\n\n # You can use the preprocessing model directly in the functional API.\n x = self.preprocessing_model(inputs)\n\n # Define the dense layer as part of the model architecture\n output = tf.keras.layers.Dense(\n units=128,\n activation=\"relu\",\n )(x)\n\n # Use the Model's functional API to define inputs and outputs\n self.model = tf.keras.Model(inputs=inputs, outputs=output)\n\n def call(self, inputs: dict[str, tf.Tensor]) -> tf.Tensor:\n \"\"\"Call the item model with the given inputs.\"\"\"\n return self.model(inputs)\n\n# Defining this model is not easy with builtin preprocessing layers:\n\nfrom kdp import PreprocessingModel\nfrom kdp import FeatureType\n\n# DEFINING FEATURES PROCESSORS\nfeatures_specs = {\n # ======= NUMERICAL Features =========================\n \"feat1\": FeatureType.FLOAT_NORMALIZED,\n \"feat2\": FeatureType.FLOAT_RESCALED,\n # ======= CATEGORICAL Features ========================\n \"feat3\": FeatureType.STRING_CATEGORICAL,\n \"feat4\": FeatureType.INTEGER_CATEGORICAL,\n # ======= TEXT Features ========================\n \"feat5\": FeatureType.TEXT,\n}\n\n# INSTANTIATE THE PREPROCESSING MODEL with your data\nppr = PreprocessingModel(\n path_data=\"data/my_data.csv\",\n features_specs=features_spec,\n)\n# construct the preprocessing pipelines\nppr.build_preprocessor()\n\n# building a production / deployment ready model\nfull_model = FunctionalModelWithPreprocessing(\n preprocessing_model=ppr.model,\n)\n
"},{"location":"layers_factory/","title":"\ud83c\udfed Preprocessing Layers Factory","text":"You can find all available layers in the PreprocessorLayerFactory
class:
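normalization_layer = PreprocessorLayerFactory.create_layer(
    "Normalization",
    axis=-1,
    mean=None,
    variance=None
)

Available layers: In addition to Keras layers, the PreprocessorLayerFactory includes several custom layers specific to the KDP framework. Here's a list of available custom layers: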
cast_to_float32_layer(name='cast_to_float32', **kwargs)
staticmethod
","text":"Create a CastToFloat32Layer layer.
Parameters:

| Name | Type | Description | Default |
| ---------- | ------ | -------------------------------------------------------------- | -------------------- |
| `name` | `str` | The name of the layer. | `'cast_to_float32'` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the layer constructor. | `{}` |

Returns:

| Type | Description |
| ------- | -------------------------------------------- |
| `Layer` | An instance of the CastToFloat32Layer layer. |
"},{"location":"layers_factory/#kdp.layers_factory.PreprocessorLayerFactory.create_layer","title":"create_layer(layer_class, name=None, **kwargs)
staticmethod
","text":"Create a layer using the layer class name, automatically filtering kwargs based on the layer class.
Parameters:

| Name | Type | Description | Default |
| ------------- | --------------------- | ----------------------------------------------------------------------------------------------------------- | -------- |
| `layer_class` | `str \| Class Object` | The name of the layer class to be created (e.g., 'Normalization', 'Rescaling') or the class object itself. | required |
| `name` | `str` | The name of the layer. Optional. | `None` |
| `**kwargs` | | Additional keyword arguments to pass to the layer constructor. | `{}` |

Returns:

| Type | Description |
| ------- | --------------------------------------------- |
| `Layer` | An instance of the specified layer class. |
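As a quick illustration of create_layer with a class object rather than a string name, here is a minimal sketch (the import path `kdp.layers_factory` follows the module path shown in this section; the layer name and scale value are arbitrary):

```python
import tensorflow as tf
from kdp.layers_factory import PreprocessorLayerFactory

# Pass the layer class itself; kwargs are filtered to match its constructor.
rescaling_layer = PreprocessorLayerFactory.create_layer(
    tf.keras.layers.Rescaling,
    name="my_rescaling",
    scale=1.0 / 255.0,
)
```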
"},{"location":"layers_factory/#kdp.layers_factory.PreprocessorLayerFactory.date_encoding_layer","title":"date_encoding_layer(name='date_encoding_layer', **kwargs)
staticmethod
","text":"Create a DateEncodingLayer layer.
Parameters:

| Name | Type | Description | Default |
| ---------- | ------ | -------------------------------------------------------------- | ----------------------- |
| `name` | `str` | The name of the layer. | `'date_encoding_layer'` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the layer constructor. | `{}` |

Returns:

| Type | Description |
| ------- | ------------------------------------------- |
| `Layer` | An instance of the DateEncodingLayer layer. |
"},{"location":"layers_factory/#kdp.layers_factory.PreprocessorLayerFactory.date_parsing_layer","title":"date_parsing_layer(name='date_parsing_layer', **kwargs)
staticmethod
","text":"Create a DateParsingLayer layer.
Parameters:

| Name | Type | Description | Default |
| ---------- | ------ | -------------------------------------------------------------- | ---------------------- |
| `name` | `str` | The name of the layer. | `'date_parsing_layer'` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the layer constructor. | `{}` |

Returns:

| Type | Description |
| ------- | ------------------------------------------ |
| `Layer` | An instance of the DateParsingLayer layer. |
"},{"location":"layers_factory/#kdp.layers_factory.PreprocessorLayerFactory.date_season_layer","title":"date_season_layer(name='date_season_layer', **kwargs)
staticmethod
","text":"Create a SeasonLayer layer.
Parameters:

| Name | Type | Description | Default |
| ---------- | ------ | -------------------------------------------------------------- | --------------------- |
| `name` | `str` | The name of the layer. | `'date_season_layer'` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the layer constructor. | `{}` |

Returns:

| Type | Description |
| ------- | -------------------------------------- |
| `Layer` | An instance of the SeasonLayer layer. |
"},{"location":"layers_factory/#kdp.layers_factory.PreprocessorLayerFactory.text_preprocessing_layer","title":"text_preprocessing_layer(name='text_preprocessing', **kwargs)
staticmethod
","text":"Create a TextPreprocessingLayer layer.
Parameters:

| Name | Type | Description | Default |
| ---------- | ------ | -------------------------------------------------------------- | --------------------- |
| `name` | `str` | The name of the layer. | `'text_preprocessing'` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the layer constructor. | `{}` |

Returns:

| Type | Description |
| ------- | ------------------------------------------------- |
| `Layer` | An instance of the TextPreprocessingLayer layer. |
"},{"location":"layers_factory/#kdp.layers_factory.PreprocessorLayerFactory.transformer_block_layer","title":"transformer_block_layer(name='transformer', **kwargs)
staticmethod
","text":"Create a TransformerBlock layer.
Parameters:

| Name | Type | Description | Default |
| ---------- | ------ | -------------------------------------------------------------- | --------------- |
| `name` | `str` | The name of the layer. | `'transformer'` |
| `**kwargs` | `dict` | Additional keyword arguments to pass to the layer constructor. | `{}` |

Returns:

| Type | Description |
| ------- | ------------------------------------------- |
| `Layer` | An instance of the TransformerBlock layer. |
"},{"location":"motivation/","title":"\ud83c\udf66 The Motivation Behind Keras Data Processor","text":"The burning question now is \u2753:
Why create a new preprocessing pipeline or model when we already have an excellent tool like Keras FeatureSpace?
While Keras FeatureSpace
has been a cornerstone in many of my projects, delivering great results, I encountered significant challenges in a high-volume data project. The tool required multiple data passes (proportional to the number of features), executing .adapt
for each feature. This led to exceedingly long preprocessing times and frequent out-of-memory errors.
This experience prompted a deep dive into the internal workings of Keras FeatureSpace and motivated me to develop a new preprocessing pipeline that could handle data more efficiently, both in terms of speed and memory usage. Thus, the journey began to craft a solution that would:
Process data in a single pass, utilizing an iterative approach to avoid loading the entire dataset into memory, managed by a batch_size parameter (see the sketch after this list).
Introduce custom predefined preprocessing steps tailored to the feature type, controlled by a feature_type parameter.
Offer greater flexibility for custom preprocessing steps and a more Pythonic internal implementation.
Align closely with the API of Keras FeatureSpace (proposing something similar), with the hope that it might eventually be integrated into the KFS ecosystem.
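To make the first point concrete, here is a hypothetical sketch; the exact `batch_size` argument name and placement on `PreprocessingModel` are assumptions based on the description above, and `features_specs` is the dictionary from the earlier examples:

```python
# Hypothetical: statistics are accumulated in one streaming pass over the CSV,
# so only a single batch is ever held in memory.
ppr = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_specs,
    batch_size=50_000,  # assumed parameter controlling the streaming batch size
)
ppr.build_preprocessor()
```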
To demonstrate the effectiveness of our new preprocessing pipeline, we conducted a benchmark comparing it with the traditional Keras FeatureSpace (this will give you a glimpse of what was described earlier for the big-data cases). Here\u2019s how we did it:
Benchmarking Steps:
Setup: We configure the benchmark by specifying a set number of features in a loop. Each feature's specification (either a normalized float or a categorical string) is defined in a dictionary.
Data Generation: For each set number of data points determined in another loop, we generate mock data based on the feature specifications and data points, which is then saved to a CSV file.
Memory Management: We use garbage collection to free up memory before and after each benchmarking run, coupled with a 10-second cooldown period to ensure all operations complete fully.
Performance Measurement: For both the Keras Data Processor (KDP) and Keras Feature Space (FS), we measure and record CPU and memory usage before and after their respective functions run, noting the time taken.
Results Compilation: We collect and log results including the number of features, data points, execution time, memory, and CPU usage for each function in a structured format.
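For concreteness, here is a simplified, hypothetical sketch of that benchmarking loop; the helpers `make_feature_specs`, `generate_mock_csv`, `run_kdp`, `run_fs`, and `log_result` are illustrative placeholders, not the actual benchmark code:

```python
import gc
import time

for n_features in (10, 50, 100):
    specs = make_feature_specs(n_features)               # placeholder helper
    for n_points in (1_000, 100_000, 1_000_000):
        generate_mock_csv(specs, n_points, "bench.csv")  # placeholder helper
        for runner in (run_kdp, run_fs):                 # placeholder targets
            gc.collect()                                 # free memory between runs
            time.sleep(10)                               # cooldown period
            start = time.perf_counter()
            runner("bench.csv", specs)
            elapsed = time.perf_counter() - start
            log_result(runner.__name__, n_features, n_points, elapsed)
```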
The results clearly illustrate the benefits, especially as the complexity of the data increases:
The graph shows a steep rise in processing time with an increase in data points for both KDP
and FS
. However, KDP consistently outperforms FS
, with the gap widening as the number of data points grows.
This graph depicts the processing time increase with more features. Again, KDP
outpaces FS
, demonstrating substantial efficiency improvements.
The combined effect of the number of features and data points leads to significant performance gains on the KDP side, while the time- and memory-hungry FS struggles with the bigger and more complex datasets. This project was born from the need for better efficiency, and it\u2019s my hope to continue refining this tool with community support, pushing the boundaries of what we can achieve in data preprocessing (and maybe one day integrating it directly into Keras \u2764\ufe0f)!
There is much to be done and many features to be added, but I am excited to see where this journey takes us. Let\u2019s build something great together! \ud83d\ude80\ud83d\udd27
"},{"location":"transformer_blocks/","title":"\ud83e\udd16 TransformerBlocks \ud83c\udf1f","text":"You can add transformer blocks to your preprocessing model by simply defining required configuration when initializing the Preprocessor
class:
with the following arguments:
transfo_nr_blocks
(int): The number of transformer blocks in sequence (default=None, meaning transformer blocks are disabled).
transfo_nr_heads
(int): The number of heads for the transformer block (default=3).
transfo_ff_units
(int): The number of feed forward units for the transformer (default=16).
transfo_dropout_rate
(float): The dropout rate for the transformer block (default=0.25).
transfo_placement
(str): The placement of the transformer block, with the following options:
CATEGORICAL -> only after categorical and text variables
ALL_FEATURES -> after all concatenated features.
This uses a dedicated TransformerBlockLayer to handle the transformer block logic.
"},{"location":"transformer_blocks/#code-examples","title":"\ud83d\udcbb Code Examples:","text":"from kdp.processor import PreprocessingModel, OutputModeOptions, TransformerBlockPlacementOptions\n\nppr = PreprocessingModel(\n path_data=\"data/test_data.csv\",\n features_specs=features_specs,\n features_stats_path=\"stats_data.json\",\n output_mode=OutputModeOptions.CONCAT,\n # TRANSFORMERS BLOCK CONTROLL\n transfo_nr_blocks=3, # if 0, transformer block is disabled\n transfo_nr_heads=3,\n transfo_ff_units=16,\n transfo_dropout_rate=0.25,\n transfo_placement=TransformerBlockPlacementOptions.ALL_FEATURES,\n
There are two options for the transfo_placement argument, controlled using the TransformerBlockPlacementOptions class:
CATEGORICAL: The transformer block is applied only to the categorical + text features (TransformerBlockPlacementOptions.CATEGORICAL).
The corresponding architecture may thus look like this:
ALL_FEATURES: The transformer block is applied to all features (TransformerBlockPlacementOptions.ALL_FEATURES).
The corresponding architecture may thus look like this: