Architecture diagram (template from ownml.co)
- Terraform is used to provision Google Cloud Storage (GCS) buckets for data storage and a virtual machine for MLflow
- A Kaggle dataset of car insurance fraud claims is used
- Google Cloud Storage is used as a data lake to store the raw Kaggle data
- PostgreSQL (and pgAdmin) is used as a data mart to store processed data and data ready for model training
- Prefect and Prefect Cloud are used for orchestration (a minimal sketch of one flow follows this list). Flows used:
- upload raw Kaggle data to GCS
- preprocess data and provide a simple check that meaningful features are consistent
- train a new model
- do batch model prediction
- update Evidently monitoring artifacts (HTMLs + PNGs)
- MLflow is used for experiment tracking. It is hosted on a GCP VM
- Docker is used to run Grafana, PostgreSQL, pgAdmin, and FastAPI (for optional web-service deployment)
- Grafana is used to create a dashboard using data from the PostgreSQL database
- Evidently is used to create data reports (comparing current vs reference data distributions) and model monitoring HTML reports, which are hosted using Streamlit
- Other features:
- Python documentation using Sphinx
- Makefile for easy setup and start
- git pre-commit hooks
- pytest tests
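
Each of the Prefect flows above is a decorated Python function. Below is a minimal sketch of what the raw-data upload flow might look like; the bucket name, file name, and function names are illustrative assumptions, not the repo's actual code:

```python
from pathlib import Path

from google.cloud import storage
from prefect import flow, task


@task
def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload a local file to a GCS bucket and return its gs:// URI."""
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"


@flow(name="load-raw-data-to-gcs")
def load_raw_data_to_gcs(local_path: str = "data/fraud_oracle.csv",
                         bucket_name: str = "raw-data-bucket") -> None:
    # The GCS bucket acts as the data lake for the raw Kaggle data
    uri = upload_to_gcs(local_path, bucket_name, Path(local_path).name)
    print(f"Raw data uploaded to {uri}")
```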
- The target variable is `FraudFound_P`: 0 for Not Fraud, 1 for Fraud
- Information Value (IV) and Weight of Evidence (WoE) were used for feature engineering (feature_n_model_exploration/feature_eng.ipynb), reducing the feature count from 32 to 13 (a sketch of the IV computation follows this list)
- Model selection criterion: highest recall, since the goal is to recognise as many fraud cases as possible
- The best model ended up being a Balanced Random Forest Classifier, achieving 92-96% recall (see the training sketch below)
- Code in feature_n_model_exploration/experiment_tracking.ipynb
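
As a rough illustration of the IV idea (conventions for WoE/IV vary; the actual implementation is in feature_n_model_exploration/feature_eng.ipynb), the IV of one categorical feature against the binary target could be computed like this:

```python
import numpy as np
import pandas as pd


def information_value(df: pd.DataFrame, feature: str,
                      target: str = "FraudFound_P") -> float:
    """IV of a single categorical feature against a binary target (1 = Fraud)."""
    grouped = df.groupby(feature)[target].agg(["count", "sum"])
    events = grouped["sum"]                 # fraud cases per category
    non_events = grouped["count"] - events  # non-fraud cases per category
    # A small constant avoids division by zero / log(0) for sparse categories
    pct_event = (events + 0.5) / events.sum()
    pct_non_event = (non_events + 0.5) / non_events.sum()
    woe = np.log(pct_non_event / pct_event)  # Weight of Evidence per category
    return float(((pct_non_event - pct_event) * woe).sum())
```

Keeping only features above a chosen IV threshold (commonly around 0.02) is one standard way to get a reduction like 32 down to 13.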
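A hedged sketch of the experiment-tracking and model-selection step described above; the tracking URI, data path, and hyperparameters are placeholders, and the real code lives in the notebook:

```python
import mlflow
import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://<mlflow-vm-external-ip>:5000")  # MLflow on the GCP VM
mlflow.set_experiment("insurance-fraud")

df = pd.read_csv("data/processed.csv")  # 13 IV-selected features + target (illustrative path)
X, y = df.drop(columns=["FraudFound_P"]), df["FraudFound_P"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

with mlflow.start_run():
    model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    recall = recall_score(y_val, model.predict(X_val))  # the selection metric
    mlflow.log_metric("recall", recall)
    mlflow.sklearn.log_model(model, artifact_path="model")
```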
Evidently dashboard (public link). It includes the following (a sketch of generating these artifacts follows the list):
- reference vs current dataset comparison dashboard
- data test dashboard
- SHAP values
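
A minimal sketch of how the report and test artifacts could be produced, assuming Evidently's 0.4-style Report/TestSuite API and illustrative file paths (the SHAP panel would be produced separately with the shap library):

```python
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report
from evidently.test_preset import DataStabilityTestPreset
from evidently.test_suite import TestSuite

reference = pd.read_csv("data/reference.csv")  # training-time snapshot
current = pd.read_csv("data/current.csv")      # latest batch

# Reference vs current distribution comparison, saved as a static HTML page
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("monitoring/data_drift_report.html")

# Data test dashboard: pass/fail checks on the current batch
tests = TestSuite(tests=[DataStabilityTestPreset()])
tests.run(reference_data=reference, current_data=current)
tests.save_html("monitoring/data_tests.html")
```

The saved HTML files are what the Streamlit app serves as the hosted monitoring UI.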
- Clone the repository: `git clone https://github.com/divakaivan/insurance-fraud-mlops-pipeline.git`
- Add your environment variables and rename `sample.env` to `.env`. This file is used for running the Docker services
- Type `make` in the terminal and you should see something similar to the output below, which you can use to set up the project. You need a Prefect Cloud account and to be logged in already in order to serve the Prefect flows
```
Usage: make [option]

Options:
  help                  Show this help message
  gcp-setup             View GCP resources to be created (buckets for mlflow artifacts, raw data, and start a VM that runs mlflow on start)
  gcp-create            Create GCP resources
  prefect-serve-cloud   Serve Model Train and Load to GCS flows to Prefect Cloud
  build-all             Build image with PostgreSQL, pgAdmin, Grafana, Data upload to db, FastAPI
  start-all             Start services
  monitoring            Update monitoring artifacts
```
NOTE! If you update the monitoring UI artifacts, you have to push them to GitHub to update the hosted UI
`prefect-serve-cloud` and `start-all` are not run in detached mode (a sketch of what `prefect-serve-cloud` does is shown below)
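
For reference, `prefect-serve-cloud` boils down to something like the following; the module paths and deployment names are assumptions, not the repo's actual values. It blocks in the foreground, which is why the target is not detached:

```python
from prefect import serve

# Illustrative module paths for the flows defined in this project
from flows.load_to_gcs import load_raw_data_to_gcs
from flows.train import train_model

if __name__ == "__main__":
    # Registers the flows as deployments in the logged-in Prefect Cloud
    # workspace and keeps serving them until interrupted
    serve(
        load_raw_data_to_gcs.to_deployment(name="load-to-gcs"),
        train_model.to_deployment(name="model-train"),
    )
```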
Possible future improvements:
- try more model ensembles and sampling algorithms (e.g. ADASYN; see the sketch after this list)
- add more (data/code/model stress) tests
- simulate a new dataset and use it in the batch predict flow
- build more complex monitoring dashboards using Grafana and Evidently
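
For the sampling idea, a tiny ADASYN example with imbalanced-learn, using toy data standing in for the fraud dataset:

```python
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

# Toy imbalanced dataset (~6% positives) standing in for the fraud data
X, y = make_classification(n_samples=1_000, weights=[0.94], random_state=42)

# ADASYN synthesises new minority-class samples, focusing on harder regions
X_res, y_res = ADASYN(random_state=42).fit_resample(X, y)
print(f"positives before: {y.sum()}, after: {y_res.sum()}")
```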
I developed the current version of this project (as of 8 July 2024) over the span of 8 days and discussed each day in my self-study blog:
- Day 182: Learning about feature selection in fraud detection and finding a classifier model with low recall
- Day 183: Failing to install Kubeflow, and setting up mlflow on GCP
- Day 184: Mlflow experiment tracking and trying out metaflow
- Day 185: Using prefect as my orchestrator for my MLOps project
- Day 186: Prefect cloud, model serving with FastAPI, and SHAP values
- Day 187: Setting up postgres, pgAdmin, Grafana and FastAPI to run in Docker
- Day 188: Setting up automatically updated monitoring UI using streamlit
- Day 189: I finished the Car Insurance Fraud MLOps project. Thank you MLOps zoomcamp for teaching me so much!