Architecture diagram (template from ownml.co)
- Terraform is used to provision Google Cloud Storage (GCS) buckets for data storage and a virtual machine for MLflow
- A Kaggle dataset of car insurance fraud claims is used
- Google Cloud Storage is used as a data lake to store the raw Kaggle data
- PostgreSQL (and pgAdmin) is used as a data mart to store processed data and data ready for model training
- Prefect and Prefect Cloud are used for orchestration (a minimal sketch of one flow follows this list). Flows used:
- upload raw Kaggle data to GCS
- preprocess data and provide a simple check that meaningful features are consistent
- train a new model
- do batch model prediction
- update Evidently monitoring artifacts (HTMLs + PNGs)
- MLflow is used for experiment tracking. It is hosted on a GCP VM
- Docker is used to run Grafana, PostgreSQL, pgAdmin, and FastAPI (for optional web-service deployment)
- Grafana is used to create a dashboard using data from the PostgreSQL database
- Evidently is used to create data reports (comparing current vs reference data distributions) and model monitoring HTML reports, which are hosted using Streamlit
- Other features:
- Python documentation using Sphinx
- Makefile for easy setup and start
- git pre-commit hooks
- pytest tests
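
Each of the Prefect flows above is a decorated Python function. Below is a minimal sketch of what the raw-data upload flow might look like; the bucket name, file name, and function names are illustrative assumptions, not the repo's actual code:

```python
from pathlib import Path

from google.cloud import storage
from prefect import flow, task


@task
def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload a local file to a GCS bucket and return its gs:// URI."""
    client = storage.Client()
    client.bucket(bucket_name).blob(blob_name).upload_from_filename(local_path)
    return f"gs://{bucket_name}/{blob_name}"


@flow(name="load-raw-data-to-gcs")
def load_raw_data_to_gcs(local_path: str = "data/fraud_oracle.csv",
                         bucket_name: str = "raw-data-bucket") -> None:
    # The GCS bucket acts as the data lake for the raw Kaggle data
    uri = upload_to_gcs(local_path, bucket_name, Path(local_path).name)
    print(f"Raw data uploaded to {uri}")
```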
- The target variable is `FraudFound_P`: 0 for Not Fraud, 1 for Fraud
- Information Value (IV) and Weight of Evidence (WoE) were used for feature engineering (feature_n_model_exploration/feature_eng.ipynb), reducing the feature count from 32 to 13 (a sketch of the IV computation follows this list)
- Model selection criterion: highest recall, since the goal is to recognise as many fraud cases as possible
- The best model ended up being a Balanced Random Forest Classifier, achieving 92-96% recall (see the training sketch below)
- Code in feature_n_model_exploration/experiment_tracking.ipynb
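
As a rough illustration of the IV idea (conventions for WoE/IV vary; the actual implementation is in feature_n_model_exploration/feature_eng.ipynb), the IV of one categorical feature against the binary target could be computed like this:

```python
import numpy as np
import pandas as pd


def information_value(df: pd.DataFrame, feature: str,
                      target: str = "FraudFound_P") -> float:
    """IV of a single categorical feature against a binary target (1 = Fraud)."""
    grouped = df.groupby(feature)[target].agg(["count", "sum"])
    events = grouped["sum"]                 # fraud cases per category
    non_events = grouped["count"] - events  # non-fraud cases per category
    # A small constant avoids division by zero / log(0) for sparse categories
    pct_event = (events + 0.5) / events.sum()
    pct_non_event = (non_events + 0.5) / non_events.sum()
    woe = np.log(pct_non_event / pct_event)  # Weight of Evidence per category
    return float(((pct_non_event - pct_event) * woe).sum())
```

Keeping only features above a chosen IV threshold (commonly around 0.02) is one standard way to get a reduction like 32 down to 13.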
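A hedged sketch of the experiment-tracking and model-selection step described above; the tracking URI, data path, and hyperparameters are placeholders, and the real code lives in the notebook:

```python
import mlflow
import pandas as pd
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://<mlflow-vm-external-ip>:5000")  # MLflow on the GCP VM
mlflow.set_experiment("insurance-fraud")

df = pd.read_csv("data/processed.csv")  # 13 IV-selected features + target (illustrative path)
X, y = df.drop(columns=["FraudFound_P"]), df["FraudFound_P"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

with mlflow.start_run():
    model = BalancedRandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    recall = recall_score(y_val, model.predict(X_val))  # the selection metric
    mlflow.log_metric("recall", recall)
    mlflow.sklearn.log_model(model, artifact_path="model")
```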
Evidently dashboard (public link). It includes the following (a sketch of generating these artifacts follows the list):
- reference vs current dataset comparison dashboard
- data test dashboard
- SHAP values
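
A minimal sketch of how the report and test artifacts could be produced, assuming Evidently's 0.4-style Report/TestSuite API and illustrative file paths (the SHAP panel would be produced separately with the shap library):

```python
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report
from evidently.test_preset import DataStabilityTestPreset
from evidently.test_suite import TestSuite

reference = pd.read_csv("data/reference.csv")  # training-time snapshot
current = pd.read_csv("data/current.csv")      # latest batch

# Reference vs current distribution comparison, saved as a static HTML page
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("monitoring/data_drift_report.html")

# Data test dashboard: pass/fail checks on the current batch
tests = TestSuite(tests=[DataStabilityTestPreset()])
tests.run(reference_data=reference, current_data=current)
tests.save_html("monitoring/data_tests.html")
```

The saved HTML files are what the Streamlit app serves as the hosted monitoring UI.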
- Clone the repository: `git clone https://github.com/divakaivan/insurance-fraud-mlops-pipeline.git`
- Add your environment variables and rename `sample.env` to `.env`. This file is used for running the Docker services
- Type `make` in the terminal and you should see something similar to the output below, which you can use to set up the project. You need a Prefect Cloud account and to be logged in already in order to serve the Prefect flows
```
Usage: make [option]

Options:
  help                  Show this help message
  gcp-setup             View GCP resources to be created (buckets for mlflow artifacts, raw data, and start a VM that runs mlflow on start)
  gcp-create            Create GCP resources
  prefect-serve-cloud   Serve Model Train and Load to GCS flows to Prefect Cloud
  build-all             Build image with PostgreSQL, pgAdmin, Grafana, Data upload to db, FastAPI
  start-all             Start services
  monitoring            Update monitoring artifacts
```
NOTE! If you update the monitoring UI artifacts, you have to push them to GitHub to update the hosted UI
`prefect-serve-cloud` and `start-all` are not run in detached mode (a sketch of what `prefect-serve-cloud` does is shown below)
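
For reference, `prefect-serve-cloud` boils down to something like the following; the module paths and deployment names are assumptions, not the repo's actual values. It blocks in the foreground, which is why the target is not detached:

```python
from prefect import serve

# Illustrative module paths for the flows defined in this project
from flows.load_to_gcs import load_raw_data_to_gcs
from flows.train import train_model

if __name__ == "__main__":
    # Registers the flows as deployments in the logged-in Prefect Cloud
    # workspace and keeps serving them until interrupted
    serve(
        load_raw_data_to_gcs.to_deployment(name="load-to-gcs"),
        train_model.to_deployment(name="model-train"),
    )
```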
Possible future improvements:
- try more model ensembles and sampling algorithms (e.g. ADASYN; see the sketch after this list)
- add more (data/code/model stress) tests
- simulate a new dataset and use it in the batch predict flow
- build more complex monitoring dashboards using Grafana and Evidently
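
For the sampling idea, a tiny ADASYN example with imbalanced-learn, using toy data standing in for the fraud dataset:

```python
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

# Toy imbalanced dataset (~6% positives) standing in for the fraud data
X, y = make_classification(n_samples=1_000, weights=[0.94], random_state=42)

# ADASYN synthesises new minority-class samples, focusing on harder regions
X_res, y_res = ADASYN(random_state=42).fit_resample(X, y)
print(f"positives before: {y.sum()}, after: {y_res.sum()}")
```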
I developed the current version of this project (as of 8 July 2024) over the span of 8 days and discussed each day in my self-study blog:
- Day 182: Learning about feature selection in fraud detection and finding a classifier model with low recall
- Day 183: Failing to install Kubeflow, and setting up mlflow on GCP
- Day 184: Mlflow experiment tracking and trying out metaflow
- Day 185: Using prefect as my orchestrator for my MLOps project
- Day 186: Prefect cloud, model serving with FastAPI, and SHAP values
- Day 187: Setting up postgres, pgAdmin, Grafana and FastAPI to run in Docker
- Day 188: Setting up automatically updated monitoring UI using streamlit
- Day 189: I finished the Car Insurance Fraud MLOps project. Thank you MLOps zoomcamp for teaching me so much!