divakaivan/insurance-fraud-mlops-pipeline

Machine Learning Canvas

[Image: Machine Learning Canvas]

Template from ownml.co

Project diagram

  • Terraform is used to provision Google Cloud Storage (GCS) buckets for data storage and a virtual machine for MLflow
  • Kaggle data related to Car Insurance Fraud claims is used
  • Google Cloud Storage is used as a data lake to store the raw Kaggle data
  • PostgreSQL (and pgAdmin) is used as a data mart to store processed data and data ready for model training
  • Prefect and Prefect Cloud are used for orchestration. Flows used:
    • upload raw Kaggle data to GCS
    • preprocess data and provide a simple check that meaningful features are consistent
    • train a new model
    • do batch model prediction
    • update Evidently monitoring artifacts (HTMLs + PNGs)
  • MLflow is used for experiment tracking. It is hosted on a GCP VM
  • Docker is used to run Grafana, PostgreSQL, pgAdmin, and FastAPI (for optional web-service deployment)
  • Grafana is used to create a dashboard using data from the PostgreSQL database
  • Evidently is used to create data reports (comparing current vs reference data distributions) and monitoring HTML reports, which are hosted using Streamlit
  • Other features:

Feature engineering

  • The target variable is FraudFound_P: 0 for Not Fraud and 1 for Fraud
  • Information Value (IV) and Weight of Evidence (WoE) were used for feature engineering (feature_n_model_exploration/feature_eng.ipynb), which reduced the feature count from 32 to 13
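
To make the IV/WoE step above concrete, here is a minimal sketch of how the two quantities can be computed for one categorical feature against a binary target such as FraudFound_P. It is an illustration only, not the code from feature_eng.ipynb: the function name woe_iv, the smoothing parameter, and the toy categories "A" and "B" are assumptions. The sketch uses the convention WoE = ln(share of non-fraud rows / share of fraud rows) per category; some references use the inverse ratio, which flips the sign of WoE but leaves IV unchanged.

```python
import math
from collections import Counter


def woe_iv(values, target, smoothing=0.5):
    """Return ({category: WoE}, IV) for one categorical feature.

    values: the category of each row.
    target: 0 (not fraud) or 1 (fraud) for each row.
    smoothing: added to every count so an empty cell cannot cause log(0)
               (an assumption of this sketch, not taken from the project).
    """
    events = Counter(v for v, t in zip(values, target) if t == 1)
    non_events = Counter(v for v, t in zip(values, target) if t == 0)
    categories = sorted(set(values))

    total_events = sum(events.values()) + smoothing * len(categories)
    total_non_events = sum(non_events.values()) + smoothing * len(categories)

    woe = {}
    iv = 0.0
    for c in categories:
        pct_event = (events[c] + smoothing) / total_events
        pct_non_event = (non_events[c] + smoothing) / total_non_events
        woe[c] = math.log(pct_non_event / pct_event)
        # Each term is non-negative, so IV grows with the separation
        # between the fraud and non-fraud distributions.
        iv += (pct_non_event - pct_event) * woe[c]
    return woe, iv


# Toy data: category "A" is mostly not fraud, category "B" is mostly fraud.
toy_values = ["A"] * 6 + ["B"] * 4
toy_target = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
toy_woe, toy_iv = woe_iv(toy_values, toy_target)
print(toy_woe, toy_iv)
```

A feature whose categories have the same fraud and non-fraud shares gets an IV of zero, so ranking features by IV is one plausible way to arrive at a reduced feature set; the notebook holds the criteria actually used.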

Model selection

Monitoring

  • Evidently dashboard (public link). It includes:
    • reference vs current dataset comparison dashboard
    • data test dashboard
    • SHAP values
  • Grafana dashboard
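
The reference vs current comparison listed above is produced by Evidently. Purely to illustrate the idea of comparing two distributions, here is a minimal sketch of the Population Stability Index (PSI) for a categorical feature. It is not Evidently's API and not necessarily the statistic the dashboard reports; the function name psi, the eps floor, and the toy category names are assumptions.

```python
import math
from collections import Counter


def psi(reference, current, eps=1e-6):
    """Population Stability Index between two samples of a categorical feature.

    A value of 0 means both samples have identical category shares;
    larger values mean the current sample has drifted from the reference.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    score = 0.0
    for category in set(reference) | set(current):
        # eps keeps the logarithm finite when a category is missing
        # from one of the two samples.
        ref_share = max(ref_counts[category] / len(reference), eps)
        cur_share = max(cur_counts[category] / len(current), eps)
        score += (cur_share - ref_share) * math.log(cur_share / ref_share)
    return score


# Toy samples with hypothetical category names.
reference_sample = ["Sedan"] * 70 + ["Sport"] * 30
current_sample = ["Sedan"] * 40 + ["Sport"] * 60
print(psi(reference_sample, reference_sample))  # no drift
print(psi(reference_sample, current_sample))    # shifted shares
```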

Reproducibility

  • clone the repository
https://github.com/divakaivan/insurance-fraud-mlops-pipeline.git
  • add your environment variables to sample.env and rename it to .env. This file is used when running the Docker services
  • type make in the terminal to see the options for setting up the project; the output should look similar to the help text below. To serve the Prefect flows, you need a Prefect Cloud account and must already be logged in
Usage: make [option]

Options:
  help                 Show this help message
  gcp-setup            View GCP resources to be created (buckets for mlflow artifacts, raw data, and start a VM that runs mlflow on start)
  gcp-create           Create GCP resources 
  prefect-serve-cloud  Serve Model Train and Load to GCS flows to Prefect Cloud
  build-all            Build image with PostgreSQL, pgAdmin, Grafana, Data upload to db, FastAPI
  start-all            Start services
  monitoring           Update monitoring artifacts 

NOTE! If you update the monitoring UI artifacts, you have to push them to GitHub to update the hosted UI

prefect-serve-cloud and start-all do not run in detached mode, so each one occupies the terminal it is started in

Things to consider for improvements

  • try more model ensembles and sampling algorithms (e.g. ADASYN)
  • adding more (data/code/model stress) tests
  • simulating a new dataset and using it on the batch predict flow
  • more complex monitoring dashboards using Grafana and Evidently
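
The first item mentions sampling algorithms for the imbalanced fraud target. As a rough picture of what such a step does, here is a minimal sketch of plain random oversampling, which duplicates minority-class rows until both classes have the same size. This is a deliberately simpler stand-in and not ADASYN, which generates synthetic minority points rather than copies. The function name random_oversample and the toy rows are assumptions.

```python
import random


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate randomly chosen minority rows until the classes are balanced.

    Returns new lists and leaves the inputs unchanged.
    """
    rng = random.Random(seed)
    minority_rows = [row for row, y in zip(rows, labels) if y == minority]
    majority_count = sum(1 for y in labels if y != minority)
    extra = max(majority_count - len(minority_rows), 0)
    if not minority_rows or extra == 0:
        return list(rows), list(labels)
    sampled = [rng.choice(minority_rows) for _ in range(extra)]
    return list(rows) + sampled, list(labels) + [minority] * extra


# Toy data: 8 non-fraud rows and 2 fraud rows.
toy_rows = [{"id": i} for i in range(10)]
toy_labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(toy_rows, toy_labels)
print(len(balanced_rows), sum(balanced_labels))
```

Resampling of this kind is normally applied to the training split only, so that evaluation data keeps the original class balance.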

Blog posts about this project

I developed the current version of the project (as of 8 July 2024) over 8 days and wrote about each day in my self-study blog:

  • Day 182: Learning about feature selection in fraud detection and finding a classifier model with low recall
  • Day 183: Failing to install Kubeflow, and setting up mlflow on GCP
  • Day 184: Mlflow experiment tracking and trying out metaflow
  • Day 185: Using prefect as my orchestrator for my MLOps project
  • Day 186: Prefect cloud, model serving with FastAPI, and SHAP values
  • Day 187: Setting up postgres, pgAdmin, Grafana and FastAPI to run in Docker
  • Day 188: Setting up automatically updated monitoring UI using streamlit
  • Day 189: I finished the Car Insurance Fraud MLOps project. Thank you MLOps zoomcamp for teaching me so much!

About

Capstone project for DataTalksClub's MLOps zoomcamp. Check my blog:
