
divakaivan/lending-club-data-pipeline


Project diagram


  • Raw Lending Club data from Kaggle
  • Mage is used to orchestrate an end-to-end process including:
    • extract data using Kaggle's API and load it into Google Cloud Storage (used as a data lake); see the sketch after this list
    • create tables in BigQuery (used as a data warehouse)
    • run dbt transformation jobs
  • Terraform is used to manage and provision the infrastructure needed for the data pipeline on Google Cloud Platform
  • dbt is used to transform the data into dimension tables, add data tests, and create data documentation
  • Looker is used to create a visualisation dashboard
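
As a rough illustration of the first orchestration step, a Mage data loader block might look something like the sketch below. This is a minimal sketch rather than the project's actual code: the dataset slug, bucket name, and local path are assumptions for illustration, and it assumes the kaggle and google-cloud-storage packages are installed and credentials are configured.

```python
import os

from google.cloud import storage
from kaggle.api.kaggle_api_extended import KaggleApi

# Hypothetical values for illustration only; the real pipeline defines its own.
KAGGLE_DATASET = "wordsforthewise/lending-club"  # assumed dataset slug
BUCKET_NAME = "lending-club-data-lake"           # assumed GCS bucket name
LOCAL_DIR = "raw_data"


def extract_and_load_to_gcs():
    # Authenticate with the Kaggle API (reads kaggle.json from KAGGLE_CONFIG_DIR)
    api = KaggleApi()
    api.authenticate()

    # Download and unzip the raw Lending Club files locally
    api.dataset_download_files(KAGGLE_DATASET, path=LOCAL_DIR, unzip=True)

    # Upload each extracted file to the bucket that serves as the data lake
    bucket = storage.Client().bucket(BUCKET_NAME)
    for file_name in os.listdir(LOCAL_DIR):
        blob = bucket.blob(f"raw/{file_name}")
        blob.upload_from_filename(os.path.join(LOCAL_DIR, file_name))


if __name__ == "__main__":
    extract_and_load_to_gcs()
```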

Visualisation dashboard

Click here for an interactive version in Looker


Data lineage overview

For a full view, visit the project's data documentation generated by dbt.


Mage pipeline overview

The pipeline below takes raw data from Kaggle and outputs data ready to be visualised in Looker.
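
One step in that pipeline registers the raw files in BigQuery. Below is a hedged sketch of how that step could look with the google-cloud-bigquery client, assuming an external table over CSVs in the lake bucket; all identifiers here are illustrative, not the project's actual names.

```python
from google.cloud import bigquery

# Illustrative identifiers only; the real project sets these via Terraform/Mage.
PROJECT_ID = "your-gcp-project"
DATASET_ID = f"{PROJECT_ID}.lending_club"
SOURCE_URI = "gs://lending-club-data-lake/raw/*.csv"


def create_raw_external_table():
    client = bigquery.Client(project=PROJECT_ID)
    client.create_dataset(DATASET_ID, exists_ok=True)

    # Point an external table at the raw CSVs sitting in the data lake
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = [SOURCE_URI]
    external_config.autodetect = True  # infer the schema from the files

    table = bigquery.Table(f"{DATASET_ID}.raw_loans")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)


if __name__ == "__main__":
    create_raw_external_table()
```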


Reproducibility

  1. Clone the repository:

     git clone https://github.com/divakaivan/lending-club-data-pipeline.git

  2. Go to the repository folder in your terminal and type make

  3. Follow the on-screen instructions to set up GCP resources and start Mage (http://localhost:6789/)

If running for the first time, run options 1-5 in order
Usage: make [option]

Options:
  help                 Show this help message
  gcp-tf-init          1. Initialize GCP resources
  gcp-tf-plan          2. See GCP resources to be created
  gcp-tf-apply         3. Create GCP resources
  docker-build         4. Build Mage environment
  docker-up            5. Start Mage environment
  docker-down          6. Stop Mage environment

Make sure to place your kaggle.json and gcp-creds.json files in terraform/keys/ so that Terraform and Mage can access them.
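
For reference, both client libraries discover those keys through environment variables; a minimal sketch, assuming the paths are relative to the repository root:

```python
import os

# Point the GCP client libraries and the Kaggle API at the keys in terraform/keys/.
# Paths are assumptions based on the layout described above.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "terraform/keys/gcp-creds.json"
os.environ["KAGGLE_CONFIG_DIR"] = "terraform/keys"  # directory holding kaggle.json
```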

Things to consider for improvements

  • increase data volume: at the moment the pipeline uses a dataset with ~400K observations
  • data modelling: at the moment the final result is only three dimension tables related to loans, borrowers, and dates; more complex data models could be created
  • automate documentation hosting: the current documentation was hosted manually via Netlify file upload
  • create a more complex Looker dashboard

Blog posts about this project

I developed the current version of this project (as of 30 June 2024) over the span of four days and discussed each day in my self-study blog:

  • Day 178: Starting 'Lending club data engineering project'
  • Day 179: Using Docker, Makefile, and starting Data modelling for my Lending club project
  • Day 180: From Kaggle to BigQuery dimension tables - an end2end pipeline
  • Day 181: Lending club data engineering project - Done