
divakaivan/lending-club-data-pipeline


Project diagram


  • Raw Lending Club data from Kaggle
  • Mage is used to orchestrate an end-to-end process including:
    • extract data using Kaggle's API and load it into Google Cloud Storage (used as a data lake); see the sketch after this list
    • create tables in BigQuery (used as a data warehouse)
    • run dbt transformation jobs
  • Terraform is used to manage and provision the infrastructure needed for the data pipeline on Google Cloud Platform
  • dbt is used to transform the data into dimension tables, add data tests, and create data documentation
  • Looker is used to create a visualisation dashboard
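
As a rough illustration of the first orchestration step, a Mage data loader block might look something like the sketch below. This is a minimal sketch rather than the project's actual code: the dataset slug, bucket name, and local path are assumptions for illustration, and it assumes the kaggle and google-cloud-storage packages are installed and credentials are configured.

```python
import os

from google.cloud import storage
from kaggle.api.kaggle_api_extended import KaggleApi

# Hypothetical values for illustration only; the real pipeline defines its own.
KAGGLE_DATASET = "wordsforthewise/lending-club"  # assumed dataset slug
BUCKET_NAME = "lending-club-data-lake"           # assumed GCS bucket name
LOCAL_DIR = "raw_data"


def extract_and_load_to_gcs():
    # Authenticate with the Kaggle API (reads kaggle.json from KAGGLE_CONFIG_DIR)
    api = KaggleApi()
    api.authenticate()

    # Download and unzip the raw Lending Club files locally
    api.dataset_download_files(KAGGLE_DATASET, path=LOCAL_DIR, unzip=True)

    # Upload each extracted file to the bucket that serves as the data lake
    bucket = storage.Client().bucket(BUCKET_NAME)
    for file_name in os.listdir(LOCAL_DIR):
        blob = bucket.blob(f"raw/{file_name}")
        blob.upload_from_filename(os.path.join(LOCAL_DIR, file_name))


if __name__ == "__main__":
    extract_and_load_to_gcs()
```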

Visualisation dashboard

Click here for an interactive version in Looker


Data lineage overview

For a full view, visit the project's data documentation generated by dbt.


Mage pipeline overview

The pipeline below takes raw data from Kaggle and outputs data ready to be visualised in Looker.
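
One step in that pipeline registers the raw files in BigQuery. Below is a hedged sketch of how that step could look with the google-cloud-bigquery client, assuming an external table over CSVs in the lake bucket; all identifiers here are illustrative, not the project's actual names.

```python
from google.cloud import bigquery

# Illustrative identifiers only; the real project sets these via Terraform/Mage.
PROJECT_ID = "your-gcp-project"
DATASET_ID = f"{PROJECT_ID}.lending_club"
SOURCE_URI = "gs://lending-club-data-lake/raw/*.csv"


def create_raw_external_table():
    client = bigquery.Client(project=PROJECT_ID)
    client.create_dataset(DATASET_ID, exists_ok=True)

    # Point an external table at the raw CSVs sitting in the data lake
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = [SOURCE_URI]
    external_config.autodetect = True  # infer the schema from the files

    table = bigquery.Table(f"{DATASET_ID}.raw_loans")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)


if __name__ == "__main__":
    create_raw_external_table()
```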


Reproducibility

  1. Clone the repository:

     git clone https://github.com/divakaivan/lending-club-data-pipeline.git

  2. Go to the repository folder in your terminal and type make

  3. Follow the on-screen instructions to set up GCP resources and start Mage (http://localhost:6789/)

If running for the first time, run options 1-5 in order
Usage: make [option]

Options:
  help                 Show this help message
  gcp-tf-init          1. Initialize GCP resources
  gcp-tf-plan          2. See GCP resources to be created
  gcp-tf-apply         3. Create GCP resources
  docker-build         4. Build Mage environment
  docker-up            5. Start Mage environment
  docker-down          6. Stop Mage environment

Make sure to place your kaggle.json and gcp-creds.json files in terraform/keys/ so that Terraform and Mage can access them.
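
For reference, both client libraries discover those keys through environment variables; a minimal sketch, assuming the paths are relative to the repository root:

```python
import os

# Point the GCP client libraries and the Kaggle API at the keys in terraform/keys/.
# Paths are assumptions based on the layout described above.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "terraform/keys/gcp-creds.json"
os.environ["KAGGLE_CONFIG_DIR"] = "terraform/keys"  # directory holding kaggle.json
```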

Things to consider for improvements

  • increase data volume: at the moment the pipeline uses a dataset with ~400K observations
  • data modelling: at the moment the final result is only three dimension tables related to loans, borrowers, and dates; more complex data models could be created
  • automate documentation hosting: the current documentation was hosted manually via Netlify file upload
  • create a more complex Looker dashboard

Blog posts about this project

I developed the current version of this project (as of 30 June 2024) over the span of four days and discussed each day in my self-study blog:

  • Day 178: Starting 'Lending club data engineering project'
  • Day 179: Using Docker, Makefile, and starting Data modelling for my Lending club project
  • Day 180: From Kaggle to BigQuery dimension tables - an end2end pipeline
  • Day 181: Lending club data engineering project - Done