- Raw Lending Club data from Kaggle
- Mage is used to orchestrate an end-to-end process, including:
  - extracting data using Kaggle's API and loading it into Google Cloud Storage (used as a data lake)
  - creating tables in BigQuery (used as a data warehouse)
  - running dbt transformation jobs
- Terraform is used to manage and provision the infrastructure needed for the data pipeline on Google Cloud Platform
- dbt is used to transform the data into dimension tables, add data tests, and create data documentation
- Looker is used to create a visualisation dashboard
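The extract-and-load step that Mage orchestrates can be sketched as a plain Python function. This is a minimal illustration, not the project's actual Mage block: the dataset slug, bucket name, and function names are assumptions, and in Mage this logic would live inside loader/exporter blocks.

```python
from pathlib import Path


def gcs_blob_name(local_path: str, prefix: str = "raw") -> str:
    """Destination object name inside the bucket for a local file (hypothetical naming scheme)."""
    return f"{prefix}/{Path(local_path).name}"


def extract_and_load(dataset: str, bucket_name: str, out_dir: str = "data") -> None:
    """Download a Kaggle dataset and upload its CSVs to a GCS bucket (data lake)."""
    # Imports deferred: these require the `kaggle` and `google-cloud-storage`
    # packages plus valid credentials (~/.kaggle/kaggle.json, GCP service account).
    from kaggle.api.kaggle_api_extended import KaggleApi
    from google.cloud import storage

    api = KaggleApi()
    api.authenticate()  # reads ~/.kaggle/kaggle.json
    api.dataset_download_files(dataset, path=out_dir, unzip=True)

    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS
    bucket = client.bucket(bucket_name)
    for csv in Path(out_dir).glob("*.csv"):
        bucket.blob(gcs_blob_name(str(csv))).upload_from_filename(str(csv))
```

From GCS, the files can then be registered as BigQuery tables for dbt to transform.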
Click here for an interactive version in Looker
For a full view, visit the project's data documentation generated by dbt.
The pipeline below takes raw data from Kaggle and outputs data ready to be visualised in Looker.
- Clone the repository: https://github.com/divakaivan/lending-club-data-pipeline.git
- Go to the repository folder in your terminal and type `make`
- Follow the on-screen instructions to set up GCP resources and start Mage (http://localhost:6789/)
If running for the first time, run options 1-5 in order:
```
Usage: make [option]

Options:
  help          Show this help message
  gcp-tf-init   1. Initialize GCP resources
  gcp-tf-plan   2. See GCP resources to be created
  gcp-tf-apply  3. Create GCP resources
  docker-build  4. Build Mage environment
  docker-up     5. Start Mage environment
  docker-down   6. Stop Mage environment
```
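The targets above could be backed by a Makefile along these lines. This is a sketch, assuming the Terraform configuration lives in `terraform/` and a `docker-compose.yml` sits at the repository root (both assumptions, not confirmed by the source):

```makefile
# Hypothetical Makefile sketch matching the help output above.
gcp-tf-init:
	cd terraform && terraform init

gcp-tf-plan:
	cd terraform && terraform plan

gcp-tf-apply:
	cd terraform && terraform apply

docker-build:
	docker compose build

docker-up:
	docker compose up -d

docker-down:
	docker compose down
```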
Make sure to place your `kaggle.json` and `gcp-creds.json` files in `terraform/keys/` so that Terraform and Mage can access them.
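Assuming the layout implied above, the credentials directory would look like:

```
terraform/keys/
├── kaggle.json     # Kaggle API token
└── gcp-creds.json  # GCP service account key
```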
- Increase data volume: the pipeline currently uses a dataset with ~400K observations.
- Improve data modelling: the final output is currently only three dimension tables (loans, borrowers, and dates); more complex data models could be created.
- Automate documentation hosting: the current documentation was uploaded to Netlify manually.
- Create a more complex Looker dashboard.
I developed the current version of this project (as of 30 June 2024) over the span of 4 days and discussed each day's progress in my self-study blog: