A data pipeline to extract Reddit data from r/dataengineering.
The output is a Snowflake dashboard providing insight into the official Data Engineering subreddit.
- Extract data using Reddit API
- Load into AWS S3
- Copy into Snowflake
- Transform using dbt
- Create Snowflake Dashboard
- Orchestrate the above via a Python script
- Final output in the form of a Snowflake dashboard (may need permissions to view). Link here.
The code in this repo handles the steps up to the transformation via dbt. The dbt steps onward (e.g. building the Snowflake dashboard) were configured separately. To use this pipeline in production, the execution of dbt models could be brought into this workflow (orchestrated via the Python scripts) or executed on a schedule.
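If the dbt models were pulled into this workflow, one option would be to shell out to the dbt CLI from the orchestrator. A minimal sketch, assuming the dbt CLI is installed and the project lives in a hypothetical `dbt/` directory (not part of this repo's current setup):

```python
import subprocess

def run_dbt_models(project_dir: str = "dbt") -> None:
    """Run the dbt models as an extra pipeline stage (hypothetical project dir)."""
    # `dbt run` builds the models defined in the dbt project;
    # check=True raises CalledProcessError if dbt exits non-zero.
    subprocess.run(["dbt", "run", "--project-dir", project_dir], check=True)
```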
The Python script `reddit_pipeline.py` acts as the orchestrator for this pipeline, executing the following Python scripts for each stage of the pipeline:

- `extract_reddit_etl.py` to extract data from the Reddit API and save it as a CSV (sketched below)
- `upload_aws_s3.etl.py` to upload the CSV to S3
- `upload_to_snowflake_etl.py` to copy the data into Snowflake
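The extraction script isn't reproduced here, but as a rough sketch of that stage, assuming the PRAW library and a handful of illustrative fields (the real script may differ in client, fields, and limits):

```python
import csv
import praw

# Credentials would come from configuration.conf in practice.
reddit = praw.Reddit(
    client_id="xxxxxx",
    client_secret="xxxxxx",
    user_agent="reddit-pipeline-example",  # hypothetical user agent
)

FIELDS = ["id", "title", "score", "num_comments", "created_utc"]  # example fields only

with open("output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(FIELDS)
    # Pull the day's top posts from r/dataengineering (limit is illustrative).
    for post in reddit.subreddit("dataengineering").top(time_filter="day", limit=100):
        writer.writerow([post.id, post.title, post.score, post.num_comments, post.created_utc])
```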
This means the full pipeline can be run with a single command: `python reddit_pipeline.py <output-file-name>` (where `<output-file-name>` is the name of the CSV to store in S3).
Each step of the pipeline can also be run individually if needed with the same approach: `python <name-of-script.py> <output-file-name>`.
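The orchestration itself is just sequential execution of those scripts. A minimal sketch of the pattern (not the repo's exact code), using `subprocess` and the script names listed above:

```python
import subprocess
import sys

def main(output_name: str) -> None:
    """Run each stage in order, passing the same output file name through."""
    scripts = [
        "extract_reddit_etl.py",
        "upload_aws_s3.etl.py",
        "upload_to_snowflake_etl.py",
    ]
    for script in scripts:
        # check=True stops the pipeline if any stage fails.
        subprocess.run(["python", script, output_name], check=True)

if __name__ == "__main__":
    main(sys.argv[1])
```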
For a production environment, we'd certainly want a more robust orchestration tool, such as Airflow.
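For illustration only (no Airflow setup is included in this repo), a minimal DAG along those lines, assuming Airflow 2.4+ and using `BashOperator` to chain the same commands (`daily_output` is a placeholder file name, and the script paths would need to be absolute in practice):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical DAG: each task runs one stage of the existing pipeline.
with DAG(
    dag_id="reddit_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="python extract_reddit_etl.py daily_output")
    upload = BashOperator(task_id="upload_s3", bash_command="python upload_aws_s3.etl.py daily_output")
    load = BashOperator(task_id="load_snowflake", bash_command="python upload_to_snowflake_etl.py daily_output")

    extract >> upload >> load
```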
As a best practice, I used a `configuration.conf` file to store all sensitive info (credentials for AWS, Snowflake, Reddit, etc.). To recreate this pipeline, you should do the same, creating a `configuration.conf` file (and adding it to your `.gitignore` to avoid exposing credentials) in the form of:
```
[aws_config]
bucket_name = xxxxxx
account_id = xxxxxx
aws_region = xxxxxx
[reddit_config]
secret = xxxxxx
developer = xxxxxx
name = xxxxxx
client_id = xxxxxx
[snowflake_config]
username = xxxxxx
password = xxxxxx
account = xxxxxx
```
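The scripts can then read these values with Python's built-in `configparser`; the exact access pattern in this repo may differ, but the idea is:

```python
import configparser

# Load credentials from the (git-ignored) configuration file.
parser = configparser.ConfigParser()
parser.read("configuration.conf")

bucket_name = parser.get("aws_config", "bucket_name")
reddit_client_id = parser.get("reddit_config", "client_id")
snowflake_account = parser.get("snowflake_config", "account")
```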