Reddit ELT Pipeline

An ELT pipeline that pulls post data from Reddit's r/dataengineering subreddit, loads it to S3 and Snowflake, and transforms it with dbt.

The final output is a Snowflake dashboard providing insight into the r/dataengineering subreddit.

Architecture

  1. Extract data using Reddit API
  2. Load into AWS S3
  3. Copy into Snowflake
  4. Transform using dbt
  5. Create Snowflake Dashboard
  6. Orchestrate the above via a Python script
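
As an illustration of step 1, a minimal extract sketch might look like the following. It assumes the Reddit API is accessed via the PRAW library and that the post fields shown are the ones of interest; the actual extract_reddit_etl.py may differ.

```
# Minimal sketch of the extract step (assumes PRAW; not the repo's actual code).
import csv
import sys

import praw

def extract_posts(client_id: str, client_secret: str, user_agent: str, output_file: str) -> None:
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "title", "score", "num_comments", "created_utc"])
        # Pull the most recent posts from r/dataengineering
        for post in reddit.subreddit("dataengineering").new(limit=500):
            writer.writerow([post.id, post.title, post.score, post.num_comments, post.created_utc])

if __name__ == "__main__":
    # Credentials would come from configuration.conf (see Configuration below)
    extract_posts("xxxxxx", "xxxxxx", "reddit-api-pipeline by u/xxxxxx", sys.argv[1])
```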

Output

  • The final output is a Snowflake dashboard (permissions may be required to view it). Link here.

Pipeline Orchestration Details

The code in this repo handles the steps up to the transformation via dbt; the dbt models and everything downstream (e.g. building the Snowflake dashboard) were configured separately. To use this pipeline in production, the execution of the dbt models could be brought into this workflow (orchestrated via the Python scripts) or run on a schedule.

The Python script reddit_pipeline.py acts as the orchestrator for this pipeline, executing the following Python scripts for each stage of the pipeline:

  1. extract_reddit_etl.py to extract data from the Reddit API and save it as a CSV
  2. upload_aws_s3_etl.py to upload the CSV to S3
  3. upload_to_snowflake_etl.py to copy the data into Snowflake
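
Steps 2 and 3 might look roughly like the following sketch, which assumes boto3 for the S3 upload and snowflake-connector-python with a pre-configured external stage for the Snowflake copy (all assumptions; the actual scripts may differ, and the bucket, stage, and table names are placeholders).

```
# Hypothetical sketch of the load steps; names are placeholders, not the repo's actual code.
import boto3
import snowflake.connector

def upload_to_s3(file_name: str, bucket_name: str) -> None:
    # Upload the local CSV to S3 under the same key
    s3 = boto3.client("s3")
    s3.upload_file(file_name, bucket_name, file_name)

def copy_into_snowflake(file_name: str, username: str, password: str, account: str) -> None:
    # Assumes an external stage (e.g. @reddit_stage) already points at the S3 bucket
    conn = snowflake.connector.connect(user=username, password=password, account=account)
    try:
        cur = conn.cursor()
        cur.execute(
            f"COPY INTO reddit_posts FROM @reddit_stage/{file_name} "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
        cur.close()
    finally:
        conn.close()
```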

With this structure, the full pipeline can be run with a single command, `python reddit_pipeline.py <output-file-name>` (where `<output-file-name>` is the name of the CSV to store in S3). Each step of the pipeline can also be run individually if needed with the same approach: `python <name-of-script.py> <output-file-name>`.
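
A minimal sketch of what reddit_pipeline.py might look like as an orchestrator, assuming each stage script takes the output file name as its only argument (as described above):

```
# Hypothetical sketch of reddit_pipeline.py: run each stage in order and
# stop if any stage fails.
import subprocess
import sys

SCRIPTS = [
    "extract_reddit_etl.py",
    "upload_aws_s3_etl.py",
    "upload_to_snowflake_etl.py",
]

def main(output_file: str) -> None:
    for script in SCRIPTS:
        print(f"Running {script} {output_file}")
        # check=True raises CalledProcessError so a failed stage halts the pipeline
        subprocess.run([sys.executable, script, output_file], check=True)

if __name__ == "__main__":
    main(sys.argv[1])
```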

For a production environment, we'd want a more robust orchestration tool, such as Airflow.
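
For example, the three stages (plus a dbt run) might map onto an Airflow DAG along these lines; the task names, schedule, and use of BashOperator are illustrative only and not part of this repo.

```
# Hypothetical Airflow DAG sketch; not part of this repo.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="reddit_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    output_file = "reddit_{{ ds }}.csv"  # templated per-run file name
    extract = BashOperator(task_id="extract", bash_command=f"python extract_reddit_etl.py {output_file}")
    upload = BashOperator(task_id="upload_s3", bash_command=f"python upload_aws_s3_etl.py {output_file}")
    copy = BashOperator(task_id="copy_to_snowflake", bash_command=f"python upload_to_snowflake_etl.py {output_file}")
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run")

    extract >> upload >> copy >> dbt_run
```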

Configuration

As a best practice, I used a configuration.conf file to store all sensitive info (credentials for AWS, Snowflake, Reddit, etc.). To recreate this pipeline, create your own configuration.conf (and add it to your .gitignore so the credentials aren't committed) in the form of:

```
[aws_config]
bucket_name = xxxxxx
account_id = xxxxxx
aws_region = xxxxxx

[reddit_config]
secret = xxxxxx
developer = xxxxxx
name = xxxxxx
client_id = xxxxxx

[snowflake_config]
username = xxxxxx
password = xxxxxx
account = xxxxxx
```
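
The scripts can then read this file with Python's standard-library configparser, for example (a sketch; section and key names match the template above, but how the repo's scripts actually load credentials may differ):

```
# Sketch of reading configuration.conf with configparser
import configparser

parser = configparser.ConfigParser()
parser.read("configuration.conf")

BUCKET_NAME = parser.get("aws_config", "bucket_name")
AWS_REGION = parser.get("aws_config", "aws_region")

REDDIT_CLIENT_ID = parser.get("reddit_config", "client_id")
REDDIT_SECRET = parser.get("reddit_config", "secret")

SNOWFLAKE_USER = parser.get("snowflake_config", "username")
SNOWFLAKE_PASSWORD = parser.get("snowflake_config", "password")
SNOWFLAKE_ACCOUNT = parser.get("snowflake_config", "account")
```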
