Reddit ELT Pipeline

An ELT pipeline that pulls post data from Reddit's r/dataengineering subreddit, loads it to S3 and Snowflake, and transforms it with dbt.

The final output is a Snowflake dashboard providing insight into the r/dataengineering subreddit.

Architecture

  1. Extract data using Reddit API
  2. Load into AWS S3
  3. Copy into Snowflake
  4. Transform using dbt
  5. Create Snowflake Dashboard
  6. Orchestrate the above via a Python script
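
As an illustration of step 1, a minimal extract sketch might look like the following. It assumes the Reddit API is accessed via the PRAW library and that the post fields shown are the ones of interest; the actual extract_reddit_etl.py may differ.

```
# Minimal sketch of the extract step (assumes PRAW; not the repo's actual code).
import csv
import sys

import praw

def extract_posts(client_id: str, client_secret: str, user_agent: str, output_file: str) -> None:
    reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)
    with open(output_file, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "title", "score", "num_comments", "created_utc"])
        # Pull the most recent posts from r/dataengineering
        for post in reddit.subreddit("dataengineering").new(limit=500):
            writer.writerow([post.id, post.title, post.score, post.num_comments, post.created_utc])

if __name__ == "__main__":
    # Credentials would come from configuration.conf (see Configuration below)
    extract_posts("xxxxxx", "xxxxxx", "reddit-api-pipeline by u/xxxxxx", sys.argv[1])
```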

Output

  • The final output is a Snowflake dashboard (permissions may be required to view it). Link here.

Pipeline Orchestration Details

The code in this repo handles the steps up to the transformation via dbt; the dbt models and everything downstream (e.g. building the Snowflake dashboard) were configured separately. To use this pipeline in production, the execution of the dbt models could be brought into this workflow (orchestrated via the Python scripts) or run on a schedule.

The Python script reddit_pipeline.py acts as the orchestrator for this pipeline, executing the following Python scripts for each stage of the pipeline:

  1. extract_reddit_etl.py to extract data from the Reddit API and save it as a CSV
  2. upload_aws_s3_etl.py to upload the CSV to S3
  3. upload_to_snowflake_etl.py to copy the data into Snowflake
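
Steps 2 and 3 might look roughly like the following sketch, which assumes boto3 for the S3 upload and snowflake-connector-python with a pre-configured external stage for the Snowflake copy (all assumptions; the actual scripts may differ, and the bucket, stage, and table names are placeholders).

```
# Hypothetical sketch of the load steps; names are placeholders, not the repo's actual code.
import boto3
import snowflake.connector

def upload_to_s3(file_name: str, bucket_name: str) -> None:
    # Upload the local CSV to S3 under the same key
    s3 = boto3.client("s3")
    s3.upload_file(file_name, bucket_name, file_name)

def copy_into_snowflake(file_name: str, username: str, password: str, account: str) -> None:
    # Assumes an external stage (e.g. @reddit_stage) already points at the S3 bucket
    conn = snowflake.connector.connect(user=username, password=password, account=account)
    try:
        cur = conn.cursor()
        cur.execute(
            f"COPY INTO reddit_posts FROM @reddit_stage/{file_name} "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
        cur.close()
    finally:
        conn.close()
```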

With this structure, the full pipeline can be run with a single command, `python reddit_pipeline.py <output-file-name>` (where `<output-file-name>` is the name of the CSV to store in S3). Each step of the pipeline can also be run individually if needed with the same approach: `python <name-of-script.py> <output-file-name>`.
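
A minimal sketch of what reddit_pipeline.py might look like as an orchestrator, assuming each stage script takes the output file name as its only argument (as described above):

```
# Hypothetical sketch of reddit_pipeline.py: run each stage in order and
# stop if any stage fails.
import subprocess
import sys

SCRIPTS = [
    "extract_reddit_etl.py",
    "upload_aws_s3_etl.py",
    "upload_to_snowflake_etl.py",
]

def main(output_file: str) -> None:
    for script in SCRIPTS:
        print(f"Running {script} {output_file}")
        # check=True raises CalledProcessError so a failed stage halts the pipeline
        subprocess.run([sys.executable, script, output_file], check=True)

if __name__ == "__main__":
    main(sys.argv[1])
```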

For a production environment, we'd want a more robust orchestration tool, such as Airflow.
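
For example, the three stages (plus a dbt run) might map onto an Airflow DAG along these lines; the task names, schedule, and use of BashOperator are illustrative only and not part of this repo.

```
# Hypothetical Airflow DAG sketch; not part of this repo.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="reddit_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    output_file = "reddit_{{ ds }}.csv"  # templated per-run file name
    extract = BashOperator(task_id="extract", bash_command=f"python extract_reddit_etl.py {output_file}")
    upload = BashOperator(task_id="upload_s3", bash_command=f"python upload_aws_s3_etl.py {output_file}")
    copy = BashOperator(task_id="copy_to_snowflake", bash_command=f"python upload_to_snowflake_etl.py {output_file}")
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run")

    extract >> upload >> copy >> dbt_run
```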

Configuration

As a best practice, I used a configuration.conf file to store all sensitive info (credentials for AWS, Snowflake, Reddit, etc.). To recreate this pipeline, create your own configuration.conf (and add it to your .gitignore so the credentials aren't committed) in the form of:

```
[aws_config]
bucket_name = xxxxxx
account_id = xxxxxx
aws_region = xxxxxx

[reddit_config]
secret = xxxxxx
developer = xxxxxx
name = xxxxxx
client_id = xxxxxx

[snowflake_config]
username = xxxxxx
password = xxxxxx
account = xxxxxx
```
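
The scripts can then read this file with Python's standard-library configparser, for example (a sketch; section and key names match the template above, but how the repo's scripts actually load credentials may differ):

```
# Sketch of reading configuration.conf with configparser
import configparser

parser = configparser.ConfigParser()
parser.read("configuration.conf")

BUCKET_NAME = parser.get("aws_config", "bucket_name")
AWS_REGION = parser.get("aws_config", "aws_region")

REDDIT_CLIENT_ID = parser.get("reddit_config", "client_id")
REDDIT_SECRET = parser.get("reddit_config", "secret")

SNOWFLAKE_USER = parser.get("snowflake_config", "username")
SNOWFLAKE_PASSWORD = parser.get("snowflake_config", "password")
SNOWFLAKE_ACCOUNT = parser.get("snowflake_config", "account")
```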
