
Refactoring my Strava pipeline to use dlt, dagster, duckdb, and dbt-core


jairus-m/dagster-dlt


The Analytics Development Lifecycle within a Modern Data Engineering Framework

Utilizing dltHub, dbt, + dagster as a framework for developing data products with software engineering best practices.


While the short-term goal is to learn these tools, the larger goal is to understand what the full development and deployment cycle can look like for orchestrating a data platform and deploying custom pipelines. dbt already offers a strong process: local development, testing, versioning/branching, CI/CD, code review, separation of dev and prod, and project structure/cohesion. How can we apply that process to the entire data platform, and especially to the 10-20% of ingestion jobs that cannot be done in a managed tool like Airbyte or are best done with a custom solution?

Current Status


Dagster

  • Orchestrated ingest, transformation, and downstream dependencies (ML/Analytics) with Dagster - #2, #6
    • Developed in the dev environment and materialized in the dagster dev server
    • Configured resources / credentials in a root .env file
    • Current Dagster folder structure (dependencies managed by UV) - #15
      • One code location: dagster_proj/
        • Assets: dagster_proj/assets/
        • Resources: dagster_proj/resources/__init__.py
        • Jobs: dagster_proj/jobs/__init__.py
        • Schedules: dagster_proj/schedules/__init__.py
        • Utils: dagster_proj/utils/__init__.py
        • Definitions: dagster_proj/__init__.py
      • The structure is experimental and based on the DagsterU courses

dltHub

  • Built a dltHub EL pipeline via the RESTAPIConfig class in dagster_proj/assets/dlt/activities.py
    • Declaratively extracts my raw activity data from Strava's REST API and loads it into DuckDB
    • Created a custom configurable resource for Strava API - #5, #11
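
To illustrate what a declarative REST API configuration can look like, here is a minimal sketch written as a plain dictionary in the shape that dlt's RESTAPIConfig expects. The base URL and the athlete/activities path come from Strava's public API. The paginator settings, the merge write disposition, the primary key, and the build_strava_config helper are assumptions for illustration and are not taken from dagster_proj/assets/dlt/activities.py.

```python
from typing import Any, Dict


def build_strava_config(access_token: str, per_page: int = 100) -> Dict[str, Any]:
    """Return a dict shaped like dlt's RESTAPIConfig (hypothetical helper)."""
    return {
        "client": {
            "base_url": "https://www.strava.com/api/v3/",
            # Bearer auth with a short-lived access token (assumed auth style).
            "auth": {"type": "bearer", "token": access_token},
            # Assumed paging: Strava pages start at 1 and report no total count.
            "paginator": {
                "type": "page_number",
                "base_page": 1,
                "page_param": "page",
                "total_path": None,
            },
        },
        "resources": [
            {
                "name": "activities",
                "endpoint": {
                    "path": "athlete/activities",
                    "params": {"per_page": per_page},
                },
                # Assumed: upsert on the activity id so reruns do not duplicate rows.
                "primary_key": "id",
                "write_disposition": "merge",
            }
        ],
    }


config = build_strava_config("<access-token>")
print(config["resources"][0]["endpoint"]["path"])
```

In a project with dlt installed, a dictionary like this would typically be passed to rest_api_source from dlt.sources.rest_api and run through a pipeline with DuckDB as the destination. The actual configuration in this repo may differ.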

dbt-core

  • Built a dbt-core project to transform the activities data in analytics_dbt/models

Sklearn ML Pipeline

  • Created an Sklearn ML pipeline to predict energy expenditure for a given cycling activity
    • WIP but the general flow of preprocessing, building the ML model, training, testing/evaluation, and prediction can be found in dagster_proj/assets/ml_analytics/energy_prediction.py
    • This is a downstream dependency of a dbt asset materialized in DuckDB

Analytics

  • Created a Plotly analytics dashboard and a visualization of the ML results - #14
    • In dagster_proj/assets/ml_analytics/weekly_totals.py

Deployment Status

  • Deployed this project to Dagster+
    • CI/CD with branch deployments for every PR
  • Separated execution environments - #13
    • dev (DuckDB)
    • branch (Snowflake)
    • prod (Snowflake)
  • Configured pre-commits / CI checks and added unit tests - #16
    • Added ruff Python linter - #8
    • Astral uv for Python dependency management - #1
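
The environment split listed above (dev on DuckDB, branch and prod on Snowflake) can be sketched as a small lookup keyed on the DAGSTER_ENVIRONMENT variable from the .env file. The resolve_warehouse function, the WAREHOUSE_BY_ENV mapping, and the fallback to dev are hypothetical names and choices for illustration, not the project's actual resource wiring.

```python
import os
from typing import Optional

# Mapping taken from the deployment list above; the names are hypothetical.
WAREHOUSE_BY_ENV = {
    "dev": "duckdb",
    "branch": "snowflake",
    "prod": "snowflake",
}


def resolve_warehouse(env: Optional[str] = None) -> str:
    """Pick the warehouse for an environment, defaulting to DAGSTER_ENVIRONMENT."""
    # Assumption: fall back to "dev" when the variable is unset.
    env = env or os.getenv("DAGSTER_ENVIRONMENT", "dev")
    if env not in WAREHOUSE_BY_ENV:
        raise ValueError(f"Unknown environment: {env!r}")
    return WAREHOUSE_BY_ENV[env]


print(resolve_warehouse("branch"))
```

Failing loudly on an unknown environment name is a deliberate choice here: a typo in DAGSTER_ENVIRONMENT should not silently route a run to the wrong warehouse.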

TODO:

  • Beef up the ML pipeline with dagster-mlflow for experiment tracking, model versioning, better model observability, etc
  • Add new Strava endpoints
  • Implement partitions/backfilling with dlt/Dagster

Getting Started:

For local development only:

  1. Clone this repo locally
  2. Create a .env file at the root of the directory:
# these are the config values for local dev and will change in branch/prod deployment
DBT_TARGET=dev
DAGSTER_ENVIRONMENT=dev
DUCKDB_DATABASE=data/dev/strava.duckdb

#strava
CLIENT_ID= 
CLIENT_SECRET=
REFRESH_TOKEN=
  3. Install uv and run uv sync
  4. Install the Python package in editable (development) mode via uv pip install -e ".[dev]"
  5. Start the local Dagster webserver and daemon via dagster dev
  6. Materialize the pipeline!
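
Before starting Dagster, it can help to confirm that the .env file from the setup steps above has every value filled in. The following is a minimal sketch using only the standard library. The parse_env and missing_keys helpers are hypothetical and are not part of this repo, and the parser handles only simple KEY=VALUE lines.

```python
from typing import Dict, List

# Keys listed in the .env example above.
REQUIRED_KEYS = (
    "DBT_TARGET",
    "DAGSTER_ENVIRONMENT",
    "DUCKDB_DATABASE",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "REFRESH_TOKEN",
)


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


sample = """
DBT_TARGET=dev
DAGSTER_ENVIRONMENT=dev
DUCKDB_DATABASE=data/dev/strava.duckdb
#strava
CLIENT_ID=
CLIENT_SECRET=
REFRESH_TOKEN=
"""
print(missing_keys(parse_env(sample)))
```

Run against the unfilled template, the check reports the three Strava credentials as missing, which is the state of the file right after copying the example.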

Additional Notes:

  • The refresh_token in the Strava UI produces an access_token that is limited in scope. Please follow these Strava Dev Docs to generate the proper refresh_token which will then produce an access_token with the proper scopes.
  • If you want to run the dbt project locally, outside of dagster, you need to add a DBT_PROFILES_DIR environment variable to the .env file and export it
    • For example, my local env var is: DBT_PROFILES_DIR=/Users/jairusmartinez/Desktop/dlt-strava/analytics_dbt
    • Yours will be: DBT_PROFILES_DIR=/PATH_TO_YOUR_CLONED_REPO_DIR/analytics_dbt
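
For the refresh_token note above, exchanging a refresh token for an access token is a POST to Strava's OAuth token endpoint with grant_type=refresh_token. The sketch below only builds the request and does not send it. The build_refresh_request helper and the placeholder credentials are hypothetical, and the project's own Strava resource may implement this differently.

```python
import urllib.parse
import urllib.request

TOKEN_URL = "https://www.strava.com/oauth/token"


def build_refresh_request(
    client_id: str, client_secret: str, refresh_token: str
) -> urllib.request.Request:
    """Build (but do not send) the form-encoded token refresh request."""
    payload = urllib.parse.urlencode(
        {
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        }
    ).encode("utf-8")
    return urllib.request.Request(TOKEN_URL, data=payload, method="POST")


# Placeholder values; real ones come from CLIENT_ID, CLIENT_SECRET, REFRESH_TOKEN.
request = build_refresh_request("12345", "my-secret", "my-refresh-token")
print(request.get_method(), request.full_url)
```

Sending the request with urllib.request.urlopen(request) should return a JSON body containing the access_token. The scopes on that token follow from the scopes granted when the refresh token was first authorized, which is why the refresh token from the Strava UI is not sufficient.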
