Week 2: Data Ingestion

Data Lake (GCS)

  • What is a Data Lake
  • ELT vs. ETL
  • Alternatives to these components (S3/HDFS, Redshift, Snowflake, etc.)
  • Video
  • Slides

Introduction to workflow orchestration

  • What is an Orchestration Pipeline?
  • What is a DAG?
  • Video
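
To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library rather than Airflow itself. The task names are assumptions borrowed from the ingestion steps listed later in this week; an orchestrator does much more (scheduling, retries, logging), but the core idea is the same: tasks run in an order that respects their dependencies, and a cycle is an error.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its set lists the tasks that must finish first.
# Task names are illustrative, not taken from the course code.
dependencies = {
    "download": set(),
    "convert_to_parquet": {"download"},
    "upload_to_gcs": {"convert_to_parquet"},
    "create_external_table": {"upload_to_gcs"},
}


def execution_order(deps):
    """Return one valid run order; raises CycleError if deps are cyclic."""
    return list(TopologicalSorter(deps).static_order())


def is_dag(deps):
    """True when the dependency graph has no cycles."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False


order = execution_order(dependencies)
print(order)
```

Because this example is a simple chain, there is only one valid order. With branching dependencies, several orders can be valid, and an orchestrator is free to run independent tasks in parallel.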

Setting up Airflow locally

If you want to run a lighter version of Airflow with fewer services, check this video. It's optional.

Ingesting data to GCP with Airflow

  • Extraction: Download and unpack the data
  • Pre-processing: Convert this raw data to parquet
  • Upload the parquet files to GCS
  • Create an external table in BigQuery
  • Video
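
The four steps above can be sketched as a small planning helper. This is not the course's DAG: it only derives the file names, the GCS destination, and the BigQuery statement for one source file, using the standard library. The URL, bucket, dataset, and table names, the `raw/` prefix, and the `plan_ingestion` helper are all hypothetical; the download, conversion, and upload themselves would be done by the tasks of an Airflow DAG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestionPlan:
    local_raw: str           # step 1: where the downloaded file lands
    local_parquet: str       # step 2: output of the parquet conversion
    gcs_uri: str             # step 3: destination object in the data lake
    external_table_ddl: str  # step 4: statement to run in BigQuery


def plan_ingestion(source_url, bucket, dataset, table, workdir="/tmp"):
    """Derive the artifacts of each ingestion step for one source file."""
    file_name = source_url.rsplit("/", 1)[-1]

    # Drop the compression suffix and the raw-format suffix, if present.
    stem = file_name
    for suffix in (".gz", ".csv"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]

    parquet_name = f"{stem}.parquet"
    gcs_uri = f"gs://{bucket}/raw/{parquet_name}"  # "raw/" prefix is assumed

    # An external table reads the parquet file in place from GCS.
    ddl = (
        f"CREATE OR REPLACE EXTERNAL TABLE `{dataset}.{table}`\n"
        f"OPTIONS (format = 'PARQUET', uris = ['{gcs_uri}']);"
    )
    return IngestionPlan(
        local_raw=f"{workdir}/{file_name}",
        local_parquet=f"{workdir}/{parquet_name}",
        gcs_uri=gcs_uri,
        external_table_ddl=ddl,
    )


plan = plan_ingestion(
    "https://example.com/data/trips_2021-01.csv.gz",  # placeholder URL
    bucket="my-bucket",
    dataset="trips_data",
    table="external_trips",
)
print(plan.external_table_ddl)
```

Keeping the naming logic in one pure function makes it easy to test without touching the network, and each task in a DAG can then read the path it needs from the same plan.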

Ingesting data to Local Postgres with Airflow

  • Converting the Postgres ingestion script into an Airflow DAG
  • Video

Transfer service (AWS -> GCP)

Moving files from AWS to GCP.

You will need an AWS account for this. This section is optional.

Homework

In the homework, you'll create a few DAGs for processing the NY Taxi data for 2019-2021.

More information here

Community notes

Did you take notes? You can share them here.