mcastillomolina/weather-history-pipeline

Weather historical data ETL

This guide explains how to set up and run the Weather ETL pipeline using Docker Compose.

Structure

The entry point is pipeline.py. The code for the extraction, transformation and loading of data lives in the etl/ directory.

At a high level, the structure of this repository is:

weather-history-pipeline/
│
├── etl/
│   ├── extract_step.py
│   ├── transform_step.py
│   └── load_step.py
├── metrics_sql_queries/
│   ├── avg_temperature.sql
│   └── max_wind_speed_change.sql
│
├── pipeline.py
├── README.md
└── docker-files
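
As a rough illustration of how an entry point like pipeline.py could chain the three steps in etl/, here is a minimal sketch. The function name run_pipeline and the callable signatures are assumptions made for this example; they are not taken from the repository's code.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("weather-pipeline")

Record = Dict[str, object]


def run_pipeline(
    extract: Callable[[], List[Record]],
    transform: Callable[[List[Record]], List[Record]],
    load: Callable[[List[Record]], None],
) -> int:
    """Run extract -> transform -> load once and return the number of rows loaded.

    The three callables are hypothetical stand-ins for whatever the modules
    in etl/ actually expose.
    """
    raw = extract()
    logger.info("extracted %d raw observations", len(raw))
    rows = transform(raw)
    logger.info("kept %d observations after transformation", len(rows))
    load(rows)
    return len(rows)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    stored: List[Record] = []
    loaded = run_pipeline(
        extract=lambda: [{"temperature": 21.5}, {"temperature": None}],
        transform=lambda raw: [r for r in raw if r["temperature"] is not None],
        load=stored.extend,
    )
    print(loaded, stored)
```

Passing the steps in as callables keeps the orchestration testable without a database or network access; the real pipeline may well import the step modules directly instead.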

Prerequisites

Docker: Ensure Docker is installed on your machine.

Docker Compose: Ensure Docker Compose is installed.

Docker Desktop is highly recommended because it is quite intuitive, but it is not a requirement.

Running the Pipeline

Open a terminal in the root directory of this project, weather-history-pipeline/.

Build the Docker Images:

docker-compose build

Start the Docker Containers:

docker-compose up

Et voilà! The pipeline is running!

Check the Logs:

docker logs weather-pipeline

These commands spin up the database container (mysql_db) and the pipeline container (weather-pipeline). The pipeline container executes the pipeline once and then stops. To run it again, restart the pipeline container. You can re-run it as many times as you want!
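
If you prefer to script the re-run instead of restarting the container by hand, something along these lines could work. This is a sketch, assuming the Docker CLI is on your PATH and the container keeps the name weather-pipeline; the helper names rerun_command and rerun_pipeline are made up for this example.

```python
import subprocess
from typing import List


def rerun_command(container: str = "weather-pipeline") -> List[str]:
    """Build the Docker CLI call that starts a stopped container and attaches to its output."""
    return ["docker", "start", "--attach", container]


def rerun_pipeline(container: str = "weather-pipeline") -> int:
    """Start the container again and return the exit code reported by the Docker CLI."""
    return subprocess.run(rerun_command(container), check=False).returncode


if __name__ == "__main__":
    raise SystemExit(rerun_pipeline())
```

The database container (mysql_db) has to be running already, since this only restarts the pipeline container.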

[Screenshot: example logs of multiple runs]

If you are using Docker Desktop, you can select the pipeline container and hit the start button to re-run it.

Results

The pipeline creates and populates the table weather-historical in the database weather-forecast.

The results below are from the last run of the queries, executed on the morning of Monday, October 7th.

Average observed temperature for last week (Mon-Sun).

Result = 26.899083
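
To make the "last week (Mon-Sun)" window concrete, the sketch below computes the bounds of the most recent complete Monday-to-Sunday week for a given date. It illustrates one reading of the window, not the logic in avg_temperature.sql; the function name is hypothetical, and the year 2024 in the usage example is an assumption, since the README gives only the weekday and the date.

```python
from datetime import date, timedelta
from typing import Tuple


def last_week_bounds(today: date) -> Tuple[date, date]:
    """Return (monday, sunday) of the last complete Mon-Sun week before today's week."""
    # date.weekday() is 0 for Monday, so this steps back to the current week's Monday.
    this_monday = today - timedelta(days=today.weekday())
    last_monday = this_monday - timedelta(days=7)
    last_sunday = this_monday - timedelta(days=1)
    return last_monday, last_sunday


if __name__ == "__main__":
    # Assumed year: October 7th falls on a Monday in 2024.
    start, end = last_week_bounds(date(2024, 10, 7))
    print(start.isoformat(), end.isoformat())
```

Under that assumption, a run on Monday, October 7th would average observations from September 30th through October 6th.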

Maximum wind speed change between two consecutive observations in the last 7 days.

Result = 8.1
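
The same metric can be expressed outside SQL as the largest difference between neighbouring readings. The sketch below assumes the observations are already sorted by time and that "change" means the absolute difference. Both are assumptions, and max_wind_speed_change.sql may define it differently. The sample values are invented for illustration.

```python
from typing import Optional, Sequence


def max_consecutive_change(speeds: Sequence[float]) -> Optional[float]:
    """Largest absolute difference between neighbouring values.

    Returns None when there are fewer than two observations.
    """
    return max((abs(b - a) for a, b in zip(speeds, speeds[1:])), default=None)


if __name__ == "__main__":
    # Made-up wind speeds, ordered oldest to newest.
    print(max_consecutive_change([3.0, 5.0, 4.0, 10.0]))
```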

Queries

The queries required to find the average observed temperature and maximum wind speed change can be found in the metrics_sql_queries/ directory.

Avg temperature: metrics_sql_queries/avg_temperature.sql

Max wind speed change: metrics_sql_queries/max_wind_speed_change.sql

Disclaimers

The station ID picked is 000PG, but this is a parameter of the pipeline and can be changed in the environment variables of the docker-compose.yml file. The environment variable is - STATION_ID=000PG

The creation of the table is included in the ETL pipeline, but this logic could also be decoupled from the pipeline if required. The table could instead be created when the database is created, using an init_db script. That would separate concerns and keep schema creation logic out of the pipeline.

Logs are set to INFO level, but this is a parameter as well and can be set in the docker-compose.yml file. The environment variable is - LOG_LEVEL=INFO
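
For illustration, this is roughly how a Python entry point could read the two variables that docker-compose.yml sets. It is a sketch: the function name load_settings and the choice to fall back to 000PG and INFO when a variable is unset are assumptions, not a description of what pipeline.py does.

```python
import logging
import os
from typing import Mapping, Optional, Tuple

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Tuple[str, int]:
    """Read STATION_ID and LOG_LEVEL, returning (station_id, numeric log level)."""
    env = os.environ if env is None else env
    # Fallback values are assumed defaults for this sketch.
    station_id = env.get("STATION_ID", "000PG")
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    if level_name not in _LEVELS:
        raise ValueError(f"unsupported LOG_LEVEL: {level_name!r}")
    return station_id, _LEVELS[level_name]


if __name__ == "__main__":
    station, level = load_settings()
    logging.basicConfig(level=level)
    logging.getLogger("weather-pipeline").info("using station %s", station)
```

Rejecting unknown level names up front makes a typo in docker-compose.yml fail loudly instead of silently falling back to a default.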
