Skip to content

Latest commit

 

History

History
124 lines (87 loc) · 11 KB

README.md

File metadata and controls

124 lines (87 loc) · 11 KB

Weather Data

Table of Content

Overview

Weather has an undeniable effect on the transportation network, influencing people's behaviour, impacting capacity, and increasing the likelihood of collisions. We import two types of weather data from Environment Canada to our database, which includes historical data for both City of Toronto and Toronto Pearson Airport, and also forecast data for City of Toronto. These data are stored in the weather schema, maintained by weather_admins, and accessibile for all bdit_humans to view.

Historical Data

We import the historical data for two locations on a daily basis, City of Toronto (Station_id = 31688) and Toronto Pearson Airport (Station_id = 51459). Weather data for historical weather tables are pulled from the Government of Canada's "Historical Data" weather page, found here

Please note that the total_rain and total_snow fields for the city weather station are always NULL. All other fields can be accessed in the table weather.historical_daily_city. As an approximation, we import Toronto Pearson Airport's total rainfall and snowfall in a seperate table weather.historical_daily_airport.

Table Structure

Column name Description example
dt Date 2023-05-01
temp_max The highest temperature in degrees Celsius (°C) observed 10.7
temp_min The lowest temperature in degrees Celsius (°C) observed 6
mean_temp The mean temperature in degrees Celsius (°C) is defined as the average of the maximum and minimum temperature 8.4
total_rain The total rainfall, or amount of all liquid precipitation in millimetres (mm) such as rain, drizzle, freezing rain, and hail 1.3
total_snow The total snowfall, or amount of frozen (solid) precipitation in centimetres (cm), such as snow and ice pellets 0
total_precip The sum of the total rainfall and the water equivalent of the total snowfall in millimetres (mm) 1.3

Forecast Data

Forecast Data is inserted into weather.prediction_daily on a daily basis from Environment Canada's Local Forecast website. Location is set as Toronto, ON with s0000458 station_id. Every day we pull forecast data for tomorrow and 4 days in the future (e.g. If today is Monday, we will pull in data for Tuesday to Saturday). Since past forecast data is not being stored in Environment Canada, this task cannot be backfilled. Thus, we pull 5 future days to limit the chance of not having any data for days when the pipeline fails.

Table Structure

Column name Description example
dt Date 05/04/2023
temp_max Forecasted highest temperature in degrees Celsius (°C) of the day 16
temp_min Forecasted lowest temperature in degrees Celsius (°C) of the day 6
precip_prob_day Forecasted chance of precipitation before noon in percentage (%) 30
precip_proc_night Forecasted chance of precipitation after noon in percentage (%) 0
text_summary_day Detailed forecast summary before noon A mix of sun and cloud with 30 percent chance of showers. High 16.
text_summary_night Detailed forecast summary after noon Cloudy periods. Low 6.
date_pulled Day of when the data is pulled 05/03/2023

Data Pipeline - weather_pull DAG

The data pipeline runs at 6:30 AM daily on airflow with the DAG weather_pull using the bigdata weather_bot for inserts.
Note: Around 2024-06-03, the weather DAG was renamed from pull_weather with minor changes.

weather_pull DAG graph

  • no_backfill: Uses LatestOnlyOperator that disable downstream tasks for backfill. This is set as an upstream for pull_prediction as forecast data cannot be backfilled.

  • wait_till_1030am: Uses TimeSensor operator to delay the excution of this task to 1030am which is when forecast becomes available.

  • pull_prediction: Runs script prediction_import.py which uses package env_canada to pull City of Toronto's forecast data of the next 5 days and insert into weather.prediction_daily.

  • pull_historical_city: Runs script historical_scrape.py which uses package Beautiful Soup to parse the HTML content returned from request for City of toronto. The day before execution date's data will be pulled and inserted into weather.historical_daily_city.

  • pull_historical_airport: Runs script historical_scrape.py which uses package Beautiful Soup to parse the HTML content returned from request for Toronto Pearson Airport. The day before execution date's data will be pulled and inserted into weather.historical_daily_airport.

  • check_null_historical: Runs a SQL check to see if a row of all nulls (or no data) was inserted into either of the historical tables. Cannot be turned into a constraint because there are some valid days where no data is avaialable.

Backfill Data

As mentioned before, only historical data can be backfilled. Other than backfilling from airflow's cli, you can also use the script backfill_historical.py for when you backfill a lot of dates (since you will have to backfill day by day in airflow). Note: Make sure you have a db.cfg in your home folder, and you have weather_admins permissions.

Input Params:

  1. start_dt: The start date of the date range you want to backfill historical data, inclusive.
  2. end_dt: The end date of the date range you want to backfill historical data, exclusive.
  3. station_id: The station_id to backfill, e.g. City of Toronto is station_id = 31688 and Pearson Airport is station_id = 51459

For example, if you want to backfill the entire month of 2022 March for Pearson Airport, you would run:

python3 backfill_historical -s 2022-03-01 -e 2022-04-01 -i 51459

Manually Accessing Data

To access weather data outside of the stations & dates provided in the database, use the following links:

Access Environment Canada's Historical Data Website. Then, proceed with the following stpes:

  1. In the search boxes, enter the desired parameters, including station and date range:

historical1

  1. Select the desired station from the list (there can be quite a few, so try limiting the search 'year' in the previous step. Then, select 'Daily' from the data dropdown (hourly and monthly data can also be accessed) , and desired month/year: historical2

  2. View data (same format as in the weather schema): historical3

Uploading Manually Accessed Historical Data into the Database

If the backfill process is not working, or cannot find the desired data, historical data can also be imported into postgres, using the csv_import.py script.

Please Note that this script is only to be used as a last resort. It does not contain any backstops for duplicate/NULL entries, broken data, etc. Raw data and duplicates will have to be checked manually.

  1. Visit the desired weather station page, containing historical data, using the steps above. Then, you will see this prompt:

historical4

From here, you can select the format to download an entire year's worth of weather data for the current station. For our purposes, we will download in the CSV format.
  1. Open the script csv_import.py in a text editor. Here, on Line 24, you can change the directoryto match the filepath of the CSV just downloaded. If uploading several weather CSV files, you can create a new folder containg those files - the script will process all CSV files. csvimport2

  2. On Line 30, the table to insert into can be changed. csvimport1

  3. Navigate to the script directory in a new terminal and run the script using python csv_import.py