Shipping data

This repository contains a public collection of shipping data from South America to The Netherlands and Belgium, and the tools for obtaining this data.

Repository structure

The repository is structured as follows:

  • webscrapers contains the Python file for each webscraper.
  • docs contains the specific documentation for each webscraper.
  • notebooks contains (parts of the) webscrapers in Notebook format. This was mainly used for developing the webscrapers.
  • pickles contains Pickle files for all data in Python (mostly Pandas DataFrames).
  • data contains all other data (mostly CSVs).
  • scripts contains helper Python scripts, for example to merge data.
  • utils contains utility files and notebooks, for example with UN-LOCODES for ports.
  • .github/workflows contains GitHub Actions workflow files for automation.


The webscrapers

This section gives a brief overview of each of the current webscrapers.

Routescanner

The Routescanner webscraper scans planned container connections from https://www.routescanner.com/voyages. The webscraper is available at webscrapers/routescanner_automated.py and runs each morning thanks to the routescanner_daily.yml GitHub Actions workflow. It's currently configured to scrape all connections between 26 departure ports in South America and Vietnam and 5 arrival ports in The Netherlands and Belgium. For these 130 port combinations, around 1400 connections are found on each run, although not all of them are unique.
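As a rough illustration of how the 130 combinations come about (the UN/LOCODEs below are example values only; the actual port lists live in webscrapers/routescanner_automated.py):

```python
from itertools import product

# Illustrative UN/LOCODEs only; the real lists are defined in the scraper.
departure_ports = ["BRSSZ", "ARBUE", "CLVAP", "VNSGN"]          # ... 26 in total
arrival_ports = ["NLRTM", "BEANR", "BEZEE", "NLAMS", "NLVLI"]   # 5 in total

# Every departure/arrival pair is scraped, i.e. 26 x 5 = 130 combinations.
port_combinations = list(product(departure_ports, arrival_ports))
print(len(port_combinations))  # 20 for this shortened example list
```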

Data from each daily run is saved as CSV in data/routescanner_daily and as a pickled Pandas DataFrame in pickles/routescanner_daily. The script scripts/combine_routescanner.py can be used to merge all these DataFrames; the resulting combined data can be found in pickles/routescanner_connections_combined.pickle and data/routescanner_connections_combined.csv.
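As a minimal sketch of what such a merge amounts to (the actual logic lives in scripts/combine_routescanner.py; the file extension and exact steps here are assumptions):

```python
from pathlib import Path
import pandas as pd

# Read every daily pickle (each one holds a Pandas DataFrame) and
# concatenate them, dropping duplicate connections that were found
# on more than one day. The "*.pickle" pattern is an assumption.
daily_frames = [pd.read_pickle(p) for p in sorted(Path("pickles/routescanner_daily").glob("*.pickle"))]
combined = pd.concat(daily_frames, ignore_index=True).drop_duplicates()

combined.to_pickle("pickles/routescanner_connections_combined.pickle")
combined.to_csv("data/routescanner_connections_combined.csv", index=False)
```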

A Jupyter Notebook is also available at notebooks/scraping_routescanner.ipynb.

The ports are listed as UN/LOCODEs, for example NLRTM for the Port of Rotterdam in The Netherlands. The utils folder contains two CSV files with all UN/LOCODEs and a Jupyter Notebook to load them into a Pandas DataFrame.
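A small sketch of such a lookup (the file name and column names below are assumptions; see the CSVs and the notebook in utils/ for the actual layout):

```python
import pandas as pd

# keep_default_na=False prevents the country code "NA" (Namibia)
# from being read as a missing value.
locodes = pd.read_csv("utils/unlocodes.csv", keep_default_na=False)  # hypothetical file name

# Build a "NLRTM"-style code from country + location and look up Rotterdam.
locodes["unlocode"] = locodes["country"] + locodes["location"]
print(locodes.loc[locodes["unlocode"] == "NLRTM", "name"])
```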

For the full documentation, see docs/routescanner.md, as well as the inline comments.

MSC

An initial version of the MSC webscraper is available at webscrapers/msc_automated.py. It scrapes the same 130 port combinations as the Routescanner scraper, taken from the MSC schedule (https://www.msc.com/en/search-a-schedule).

Initial data is available in CSV and Pickle form at data/msc_daily and pickles/msc_daily. An experimental notebook can be found at notebooks/scraping_msc.ipynb.

A script to combine the data from multiple days is available at scripts/combine_msc.py. The combined data itself is available as CSV and Pickle at data/msc_connections_combined.csv and pickles/msc_connections_combined.pickle.

The scraper is still in a prototype state, with a number of open bugs and unwanted behaviour, so the collected data will be incomplete.

Maersk

The Maersk scraper uses the point-to-point function on https://www.maersk.com/schedules/pointToPoint. The webscraper is available at webscrapers/maersk.py. Similarly to the MSC and Routescanner scrapers, it scrapes 130 port combinations. Initial data is available in CSV and Pickle form at data/maersk_daily and pickles/maersk_daily.

The Maersk site sometimes doesn't return a route between an origin and a destination, even though the route actually exists. Errors caused by this issue are handled: if one daily run doesn't find a route, a later daily run usually will. The reason is that not all routes are sailed daily, so they remain listed on the site for multiple days and can still be scraped on another day.

MSC v2 scraper

The MSC v2 scraper uses a different method: API scraping. It makes a call directly to the internal API that MSC itself uses to load the data on its website. The biggest advantage is that this has far less overhead (no web pages need to be loaded) and is therefore faster. The data is also returned in a more structured way.
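As a minimal sketch of the idea (the endpoint URL and parameters below are hypothetical placeholders, not MSC's real internal API; the actual call lives in the v2 scraper):

```python
import pandas as pd
import requests

# Hypothetical endpoint and query parameters for illustration only;
# the real internal API URL and payload are defined in the MSC v2 scraper.
API_URL = "https://www.msc.com/api/schedules/search"  # placeholder
params = {"from": "BRSSZ", "to": "NLRTM"}

response = requests.get(API_URL, params=params, timeout=30)
response.raise_for_status()

# The JSON response can be flattened straight into a DataFrame,
# which is what makes API scraping more structured than HTML scraping.
connections = pd.json_normalize(response.json())
print(connections.head())
```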

Daily scraping

GitHub Actions is used to run the scraping scripts each day. The scraped data is committed directly to the data_staging branch. From that branch, a pull request to the main branch can be opened (using "Squash and merge" to prevent a huge list of commits), after which the scripts to merge the DataFrames can be run.

Other scripts

License

For now, all code in this repository is licensed under the GPL-3.0 license. Later in the project this may change to a more permissive license.

See the LICENSE file.