Skip to content

Fetch all accredited diplomats in Germany - An open data project with Airflow, Docker, BigQuery, dbt and Terraform on GCP

Notifications You must be signed in to change notification settings

LoHertel/diplomats-in-germany

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Diplomats in Germany

The German ministry of foreign affairs (Auswärtiges Amt) regularly updates a list of all foreign diplomatic missions (embassies) and their diplomats, which the Federal Republic of Germany accredits.

The list in PDF format could be retrieved from the following website. Click on "Diplomatische und andere Vertretungen in der Bundesrepublik Deutschland" to download the newest version of the PDF file.

Problem

It is difficult to study changes in the list over time, mainly because of two reasons:

  • Previous versions of the list are not made available, even if you have saved the specific link to the respective PDF file. Therefore the files needed to be downloaded regularly.
  • Even if you have downloaded multiple versions, it is hard to compare the data in the PDF format and, therefore, it needs to be extracted for easier comparison. Have a look at the PDF for a better understanding.

Motivation

This project could help overcome the mentioned problems by regularly checking if a new version of the PDF report was published. The data is extracted and saved in a database for each new version of the PDF report. A raw data vault DWH model is used to track changes between the versions.

This data model could help answer questions such as:

  • How many diplomats were accredited in Germany at a given time?
  • What is the average time diplomats spend at a diplomatic mission in Germany?
  • What is the gender ratio for a specific diplomatic mission?

Infrastructure

The following tools and technologies are used:

infrastructure diagram

Dashboard

Link to the dashboard: https://datastudio.google.com/reporting/c67883ee-7b3a-481f-a28f-e001b0c3c743

Dashboard

Run the Walkthrough Tutorial

Pre-Requisite: Create a free GCP account if you do not have one. Note that GCP also offers $300 free credits for 90 days.

You can follow these interactive instructions as a guide through the setup process.

Open in Cloud Shell

After clicking the button Open in Google Cloud Shell, the Google Cloud Shell Editor will open and ask for your authorization to clone this repository. The interactive walkthrough for the project opens on the right side of the Cloud Shell Editor. Note: There might be an error message showing that third-party cookies are needed. You can allow third-party cookies for the Cloud Shell Editor. See here for more information.

Note: If you have closed the walkthrough and want to reopen it, run the following command in the cloudshell terminal window:

cd ~/cloudshell_open/diplomats-in-germany && cloudshell launch-tutorial project-walkthrough.md

If you don't want to use Cloud Shell Editor, you could go through the instructions manually: Open Instructions

Further Improvement

  • Setup
    • create a shell script for VM setup and integrate it in Terraform (also make SSH key creation more robust or use Google service account auth method to connect to VM)
    • create a project-specific IAM role for Terraform and assign it to the service * account instead of granting the permissions directly
    • limit permissions to the minimal necessary set
    • create an additional service account for running the data pipeline (currently, the Terraform user with a lot of permissions is used)
    • integrate Cloud Build for CI / CD
  • Airflow
    • create a DAG for the initial load of the old PDF reports
    • find out why XCOM doesn't work with task groups in Airflow
    • use Cloud Composer instead of Compute Engine VM
  • dbt
    • add business data vault
  • Data Studio
    • remove calculated fields from Data Studio report and precalculate this data in the datamart (otherwise we can not use the report as template and change the data sources)
    • compare differences between two report dates
  • spell checking

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course completely free. If you want to upskill on Data Engineering technologies, please check out their self-paced course. :)

About

Fetch all accredited diplomats in Germany - An open data project with Airflow, Docker, BigQuery, dbt and Terraform on GCP

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published