This workshop at PyBCN 2022 is a detailed guide to help you navigate the modern data stack and build your own platform using open-source technologies. Data engineering has experienced enormous growth in recent years, enabling rapid progress and innovation as more people than ever think about data resources and how to leverage them better. In this workshop we will explore the related technologies and build, from scratch, an end-to-end modern data platform for the analysis of medical data.
We will use open-source tools and libraries, including the Python-based DBT, Apache Airflow and Querybook.
The platform will consist of the following components:
- Data warehouse
- Data integration
- Data transformation
- Data orchestration
- Data visualization
- Install Python
- Install Java
- Install Docker
- On Linux, edit your /etc/hosts and add:
172.17.0.1 docker.host.internal
- Download the Synthea patient data generator: synthea-with-dependencies.jar
- Install Dremio:
docker pull dremio/dremio-oss
- Install PostgreSQL:
docker pull postgres
- Install DBT:
pip install dbt-postgres dbt-dremio
- Install Airflow
- Install Querybook
- Install Datahub
- Clone this repo:
git clone https://github.com/alabarga/pybcn22-modern-data-stack.git
- Generate or download synthetic data
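
The /etc/hosts step in the list above can be scripted so that repeated runs do not add the entry twice. The following is a minimal sketch and not part of the workshop repo: the function name `add_host_entry` and the scratch file `demo_hosts` are hypothetical, and the demo writes to a temporary file rather than to the real /etc/hosts.

```shell
# Hypothetical helper: append "IP NAME" to a hosts file only if NAME
# is not already mapped on a non-comment line.
add_host_entry() {
    hosts_file="$1"
    ip="$2"
    name="$3"
    # awk exits 0 when NAME appears as a whole field after the IP column.
    if awk -v n="$name" '
        $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }
    ' "$hosts_file"; then
        echo "$name already present in $hosts_file"
    else
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}

# Demo against a scratch file; the second call leaves the file unchanged.
demo_hosts="$(mktemp)"
add_host_entry "$demo_hosts" 172.17.0.1 docker.host.internal
add_host_entry "$demo_hosts" 172.17.0.1 docker.host.internal
cat "$demo_hosts"
```

To apply it for real, define the function in a root shell and pass /etc/hosts as the first argument.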