The platform follows a Data Lakehouse architecture. Its main components are:
- Distributed query/execution engine: Spark Thrift Server
- Stream processing: Kafka
- Storage: HDFS
- Data mart: ClickHouse
- Orchestration: Airflow
- Main file format: Parquet with Snappy compression
- Warehouse table formats: Hive and Iceberg
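As a minimal sketch of how these pieces fit together: once the stack is up, tables can be created through the Spark Thrift Server (localhost:10000, see the service ports below) as Iceberg or Hive tables backed by Snappy-compressed Parquet. The database and table names below are placeholders, and the statement assumes the Thrift Server is configured with the Iceberg Spark extensions and that a beeline client is available.

```bash
# Hypothetical example: create an Iceberg table stored as Snappy-compressed Parquet
# via the Spark Thrift Server. example_db/events are placeholder names.
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE DATABASE IF NOT EXISTS example_db;
  CREATE TABLE IF NOT EXISTS example_db.events (
    event_id   STRING,
    event_time TIMESTAMP
  )
  USING iceberg
  TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'snappy'
  );
"
```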
- Install external dependencies:
  - JDK 23: https://download.oracle.com/java/23/latest/jdk-23_linux-aarch64_bin.tar.gz
  - Hadoop 3.4.0: https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz

  Download both archives and extract them to /services/airflow/dependencies (see the sketch below).
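  A minimal sketch, assuming the dependencies directory is ./services/airflow/dependencies relative to the repository root (matching the other ./services/ paths in this guide):

  ```bash
  # Download and unpack JDK 23 and Hadoop 3.4.0 into the Airflow dependencies directory.
  DEPS_DIR=./services/airflow/dependencies
  mkdir -p "$DEPS_DIR"

  wget -P "$DEPS_DIR" https://download.oracle.com/java/23/latest/jdk-23_linux-aarch64_bin.tar.gz
  wget -P "$DEPS_DIR" https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz

  tar -xzf "$DEPS_DIR"/jdk-23_linux-aarch64_bin.tar.gz -C "$DEPS_DIR"
  tar -xzf "$DEPS_DIR"/hadoop-3.4.0.tar.gz -C "$DEPS_DIR"
  ```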
- Change the variable IS_RESUME in ./services/metastore/docker-compose.yml to False (a sketch of one way to do this follows).
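  A sketch, assuming the variable appears in the compose file as `IS_RESUME: True`; adjust the pattern if it is formatted differently (e.g. `IS_RESUME=True`):

  ```bash
  # Flip IS_RESUME to False for the first run and confirm the change.
  sed -i 's/IS_RESUME: *True/IS_RESUME: False/' ./services/metastore/docker-compose.yml
  grep -n IS_RESUME ./services/metastore/docker-compose.yml
  ```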
- Grant all permissions on the local HDFS data directory:
  sudo mkdir -p ./services/hadoop/data
  sudo chmod -R 777 ./services/hadoop/data
- Create the Docker network:
  docker network create default_net
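  If this step may be rerun, a guard like the sketch below keeps it from failing when the network already exists:

  ```bash
  # Create default_net only if it does not already exist (safe to rerun).
  docker network inspect default_net >/dev/null 2>&1 || docker network create default_net
  ```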
- Bring everything up with Docker:
  bash start_all_service.sh

  After all the steps above have completed, change IS_RESUME back to True and rerun start_all_service.sh (see the sketch below).
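  A sketch of that last step, using the same assumption about how IS_RESUME is formatted in the compose file:

  ```bash
  # After the first run has completed successfully, switch back to resume mode and restart.
  sed -i 's/IS_RESUME: *False/IS_RESUME: True/' ./services/metastore/docker-compose.yml
  bash start_all_service.sh
  ```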
- Check the UI pages of the running services:
  - Apache Airflow: localhost:8080
  - Apache Superset UI: localhost:8088
  - Spark Master UI: localhost:8082
  - Spark Thrift UI: localhost:4040
- Service ports for TCP connections:
  - Nginx for the HTTP logging data source: localhost:8183
  - Spark Master: localhost:7077
  - Spark Thrift: localhost:10000
  - MSSQL Server: localhost:1433 (user: root, password: root@@@123)
  - Kafka:
    - Kafka UI: http://localhost:9090
    - Kafka broker: localhost:9092
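A quick, non-exhaustive way to check that these endpoints are reachable once the stack is running (curl and nc/netcat are assumed to be installed on the host):

```bash
# Probe the web UIs over HTTP; a response of any status means the service is reachable.
for port in 8080 8088 8082 4040 9090; do
  curl -sS -o /dev/null "http://localhost:$port" && echo "UI on port $port is reachable"
done

# Probe the raw TCP service ports.
for port in 8183 7077 10000 1433 9092; do
  nc -zv localhost "$port"
done
```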