The platform follows a Data Lakehouse architecture. Its main components are:
- Distributed query/execution engine: Spark Thrift Server
- Stream processing: Kafka
- Storage: HDFS
- Data mart: ClickHouse
- Orchestration: Airflow
- Main file format: Parquet with Snappy compression
- Warehouse table formats: Hive and Iceberg
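As a minimal sketch of how these pieces fit together: once the stack is up, tables can be created through the Spark Thrift Server (localhost:10000, see the service ports below) as Iceberg or Hive tables backed by Snappy-compressed Parquet. The database and table names below are placeholders, and the statement assumes the Thrift Server is configured with the Iceberg Spark extensions and that a beeline client is available.

```bash
# Hypothetical example: create an Iceberg table stored as Snappy-compressed Parquet
# via the Spark Thrift Server. example_db/events are placeholder names.
beeline -u jdbc:hive2://localhost:10000 -e "
  CREATE DATABASE IF NOT EXISTS example_db;
  CREATE TABLE IF NOT EXISTS example_db.events (
    event_id   STRING,
    event_time TIMESTAMP
  )
  USING iceberg
  TBLPROPERTIES (
    'write.format.default' = 'parquet',
    'write.parquet.compression-codec' = 'snappy'
  );
"
```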
- Install external dependencies:
  - JDK 23: https://download.oracle.com/java/23/latest/jdk-23_linux-aarch64_bin.tar.gz
  - Hadoop 3.4.0: https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz

  Download both archives and extract them to /services/airflow/dependencies (see the sketch below).
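  A minimal sketch, assuming the dependencies directory is ./services/airflow/dependencies relative to the repository root (matching the other ./services/ paths in this guide):

  ```bash
  # Download and unpack JDK 23 and Hadoop 3.4.0 into the Airflow dependencies directory.
  DEPS_DIR=./services/airflow/dependencies
  mkdir -p "$DEPS_DIR"

  wget -P "$DEPS_DIR" https://download.oracle.com/java/23/latest/jdk-23_linux-aarch64_bin.tar.gz
  wget -P "$DEPS_DIR" https://dlcdn.apache.org/hadoop/common/hadoop-3.4.0/hadoop-3.4.0.tar.gz

  tar -xzf "$DEPS_DIR"/jdk-23_linux-aarch64_bin.tar.gz -C "$DEPS_DIR"
  tar -xzf "$DEPS_DIR"/hadoop-3.4.0.tar.gz -C "$DEPS_DIR"
  ```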
- Change the variable IS_RESUME in ./services/metastore/docker-compose.yml to False (a sketch of one way to do this follows).
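  A sketch, assuming the variable appears in the compose file as `IS_RESUME: True`; adjust the pattern if it is formatted differently (e.g. `IS_RESUME=True`):

  ```bash
  # Flip IS_RESUME to False for the first run and confirm the change.
  sed -i 's/IS_RESUME: *True/IS_RESUME: False/' ./services/metastore/docker-compose.yml
  grep -n IS_RESUME ./services/metastore/docker-compose.yml
  ```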
- Grant all permissions on the local HDFS data directory:
  sudo mkdir -p ./services/hadoop/data
  sudo chmod -R 777 ./services/hadoop/data
- Create the Docker network:
  docker network create default_net
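  If this step may be rerun, a guard like the sketch below keeps it from failing when the network already exists:

  ```bash
  # Create default_net only if it does not already exist (safe to rerun).
  docker network inspect default_net >/dev/null 2>&1 || docker network create default_net
  ```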
- Bring everything up with Docker:
  bash start_all_service.sh

  After all the steps above have completed, change IS_RESUME back to True and rerun start_all_service.sh (see the sketch below).
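  A sketch of that last step, using the same assumption about how IS_RESUME is formatted in the compose file:

  ```bash
  # After the first run has completed successfully, switch back to resume mode and restart.
  sed -i 's/IS_RESUME: *False/IS_RESUME: True/' ./services/metastore/docker-compose.yml
  bash start_all_service.sh
  ```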
- Check the UI pages of the running services:
  - Apache Airflow: localhost:8080
  - Apache Superset UI: localhost:8088
  - Spark Master UI: localhost:8082
  - Spark Thrift UI: localhost:4040
- Service ports for TCP connections:
  - Nginx for the HTTP logging data source: localhost:8183
  - Spark Master: localhost:7077
  - Spark Thrift: localhost:10000
  - MSSQL Server: localhost:1433 (user: root, password: root@@@123)
  - Kafka:
    - Kafka UI: http://localhost:9090
    - Kafka broker: localhost:9092
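A quick, non-exhaustive way to check that these endpoints are reachable once the stack is running (curl and nc/netcat are assumed to be installed on the host):

```bash
# Probe the web UIs over HTTP; a response of any status means the service is reachable.
for port in 8080 8088 8082 4040 9090; do
  curl -sS -o /dev/null "http://localhost:$port" && echo "UI on port $port is reachable"
done

# Probe the raw TCP service ports.
for port in 8183 7077 10000 1433 9092; do
  nc -zv localhost "$port"
done
```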