# Bigdata Pipeline

This repository deploys a simple data pipeline for processing, storing, and visualizing data: a hands-on Big Data solution with both batch and stream processing.

*Architecture diagram*

## 📖 Overview

- Capture: RESTful API and flat files of stock data
- Ingest: Kafka (see the producer sketch below)
- Store: Hadoop, Hive
- Compute: Spark, Flink, Trino
- Visualize: Superset
- Workflow: Airflow
- Container Orchestration: Docker
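
As an illustration of the capture-and-ingest path, here is a minimal Python sketch that reads stock quotes from a REST endpoint and publishes them to Kafka. The endpoint URL, the topic name `stock-quotes`, and the broker address are placeholder assumptions, not values taken from this repository.

```python
# Hypothetical sketch: fetch stock quotes from a REST API and publish to Kafka.
# The API URL, topic name, and broker address below are assumptions.
import json

import requests
from kafka import KafkaProducer  # pip install kafka-python

API_URL = "https://example.com/api/quotes"  # placeholder endpoint

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for quote in requests.get(API_URL, timeout=10).json():
    # Key by ticker symbol so each symbol's events stay ordered in one partition.
    producer.send("stock-quotes", key=quote["symbol"].encode(), value=quote)

producer.flush()
```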

## 🛠️ Quick Start

To deploy the cluster, run:

```bash
docker compose up
bash script/setup.sh
```

Spark computes several return metrics over the stock prices: simple return, log return, and cumulative return.
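
For reference, the simple return is `P_t / P_{t-1} - 1`, the log return is `ln(P_t / P_{t-1})`, and the cumulative return compounds the simple returns over time. Below is a minimal PySpark sketch of these calculations; the table name `stock_prices` and the columns `symbol`, `date`, and `close` are assumptions about the schema, not taken from this repository.

```python
# Hypothetical sketch of the return calculations; table and column names are assumptions.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stock-returns").getOrCreate()
prices = spark.table("stock_prices")  # assumed Hive table of daily closes

w = Window.partitionBy("symbol").orderBy("date")
returns = (
    prices
    .withColumn("prev_close", F.lag("close").over(w))
    # The first row of each symbol has no previous close, so its returns are null.
    .withColumn("simple_return", F.col("close") / F.col("prev_close") - 1)
    .withColumn("log_return", F.log(F.col("close") / F.col("prev_close")))
    # Cumulative return: compound the simple returns from the start of the series.
    .withColumn(
        "cumulative_return",
        F.exp(
            F.sum(F.log1p(F.col("simple_return"))).over(
                w.rowsBetween(Window.unboundedPreceding, Window.currentRow)
            )
        ) - 1,
    )
)
```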

Use Superset to visualize the reports for business insight. To query the data, register Trino as a database in Superset.
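
Superset connects to Trino through a SQLAlchemy URI of the following form; the user, hostname, and catalog below are placeholders, and 8080 assumes Trino's default port:

```
trino://<user>@<trino-host>:8080/<catalog>
```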

## Update

- Build stream processing
- Build batch processing
- Integrate Airflow
- Integrate multiple data sources