Kafka to Clickhouse


Description

This project implements an ETL system that stores movie view data for analysts. Because the service must handle a constant stream of events from every user, it uses the event streaming platform Kafka. A thin API layer built with the FastAPI framework forwards events to Kafka without transforming them. The ETL process that loads the data into the analytical store is implemented with PySpark, a library for batch and stream data processing. The storage must handle very large volumes of data and answer queries quickly enough for analysts to do their research. The project therefore included research into storage options, and the analytical (OLAP) database ClickHouse was chosen.
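As a rough illustration of the "no transformations" API layer, the sketch below serializes a view event into the key and value bytes a Kafka producer would send. The `ViewEvent` fields, the `user_id` field, and the `user:film` key format are assumptions made for this example and are not taken from the project code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ViewEvent:
    """Hypothetical event shape; the field names are assumptions."""

    user_id: str
    film_id: str
    frame: int


def to_kafka_record(event: ViewEvent) -> tuple[bytes, bytes]:
    """Serialize an event into a (key, value) pair of bytes.

    The payload is passed through as plain JSON with no transformation.
    The key combines user and film (an assumed convention) so that, with
    Kafka's default partitioner, one viewer's progress for one film
    lands in the same partition.
    """
    key = f"{event.user_id}:{event.film_id}".encode("utf-8")
    value = json.dumps(asdict(event)).encode("utf-8")
    return key, value
```

A producer client would then send `key` and `value` to a topic; the topic name and client library are left out here because the README does not specify them.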

Technologies

Python, Kafka, FastAPI, PySpark, ClickHouse, Vertica, Jupyter Notebook, Docker

How to Run the Project:

Clone the repository and navigate to the infra directory:

git clone https://github.com/temirovazat/kafka-to-clickhouse.git
cd kafka-to-clickhouse/infra/

Create a .env file and add project settings:

nano .env
# Kafka
KAFKA_HOST=kafka
KAFKA_PORT=9092

# Clickhouse
CLICKHOUSE_HOST=clickhouse-node1
CLICKHOUSE_PORT=9000
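
For reference, a service could read these settings from its environment roughly as follows. This is a minimal sketch, not the project's actual configuration code: the `load_settings` helper and the dictionary keys are hypothetical, and the defaults simply mirror the values shown above.

```python
import os


def load_settings(env=None):
    """Read Kafka and ClickHouse settings from a mapping (default: os.environ).

    Hypothetical helper: the fallback values repeat the .env example above.
    """
    env = os.environ if env is None else env
    kafka_host = env.get("KAFKA_HOST", "kafka")
    kafka_port = env.get("KAFKA_PORT", "9092")
    return {
        # Kafka clients usually take a "host:port" bootstrap address.
        "kafka_bootstrap": f"{kafka_host}:{kafka_port}",
        "clickhouse_host": env.get("CLICKHOUSE_HOST", "clickhouse-node1"),
        # Ports arrive as strings from the environment, so convert explicitly.
        "clickhouse_port": int(env.get("CLICKHOUSE_PORT", "9000")),
    }
```

Passing a plain dictionary instead of relying on `os.environ` makes the helper easy to check without touching the real environment.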

Deploy and run the project in containers:

docker-compose up

Send a POST request with the current movie view frame:

http://127.0.0.1/films/<UUID>/video_progress
{
    "frame": <INTEGER>
}
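
The same request can be built from Python with only the standard library. The sketch below assumes the API is reachable at `http://127.0.0.1` as shown above; the `build_progress_request` helper name is hypothetical, and the sending step is left commented out because it needs the containers to be running.

```python
import json
import urllib.request
import uuid


def build_progress_request(film_id, frame, base_url="http://127.0.0.1"):
    """Build (but do not send) the POST request for a film's view progress."""
    body = json.dumps({"frame": frame}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/films/{film_id}/video_progress",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_progress_request(uuid.uuid4(), frame=1200)

# To actually send it (requires the containers to be running):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```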