Kafka to Clickhouse

Description

This project implements an ETL system that stores movie-view data for analysts. Because the service must handle a constant stream of events from every user, it uses the Kafka event streaming platform. The API layer is built with the FastAPI framework and forwards events to Kafka without transforming them. The ETL process that loads data into the analytical store is implemented with PySpark, a library for batch and stream data processing. The store must hold very large volumes of data and answer queries quickly enough for analysts to conduct their research. The project therefore included research comparing storage options, which identified the OLAP database ClickHouse as the best choice.
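The message schema used by the repository is not shown in this README, so the following is only a minimal sketch of what "forwarding an event to Kafka without transformation" could look like. The `KafkaRecord` container, the `to_record` helper, the `views` topic name, the `user_id` field, and the `<user_id>+<film_id>` key format are all assumptions made for illustration.

```python
import json
from typing import NamedTuple
from uuid import UUID


class KafkaRecord(NamedTuple):
    """Hypothetical container for what a producer would send to Kafka."""
    topic: str
    key: bytes
    value: bytes


def to_record(user_id: UUID, film_id: UUID, frame: int, topic: str = "views") -> KafkaRecord:
    """Build a Kafka record from a view event without altering its payload."""
    # Assumption: keying by user and film keeps the events of one viewing
    # in the same partition, which preserves their order.
    key = f"{user_id}+{film_id}".encode("utf-8")
    # The frame number is passed through unchanged as a JSON value.
    value = json.dumps({"frame": frame}).encode("utf-8")
    return KafkaRecord(topic=topic, key=key, value=value)
```

A real producer client would then send `record.key` and `record.value` to `record.topic`; the client library is left out here so that the sketch stays self-contained.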

Technologies

Python, Kafka, FastAPI, PySpark, ClickHouse, Vertica, Jupyter Notebook, Docker

How to Run the Project:

Clone the repository and navigate to the infra directory:

git clone https://github.com/temirovazat/kafka-to-clickhouse.git

cd kafka-to-clickhouse/infra/

Create a .env file and add project settings:

nano .env

# Kafka
KAFKA_HOST=kafka
KAFKA_PORT=9092

# Clickhouse
CLICKHOUSE_HOST=clickhouse-node1
CLICKHOUSE_PORT=9000
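For illustration, a service could read these four variables roughly as follows. This is a sketch under assumptions, not the repository's actual settings code: the `Settings` class and `load_settings` function are hypothetical names. Port 9092 is Kafka's default broker port and 9000 is ClickHouse's native TCP port.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object mirroring the variables in .env."""
    kafka_host: str
    kafka_port: int
    clickhouse_host: str
    clickhouse_port: int

    @property
    def kafka_bootstrap(self) -> str:
        # Kafka clients usually expect the broker address as "host:port".
        return f"{self.kafka_host}:{self.kafka_port}"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the settings from a mapping; raises KeyError if one is missing."""
    return Settings(
        kafka_host=env["KAFKA_HOST"],
        kafka_port=int(env["KAFKA_PORT"]),
        clickhouse_host=env["CLICKHOUSE_HOST"],
        clickhouse_port=int(env["CLICKHOUSE_PORT"]),
    )
```

Passing a plain dictionary instead of `os.environ` makes the function easy to check without touching the real environment.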

Deploy and run the project in containers:

docker-compose up

Send a POST request containing the current playback frame of the movie:

http://127.0.0.1/films/<UUID>/video_progress

{
    "frame": <INTEGER>
}
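The request above can be built with the Python standard library as in the sketch below. The `build_progress_request` helper is a hypothetical name, and any authentication the service may require is not covered here.

```python
import json
import urllib.request
from uuid import UUID


def build_progress_request(
    film_id: UUID, frame: int, base_url: str = "http://127.0.0.1"
) -> urllib.request.Request:
    """Build (but do not send) the POST request for a movie's view progress."""
    # The body carries a single integer field, as in the example payload.
    body = json.dumps({"frame": frame}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/films/{film_id}/video_progress",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it, the containers must be running:
# request = build_progress_request(UUID("3fa85f64-5717-4562-b3fc-2c963f66afa6"), 1500)
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it lets the URL and payload be inspected before anything reaches the service.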
