In this project, I built a real-time data ingestion pipeline with Apache Kafka and Spark Streaming to collect and process financial data from Yahoo Finance and Finnhub, analyze it in Jupyter Notebook, and generate financial reports using Power BI.
- Data Source: This project uses two main data sources: the Yahoo Finance API and the Finnhub Stock API.
  - Yahoo Finance API: Data is collected from Yahoo Finance's API using the `yfinance` library in real time, with a 1-minute interval between data points. Collected data includes indicators such as `Open`, `Close`, `Volume`, `Datetime`, etc.
  - Finnhub Stock API: Data is collected from Finnhub's API in real time. Collected data includes transaction indicators such as `v` (volume), `p` (last price), `t` (time), etc.
- Extract Data: After being collected, data is written to Kafka (Kafka Producer), with a separate topic for each data source.
- Transform Data: Once data is sent to a Kafka topic, it is read by Spark Streaming (Kafka Consumer) and processed in real time. Spark is set up with 3 worker nodes, applying Spark's distributed nature to large-scale data processing.
- Load Data: As the data is processed, it is loaded directly into the Cassandra database using Spark.
- Serving: Power BI provides detailed insights and financial reports, and investment performance is analyzed to guide strategic decision-making and optimize portfolio management.
- Packaging and Orchestration: Components are packaged using Docker and orchestrated using Apache Airflow.
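The Finnhub step above ingests raw trade messages. A minimal sketch of flattening one such message into rows, assuming the documented Finnhub websocket trade payload (a `data` array whose entries carry `s` for symbol alongside the `v`/`p`/`t` fields listed above):

```python
import json

def parse_trades(message: str) -> list:
    """Flatten a Finnhub trade message into one row per trade.

    Assumes the payload shape:
    {"type": "trade", "data": [{"s": ..., "p": ..., "v": ..., "t": ...}]}
    """
    payload = json.loads(message)
    return [
        {"symbol": t["s"], "price": t["p"], "volume": t["v"], "time": t["t"]}
        for t in payload.get("data", [])
    ]
```

Each row can then be serialized and published to the Finnhub Kafka topic.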
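The Extract step (one Kafka topic per data source) can be sketched with the `kafka-python` client. The bootstrap address and the `market.*` topic names are illustrative assumptions, not taken from the project:

```python
import json

def topic_for(source: str) -> str:
    # One Kafka topic per data source, e.g. "market.yahoo", "market.finnhub"
    # (topic naming scheme is an assumption for illustration).
    return f"market.{source}"

def make_producer(bootstrap: str = "localhost:9092"):
    # kafka-python is assumed; imported lazily so topic_for stays standalone.
    from kafka import KafkaProducer
    return KafkaProducer(
        bootstrap_servers=bootstrap,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

def publish(producer, source: str, record: dict) -> None:
    # Route each record to its source-specific topic.
    producer.send(topic_for(source), record)
```

Keeping sources on separate topics lets the Spark consumer subscribe to and process each feed independently.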
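The Transform and Load steps can be sketched with Spark Structured Streaming reading from Kafka and writing to Cassandra. This assumes the `spark-sql-kafka` and `spark-cassandra-connector` packages are on the classpath; the keyspace, table, topic, and checkpoint path are placeholder names:

```python
def cassandra_options(keyspace: str, table: str) -> dict:
    # Option names expected by the spark-cassandra-connector DataFrame writer.
    return {"keyspace": keyspace, "table": table}

def run_stream(bootstrap: str = "localhost:9092", topic: str = "market.yahoo"):
    # pyspark imported lazily so the helper above stays importable without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructType

    spark = SparkSession.builder.appName("finance-stream").getOrCreate()

    # Schema mirrors the yfinance fields mentioned above.
    schema = (StructType()
              .add("Datetime", StringType())
              .add("Open", DoubleType())
              .add("Close", DoubleType())
              .add("Volume", DoubleType()))

    bars = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", bootstrap)
            .option("subscribe", topic)
            .load()
            # Kafka values arrive as bytes; decode and parse the JSON payload.
            .select(from_json(col("value").cast("string"), schema).alias("bar"))
            .select("bar.*"))

    (bars.writeStream
         .format("org.apache.spark.sql.cassandra")
         .options(**cassandra_options("market", "yahoo_bars"))
         .option("checkpointLocation", "/tmp/checkpoints/yahoo")
         .start()
         .awaitTermination())
```

On a cluster with the project's 3 worker nodes, Spark partitions the Kafka topic's data across executors, which is where the distributed processing described above comes from.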
Technologies used: Yahoo Finance API, Finnhub Stock API, Apache Kafka, Apache Spark, Cassandra, Power BI, Jupyter Notebook, Apache Airflow, Docker.