This project analyzes data from the Spotify platform, using the Spotify API and MongoDB for data extraction, Apache Hadoop for the ETL process, PySpark for transformation, and Dremio and Power BI for visualization and in-depth data analysis.
We start our data collection by scraping a list of artist names from Spotify Artists. Using this list, we then call the Spotify API to extract comprehensive data about each artist. The raw data obtained this way goes through a series of ETL processes.
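To give a feel for this ingestion step, here is a minimal, hypothetical sketch using `spotipy` and `pymongo` (the database and collection names below are placeholders, not necessarily the ones this repo uses):

```python
import os
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from pymongo import MongoClient

# Authenticate against the Spotify Web API with client credentials from .env
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(
    client_id=os.environ["SPOTIFY_CLIENT_ID"],
    client_secret=os.environ["SPOTIFY_CLIENT_SECRET"],
))

# Connect to MongoDB Atlas using the SRV connection string
mongo = MongoClient(os.environ["MONGODB_SRV"])
collection = mongo["spotify"]["artists"]  # placeholder names

# Look up each scraped artist name and store the raw API response
for name in ["Taylor Swift", "Coldplay"]:
    result = sp.search(q=name, type="artist", limit=1)
    items = result["artists"]["items"]
    if items:
        collection.insert_one(items[0])
```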
This is our demo video on YouTube; you can watch it via this Link.
There are several ways to do this step, but we will use Terraform to deploy the Atlas cluster.
Please follow this Instruction.
Clone this project to your machine by running the following commands:
git clone https://github.com/PhongHuynh0394/Spotify-Analysis-with-PySpark.git
cd Spotify-Analysis-with-PySpark
Then you need to create a .env file based on env_template:
cp env_template .env
Now fill in the blanks in the .env file. The required values come from the Prerequisite and Set up your MongoDB Atlas sections:
# Spotify
SPOTIFY_CLIENT_ID=<your-api-key>
SPOTIFY_CLIENT_SECRET=<your-api-key>
# Mongodb
MONGODB_USER=<your-user-name>
MONGODB_PASSWORD=<your-user-password>
MONGODB_SRV=<your-srv-link> # Get this from running terraform set up
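For reference, the pipeline code can pick these values up from .env, for example with python-dotenv (a minimal sketch; whether this repo actually uses python-dotenv is an assumption):

```python
import os
from dotenv import load_dotenv

# Load SPOTIFY_* and MONGODB_* variables from the .env file into the environment
load_dotenv()

client_id = os.getenv("SPOTIFY_CLIENT_ID")
mongodb_srv = os.getenv("MONGODB_SRV")
print("Spotify client id set:", client_id is not None)
print("MongoDB SRV set:", mongodb_srv is not None)
```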
OK, now it's Docker's job! Build the Docker images of this project by typing `make build` in your terminal.
This process might take a few minutes, so just chill and take a cup of coffee ☕
Note: if this step fails, just remove the image or restart Docker and try again.
Once the Docker images are built, it's time to run your system. Just type `make run`.
Then check your services to make sure everything works correctly:
- Hadoop
  - localhost:9870: Namenode
  - localhost:9864: Datanode
  - localhost:8088: Resource Manager
- Prefect
  - localhost:4200: Prefect Server
- Data Warehouse
  - localhost:9047: Dremio (user: `dremio`, password: `dremio123`)
- Dashboard
  - localhost:8501: Streamlit
- Notebook
  - localhost:8888: Jupyter Notebook (password is `pass`)
We use Prefect to build our data pipelines. When you open port 4200, you'll see the Prefect UI. Go to the Deployments section and you'll find 2 deployments, corresponding to the 2 data pipelines.
This data flow (or pipeline) scrapes data from the Spotify API in batches and ingests it into MongoDB Atlas. It runs automatically every 2 minutes and 5 seconds.
Tips: The purpose of this flow is to prepare your raw data in MongoDB; after it runs you should see 4 collections in your database on MongoDB Atlas. You should run this flow a few times before running pipeline 2.
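Conceptually, pipeline 1 is just a Prefect flow of extract and load tasks. Here is a minimal, hypothetical sketch (the task names, placeholder bodies, and the `flow.serve` scheduling are illustrative, not the repo's actual code):

```python
from prefect import flow, task

@task
def extract_batch(artist_names: list[str]) -> list[dict]:
    # Call the Spotify API for one batch of artists (placeholder body)
    return [{"artist": name} for name in artist_names]

@task
def load_to_mongodb(documents: list[dict]) -> None:
    # Insert the raw documents into MongoDB Atlas (placeholder body)
    print(f"would insert {len(documents)} documents")

@flow(log_prints=True)
def spotify_to_mongodb(artist_names: list[str]) -> None:
    docs = extract_batch(artist_names)
    load_to_mongodb(docs)

if __name__ == "__main__":
    # Serve the flow on an interval of 125 seconds (2 minutes and 5 seconds)
    spotify_to_mongodb.serve(name="pipeline-1", interval=125)
```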
This data flow does the ETL job. It extracts raw data from MongoDB and first full-loads it into the bronze layer in HDFS, then transforms it with PySpark into the silver and gold layers. You can trigger this flow manually by pressing the Run button in the top right corner.
The Bronze, Silver, and Gold layers are simply data qualification directories used to store backups of the data in HDFS.
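For illustration, a simplified PySpark pass over these layers might look like the sketch below (the HDFS paths, port, and column names are made up for the example and do not necessarily match the repo's code):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spotify-etl").getOrCreate()

# Bronze: raw data as it was loaded from MongoDB (path is illustrative)
bronze = spark.read.parquet("hdfs://namenode:9000/bronze/artists")

# Silver: cleaned and deduplicated records
silver = (bronze
          .dropDuplicates(["id"])
          .filter(F.col("popularity").isNotNull()))
silver.write.mode("overwrite").parquet("hdfs://namenode:9000/silver/artists")

# Gold: analysis-ready table that Dremio queries later
gold = silver.select("id", "name", "popularity", "followers")
gold.write.mode("overwrite").parquet("hdfs://namenode:9000/gold_layer/dim_artists")
```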
We use Dremio to analyze the data in HDFS directly. Don't forget: the username is `dremio` and the password is `dremio123`.
Then follow this instruction:
Log in to Dremio > Add Source > Choose HDFS
The connection window will appear; please fill it in as follows:
- Name: HDFS
- NameNode Host: namenode
Then press Save to save your connection. You will see the connection appear in your main window. Go to the gold_layer directory and format all the .parquet directories.
Then run your SQL statements and start analyzing.
You can use our SQL statements in warehouse.sql. These SQL statements create the analytic views that Power BI uses to draw the Dashboard. You can also see them in the PowerBI Dashboard.
Finally, you can open Streamlit to see the Dashboard. It also uses a Machine Learning model to recommend the most popular songs for you. You can also see it in powerbi_dashboard or in our Streamlit app.
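As a rough idea of the recommendation part, here is a minimal, hypothetical Streamlit sketch using a simple popularity-based ranking (the file name and column names are placeholders; the real app's model may differ):

```python
import pandas as pd
import streamlit as st

st.title("Spotify Track Recommendations")

# In the real app the data would come from the gold layer; here we load a local sample
tracks = pd.read_parquet("sample_tracks.parquet")  # expects columns: name, artist, genre, popularity

genre = st.selectbox("Pick a genre", sorted(tracks["genre"].unique()))
top_n = st.slider("How many tracks?", 5, 50, 10)

# Recommend the most popular tracks in the chosen genre
recommended = (tracks[tracks["genre"] == genre]
               .sort_values("popularity", ascending=False)
               .head(top_n))
st.dataframe(recommended[["name", "artist", "popularity"]])
```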
In the future, we plan to update this repo with:
- Utilizing a Deep Learning model: we plan to leverage a Deep Learning model, specifically an NLP model, to analyze the lyrics of tracks.
- Using Flask or other frameworks: our goal is to switch to Flask or another framework, replacing the Streamlit Dashboard for improved functionality.
- Using MongoDB locally: to streamline deployment and allow for personalized configuration, we'll transition to running MongoDB locally.
- Huỳnh Lưu Vĩnh Phong, Data Engineer, Team Lead
- Trần Ngọc Tuấn, Data Engineer
- Phạm Duy Sơn, Data Science
- Mai Chiến Vĩ Thiên, Data Analyst
Feel free to use 😄