Final project for the Big Data course (A.Y. 2022/2023) at the University of Rome La Sapienza.
The project implements three different techniques to solve the Spotify Million Playlist Dataset Challenge, hosted on AIcrowd.
The methods are implemented using PySpark so that the data can be processed on a distributed system.
There is also a re-implementation of MMCF: Multimodal Collaborative Filtering for Automatic Playlist Continuation, by the "Hello World" team, which placed 2nd in the challenge. The re-implementation consists of converting the neural network from TensorFlow v1 to PyTorch, and using Petastorm to create a PyTorch DataLoader from a PySpark DataFrame so that the data stays distributed.
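As a rough illustration of the Petastorm step, the sketch below feeds a PySpark DataFrame to PyTorch through Petastorm's Spark converter. It is not the project's actual code: the helper names `to_file_url` and `spark_df_to_torch_batches`, the cache location and the batch size are all assumptions made for the example. It also assumes that `pyspark`, `petastorm` and `torch` are installed and that the DataFrame contains only column types Petastorm can convert.

```python
from pathlib import Path
from typing import Any, Dict, Iterator


def to_file_url(path: str) -> str:
    """Turn a local path into a file:// URL (hypothetical helper).

    Petastorm expects the converter cache directory as a URL with a scheme.
    """
    return Path(path).absolute().as_uri()


def spark_df_to_torch_batches(
    spark: Any,
    df: Any,
    cache_dir: str = "/tmp/petastorm_cache",  # assumed location
    batch_size: int = 64,                     # assumed value
) -> Iterator[Dict[str, Any]]:
    """Yield one epoch of PyTorch batches read from a Spark DataFrame.

    Hypothetical helper: `spark` is an active SparkSession and `df` a
    DataFrame created from it.
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Tell Petastorm where to materialize the DataFrame as Parquet files.
    spark.conf.set(
        SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, to_file_url(cache_dir)
    )
    converter = make_spark_converter(df)

    # The context manager closes the underlying reader on exit.
    with converter.make_torch_dataloader(
        batch_size=batch_size, num_epochs=1
    ) as loader:
        for batch in loader:
            # Each batch maps a column name to a tensor of that column.
            yield batch
```

A training loop would then iterate over `spark_df_to_torch_batches(spark, df)` and pass the relevant columns of each batch to the model. Since the converter materializes the DataFrame to the cache directory, on a real cluster that directory would have to be on storage reachable by every worker rather than on a local path.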
The folder structure is the following:
core
: contains the notebooks and other files that make up the core algorithms implementing the recommender systems;
slides
: contains the source code for the presentation, made using Slidev;
webapp
: contains the code for a demo app built with Vite + React + FastAPI that showcases the usage of the system.
demo.mp4
Python • PySpark • PyTorch • Petastorm • TypeScript • MinIO • React • Tailwind • FastAPI • Vite • MongoDB • Docker • Slidev