This repository contains an implementation of Spark's built-in implicit ALS matrix factorization, used to build an implicit recommender system on the Million Song Dataset (https://www.kaggle.com/c/msdchallenge). The 200+ GB dataset is housed on NYU's High Performance Computing (HPC) cluster, Peel, where all computation was performed. This work was completed for credit as the final project for DS-GA 1004 (Big Data) at NYU CDS.
The following files were run sequentially to obtain the final results from the ALS model (i.e., 500 recommendations per user):
- Build_Hash.py: creates a uniform integer hash key for the train, test, and validation sets; the key is then saved to HDFS
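The actual job runs on Spark, but the core idea of the hash key can be sketched in plain Python: every distinct string ID (user or track) is assigned a dense integer index, since ALS requires integer user and item columns. The function name and sample IDs below are illustrative, not taken from Build_Hash.py.

```python
def build_hash_key(ids):
    """Map each distinct string ID to a stable, dense integer index."""
    key = {}
    for raw_id in ids:
        if raw_id not in key:
            key[raw_id] = len(key)  # next unused integer, in first-seen order
    return key

# Toy stand-in for the raw Million Song Dataset user IDs.
users = ["u_abc", "u_def", "u_abc", "u_xyz"]
key = build_hash_key(users)
# key == {"u_abc": 0, "u_def": 1, "u_xyz": 2}
```

Building one key over train, test, and validation together (rather than per split) keeps the integer IDs consistent across all three datasets.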
- Parquet_Build.py: loads the uniform hash key from HDFS, applies it to each of the datasets, and writes the new files back out to HDFS as Parquet
- GridSearch_All.py: performs a grid search over the ALS model's hyperparameters
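A grid search over ALS typically enumerates the cross product of a few hyperparameter lists (commonly rank, regularization, and the implicit-feedback alpha), fits a model for each combination, and scores it on the validation set. The specific values below are placeholders, not the grid actually used in GridSearch_All.py:

```python
from itertools import product

# Hypothetical search space; the real grid values are not given in this README.
ranks = [10, 50, 100]
reg_params = [0.01, 0.1, 1.0]
alphas = [1.0, 10.0, 40.0]

# Each (rank, regParam, alpha) triple would be fit with ALS and evaluated on
# the validation set; the best-scoring triple feeds the final model run.
grid = list(product(ranks, reg_params, alphas))
```

With three values per parameter this yields 27 candidate models, which is why the search was run on the HPC cluster rather than locally.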
- GridSearchFinal: folder that contains the results of our grid search and the corresponding Jupyter notebook
- FinalModel.py: contains our final model run, with the optimal hyperparameters (run to a high number of iterations)
- Subsample.py: subsamples 0.5% of the train and test user/track/count data
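Subsample.py presumably relies on Spark's `DataFrame.sample(fraction=...)`; the same Bernoulli-style sampling idea can be illustrated in plain Python (the row contents here are invented for the example):

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Toy stand-in for user/track/count rows.
rows = [("user_%d" % i, "track_%d" % (i % 97), 1) for i in range(10_000)]

# Keep each row independently with probability 0.005 (the 0.5% noted above);
# roughly 0.005 * 10_000 ≈ 50 rows survive.
fraction = 0.005
sample = [r for r in rows if random.random() < fraction]
```

Working on a 0.5% subsample keeps iteration fast during development before scaling back up to the full 200+ GB dataset on the cluster.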
- Lenskit_Extension.ipynb: extension results