mimbres/seqskip: Spotify Sequential Skip Prediction Challenge
(Since 9th Dec 2018) Paper, WSDM 2019 Cup Result
Our best submission result (aacc=0.637) with sequence learning was based on seqskip_train_seq1HL.py.
- For testing, use seqskip_test_seq1.py.
- This model consists of highway layers (or GLUs) and dilated convolution layers.
- A very similar model can be found in the encoder of DCTTS (ICASSP 2018, H. Tachibana et al.).
- Other variants of this model use various attention modules.
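The highway/GLU-plus-dilated-convolution idea above can be sketched as follows. This is a minimal illustrative block in PyTorch, not the repo's exact architecture; the class name and channel sizes are assumptions:

```python
# Sketch (not the repo's exact model): a GLU-gated 1-D convolution block
# with dilation and a residual connection, in the spirit of the DCTTS encoder.
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """Dilated Conv1d followed by a GLU gate, with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # keep sequence length fixed
        # 2x channels: half for the linear path, half for the sigmoid gate
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size,
                              padding=pad, dilation=dilation)

    def forward(self, x):                  # x: (batch, channels, time)
        h, g = self.conv(x).chunk(2, dim=1)
        return x + h * torch.sigmoid(g)    # residual + GLU

# Stacking blocks with growing dilation widens the receptive field cheaply.
encoder = nn.Sequential(*[GatedConv1d(64, dilation=d) for d in (1, 2, 4, 8)])
out = encoder(torch.randn(2, 64, 20))      # shape preserved: (2, 64, 20)
```

The growing dilation schedule (1, 2, 4, 8) is what lets a short stack of convolutions see a whole 20-track session.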
Other approaches (in progress):
- Multi-task learning with a 1-stack generative model (aacc=0.635) that generates the user log ahead of skip prediction: seqskip_train_GenLog1H_Skip1eH.py
- A 2-stack large generative model (not submitted; very slow): seqskip_train_GenLog1H_Skip1eH256full.py
- Learning to Compare (CVPR 2018, F. Sung et al.), a relation-network-based meta-learning approach: seqskip_train_rnbc1*.py
- Learning to Compare with some improvements: seqskip_train_rnbc2*.py
- A naive Transformer with a multi-head attention model: seqskip_train_MH*.py
- seqskip_train_MH_seq1H_v3_dt2.py can be thought of as an approach similar to SNAIL (ICLR 2018, N. Mishra et al.) without its data-shuffling method.
- Distillation (NIPS 2014, G. Hinton et al.) approaches are in seqskip_train_T* and seqskip_train_S* for the teacher and student nets, respectively. (Surprisingly, the teacher reaches aacc>0.8 in validation by using metadata for queries!) We believe this approach can be an interesting topic at the workshop!
- etc.
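The teacher-student idea above can be sketched with the classic distillation loss (Hinton et al., 2014). This is an illustrative sketch, not the repo's training code; the function name, temperature, and mixing weight are assumptions:

```python
# Sketch of the distillation objective: the student matches the teacher's
# softened skip probabilities while also fitting the hard skip labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft-target KL (at temperature T) with the hard-label loss."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)   # T^2 rescales gradients
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy batch of 8 sessions, binary skip prediction (2 classes).
loss = distillation_loss(torch.randn(8, 2), torch.randn(8, 2),
                         torch.randint(0, 2, (8,)))
```

Here the teacher can be trained with privileged query metadata, while the student sees only the inputs available at test time.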
Dataset split:
- 80% for training
- 20% for validation
Requirements:
- PyTorch 0.4 or 1.0
- pandas, numpy, tqdm, sklearn
- Tested with a Titan V or 1080 Ti GPU
Data preparation:
- Please modify config_init_dataset.json to set the path of the original data files.
- Please run preparing_data.py (because the data is huge, we compress it into an 8-bit uint np.memmap).
- Thanks to np.memmap, we can use 50 GB+ of virtual memory for metadata.
- Acoustic features are loaded into physical memory (11 GB).
- spotify_data_loader.py (or spotify_data_loader_v2.py) is the data-loader class used for training.
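The uint8 memmap trick can be sketched as below. The file name and array shape here are illustrative, not the repo's actual values:

```python
# Sketch: quantize logs to 8-bit, write them once as a memmap, then re-open
# read-only so the OS pages data in on demand instead of filling RAM.
import numpy as np

n_rows, n_cols = 1000, 20                      # toy size, not the real layout
mm = np.memmap("tr_log_demo.dat", dtype=np.uint8,
               mode="w+", shape=(n_rows, n_cols))
mm[:] = np.random.randint(0, 256, size=(n_rows, n_cols), dtype=np.uint8)
mm.flush()
del mm                                         # close the write handle

# Training code re-opens the same file read-only.
logs = np.memmap("tr_log_demo.dat", dtype=np.uint8,
                 mode="r", shape=(n_rows, n_cols))
batch = np.asarray(logs[:32], dtype=np.float32)  # copy one slice into RAM
```

Only the slices actually touched are paged into physical memory, which is what makes a 50 GB+ virtual array practical.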
- Normalization:
- Many categorical user-behavior logs are encoded into one-hot vectors.
- The number of click-forward/backward events was min-max normalized after taking the logarithm.
- We didn't make use of dates.
- Acoustic EchoNest features were standardized to mean=0 and std=1.
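The normalization steps above can be sketched as follows; the column choices and feature layout are illustrative, not the repo's exact preprocessing:

```python
# Sketch of the three normalization steps described above.
import numpy as np

# 1) One-hot encode a categorical log field (e.g. a 3-category pause state).
pause = np.array([0, 2, 1, 2])
onehot = np.eye(3, dtype=np.float32)[pause]          # shape (4, 3)

# 2) Log, then min-max normalize a click-count feature.
clicks = np.array([0, 1, 3, 50], dtype=np.float32)
log_c = np.log1p(clicks)                              # log(1+x) keeps zeros finite
minmax = (log_c - log_c.min()) / (log_c.max() - log_c.min())

# 3) Standardize acoustic features to mean 0, std 1 per column.
feats = np.random.randn(100, 29) * 5 + 3
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
```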
Preprocessed data (updated, Apr 2020):
- https://storage.googleapis.com/skchang_data/seqskip/data/tr_log_memmap.dat
- https://storage.googleapis.com/skchang_data/seqskip/data/tr_session_split_idx.npy
- https://storage.googleapis.com/skchang_data/seqskip/data/track_feat.npy
- https://storage.googleapis.com/skchang_data/seqskip/data/ts_log_memmap.dat
- https://storage.googleapis.com/skchang_data/seqskip/data/ts_session_split_idx.npy
- Training_Set_Split_Download.txt
- Sample_Submissions.tar.gz
- Training_Set.tar.gz
- Dataset Description
- Terms and Conditions
- Track_Features.tar.gz
- Test_Set.tar.gz
- Training_Set_And_Track_Features_Mini (Updated, Sep 2021)
- plot_dataset.py can display some useful stats of the Spotify dataset; you can see them in the /images folder.