Different architectures to classify music files by genre from the GTZAN music corpus, namely:
- Convolutional Neural Network (CNN)
- Recurrent Neural Network (RNN)
- Inception V3
- MobileNet V2
(Implemented with TensorFlow)
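For instance, the transfer-learning variants can be built on a pretrained backbone from `tf.keras.applications`. The sketch below assumes a frozen MobileNet V2 backbone, a dropout rate of 0.3, and a softmax classification head; these are illustrative choices, not necessarily the exact configuration used here:

```python
import tensorflow as tf

NUM_GENRES = 10  # GTZAN has 10 genres

# Pretrained MobileNet V2 backbone without its ImageNet classification head.
# Grayscale spectrograms would need to be replicated to 3 channels first.
base = tf.keras.applications.MobileNetV2(
    input_shape=(512, 512, 3),
    include_top=False,
    weights="imagenet",
)
base.trainable = False  # freeze the backbone for transfer learning

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.3),  # illustrative regularization
    tf.keras.layers.Dense(NUM_GENRES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```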
The GTZAN music corpus contains 10 genres with 100 songs each (1,000 in total): 80% was used for training (800 songs) and 20% for testing (200 songs). After the split, each 30-second song is cut into 10-second chunks, yielding 2,400 training and 600 testing samples.
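A minimal sketch of the chunking step with librosa, assuming a hypothetical file path and that any trailing partial chunk is dropped:

```python
import librosa

CHUNK_SECONDS = 10

def split_into_chunks(path, chunk_seconds=CHUNK_SECONDS):
    """Load one GTZAN clip and cut it into fixed-length chunks."""
    y, sr = librosa.load(path, sr=None)  # keep the native sample rate
    samples_per_chunk = chunk_seconds * sr
    # A 30-second clip yields 3 chunks; a trailing partial chunk is dropped.
    chunks = [
        y[start:start + samples_per_chunk]
        for start in range(0, len(y) - samples_per_chunk + 1, samples_per_chunk)
    ]
    return chunks, sr

chunks, sr = split_into_chunks("genres/blues/blues.00000.wav")  # hypothetical path
```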
The dataset can be downloaded here: http://marsyas.info/downloads/datasets.html
To further increase the amount of data, augmentation was applied to the audio files. For each song chunk, we applied the following (see the sketch after this list):
- Adding light random noise to the waveform
- Adding intense random noise to the waveform
- Randomly raising the pitch (by 2% at most)
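A sketch of these augmentations. It assumes Gaussian noise with hand-picked scales for "light" and "intense" (illustrative values, as is the file path), and converts the 2% pitch increase into semitones since librosa shifts pitch in semitone steps:

```python
import numpy as np
import librosa

def add_noise(y, scale):
    """Mix Gaussian noise into the waveform; scale controls intensity."""
    return y + scale * np.random.randn(len(y))

def random_pitch_up(y, sr, max_increase=0.02):
    """Raise the pitch by a random factor of at most 2%."""
    factor = 1.0 + np.random.uniform(0.0, max_increase)
    n_steps = 12.0 * np.log2(factor)  # librosa pitch-shifts in semitones
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

y, sr = librosa.load("genres/blues/blues.00000.wav", sr=None)  # hypothetical path
light = add_noise(y, scale=0.005)   # "light" noise (illustrative scale)
intense = add_noise(y, scale=0.02)  # "intense" noise (illustrative scale)
shifted = random_pitch_up(y, sr)
```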
To extract the audio features, the librosa library was used. Two representations were produced (sketched after this list):
- Mel-frequency spectrograms as 512x512 grayscale images
- A combination of the Mel-frequency spectrogram, spectral centroid, and spectral contrast, stacked as 512x512 images
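A sketch of the feature extraction with librosa. The number of Mel bands, the dB scaling, the stacking along the frequency axis, and the file path are all assumptions, and the final resizing to 512x512 images is not shown:

```python
import numpy as np
import librosa

def mel_spectrogram_db(y, sr, n_mels=128):
    """Mel-frequency spectrogram on a decibel scale."""
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)

y, sr = librosa.load("genres/jazz/jazz.00000.wav", sr=None)  # hypothetical path
mel = mel_spectrogram_db(y, sr)                              # shape (128, frames)

# Additional per-frame features for the stacked variant.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)    # shape (1, frames)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)    # shape (7, frames)

# One possible stacking: concatenate along the frequency axis, then rescale
# to a 512x512 image before feeding it to the networks (resizing not shown).
stacked = np.vstack([mel, centroid, contrast])
```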
Example of Mel-frequency spectrograms:
Model | Training accuracy | Test accuracy |
---|---|---|
MobileNet V2 (TL) | 77% | 77% |
Inception V3 (TL) | 99% | 84% |
CNN | 55% | 62% |
RNN | 77% | 66% |

(TL = transfer learning)