The official implementation of "DANET: DIFFERENCE-AWARE ATTENTION NETWORK FOR SINGING MELODY EXTRACTION FROM POLYPHONIC MUSIC". We propose a difference-aware attention network (DANet) for melody extraction, which effectively characterizes the fundamental frequency by perceiving and emphasizing the harmonic contour. Experimental results demonstrate the effectiveness of the proposed network.
(i) The critical code has been updated; the rest of the code will be released soon.
Uploaded:
Subsequent updates:
- control group model
- data generation code
- complete code with training and testing
To esteemed readers and reviewers:
- In the manuscript, due to space limitations, some details in Figures 1 and 2 may appear smaller than desired and require enlargement for clear viewing, which could affect the visual experience. We apologize for any inconvenience this may cause.
We also experimented with element-wise subtraction between the outputs of convolutions of different kernel sizes on a speech denoising and dereverberation task, where it seems to make the spectral boundaries of the target speaker more distinct. Overall, this simple operation can make feature boundaries more salient, and it seems to open up more possibilities in the future.
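As a minimal PyTorch sketch of the subtraction idea described above: the outputs of two convolutions with different kernel sizes are subtracted element-wise so that feature boundaries stand out. This is illustrative only, not the exact DANet module (see danet.py for that).

```python
import torch
import torch.nn as nn

class DifferenceBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # both branches keep the same output shape so they can be subtracted
        self.conv_small = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv_large = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

    def forward(self, x):
        # the residue between two receptive-field scales highlights edges,
        # e.g. harmonic contours in a time-frequency representation
        return self.conv_small(x) - self.conv_large(x)

# usage on a dummy (batch, channel, freq, time) tensor:
# out = DifferenceBlock(16)(torch.randn(1, 16, 320, 128))
```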
After downloading the data, use the txt files in the data folder and extract the CFP features with feature_extraction.py.
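For orientation, here is a rough sketch of the CFP idea (power spectrogram, generalized cepstrum, and generalized cepstrum of spectrum), assuming librosa is available; all parameter values are placeholders, and the real pipeline in feature_extraction.py may differ in details such as filtering and the log-frequency mapping.

```python
import numpy as np
import librosa

def cfp_sketch(path, sr=8000, hop=80, n_fft=2048, g1=0.24, g2=0.6):
    y, _ = librosa.load(path, sr=sr)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    # rebuild the symmetric full spectrum so the inverse DFT is real-valued
    full = np.concatenate([S, S[-2:0:-1]], axis=0)
    # generalized cepstrum: power compression, inverse DFT, rectification
    gc = np.maximum(np.real(np.fft.ifft(full ** g1, axis=0)), 0.0)
    # generalized cepstrum of spectrum: compress again and transform back
    gcos = np.maximum(np.real(np.fft.fft(gc ** g2, axis=0)), 0.0)
    # keep the non-redundant halves (frequency bins / lag bins)
    return S, gc[: n_fft // 2 + 1], gcos[: n_fft // 2 + 1]
```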
Note that the label data corresponding to the chosen frame shift should be prepared before feature generation.
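A hedged sketch of such frame-level label generation: sample the annotated (time, f0) ground truth at each analysis frame and quantize voiced f0 to log-frequency bins. The helper name and bin settings here are assumptions, not repository code.

```python
import numpy as np

def f0_to_labels(ann_time, ann_f0, n_frames, frame_shift, fmin=31.0,
                 bins_per_oct=60, n_bins=320):
    frame_time = np.arange(n_frames) * frame_shift
    # nearest annotation for each frame centre (annotations sorted by time)
    idx = np.clip(np.searchsorted(ann_time, frame_time), 0, len(ann_f0) - 1)
    f0 = ann_f0[idx]
    labels = np.zeros(n_frames, dtype=np.int64)   # class 0 = non-melody
    voiced = f0 > 0
    bins = np.round(bins_per_oct * np.log2(f0[voiced] / fmin)).astype(int)
    labels[voiced] = np.clip(bins, 0, n_bins - 1) + 1
    return labels
```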
Refer to the file: danet.py
The visualization illustrates that our proposed DANet reduces octave errors and melody detection errors.
- We adopt a visualization approach to explore which types of errors our model resolves, as shown above. We choose MSNet for comparison due to its structural similarity and popularity. Specifically, we plot the frequencies predicted by DANet and MSNet over time, together with the ground truth, on two opera songs from ADC2004: “opera male3.wav” and “opera male5.wav”. We can observe fewer octave errors (i.e., vertical jumps in the contours inside the red circle) in (a) / (c) than in (b) / (d). Furthermore, there are fewer melody detection errors (i.e., predicting a melody frame as a non-melody one) around 250 ms and 750 ms in (c) than in (d).
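A hedged sketch of how such contour plots can be reproduced with matplotlib; the prediction format and helper name are assumptions, not repository code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_contours(t, ref_f0, est_f0, title):
    # mask unvoiced frames (f0 == 0) so they do not draw as zero-Hz points
    ref = np.where(ref_f0 > 0, ref_f0, np.nan)
    est = np.where(est_f0 > 0, est_f0, np.nan)
    plt.plot(t, ref, 'k.', markersize=3, label='ground truth')
    plt.plot(t, est, 'r.', markersize=3, label='prediction')
    plt.xlabel('time (s)')
    plt.ylabel('frequency (Hz)')
    plt.title(title)
    plt.legend()
    plt.show()
```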
- The first and second pictures show the output of the time-frequency attention module.
- The third and fourth pictures show the output of the calibration fusion module.
- In the first and third pictures, the features emphasize the harmonic and F0 components of the dominant melody, while in the second and fourth pictures the features emphasize the accompaniment and noise components (viewed alternatively, they reversely emphasize the harmonic and F0 components of the dominant melody).
The scores here are either taken from the respective papers or from our own implementations. Experimental results show that our proposed DANet achieves promising performance compared with existing state-of-the-art methods.
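Melody-extraction scores of this kind (VR, VFA, RPA, RCA, OA) are commonly computed with mir_eval; a minimal sketch with toy arrays (0 Hz marks a non-melody frame, and the 440 vs. 220 Hz frame illustrates an octave error that counts for RCA but not RPA):

```python
import numpy as np
import mir_eval

ref_time = np.array([0.00, 0.01, 0.02])
ref_freq = np.array([0.0, 220.0, 221.0])
est_time = np.array([0.00, 0.01, 0.02])
est_freq = np.array([0.0, 219.0, 440.0])

scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
print(scores['Overall Accuracy'], scores['Raw Pitch Accuracy'])
```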
The models in the above table correspond to the following papers and code:
| model | published | paper | code | model | published | paper | code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MCDNN | ISMIR2016 | paper | code | MSNet | ICASSP2019 | paper | code |
| FTANet | ICASSP2021 | paper | code | TONet | ICASSP2022 | paper | code |
| HGNet | ICASSP2022 | paper | - | | | | |
We conducted seven ablations to verify the effectiveness of each design in the proposed network. Due to the page limit, we only reported the ablation study on the ADC2004 dataset in the paper; more detailed results are presented here.
Source of noise database: Microsoft Scalable Noisy Speech Dataset (MS-SNSD)
- noisy data for testing → evaluate the noise immunity and generalization of different models.
The results show that the DSM model and our model are robust to noise.
- noisy data for training → typical data augmentation
- noisy data for training and testing → evaluate how well our model resists noise after being trained in a noisy environment
DANet with noise for train: various types of noise at 0–20 dB were randomly added to the training set only.
DANet with noise for train and test: various types of noise at 0–20 dB were randomly added to both the training and testing sets (see the mixing sketch below).
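A minimal sketch of SNR-controlled noise mixing, assuming the 0–20 dB range above refers to the signal-to-noise ratio and that a noise clip from MS-SNSD has already been loaded and cut to the song length; the actual augmentation details may differ.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # scale the noise so that 10 * log10(P_clean / P_noise) == snr_db
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

# e.g. a random SNR in the 0-20 dB range used here:
# noisy = mix_at_snr(clean, noise, np.random.uniform(0, 20))
```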
Refer to the contents of the folder: pre-train model.