This repository can be used to align lyrics transcripts with the corresponding audio signals. The audio signals may contain solo singing or singing voice mixed with other instruments. The repository contains a trained deep neural network which performs alignment and singing voice separation jointly. Details about the model, training, and data are described in the associated paper:
Schulze-Forster, K., Doire, C., Richard, G., & Badeau, R. "Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing (2021). doi: 10.1109/TASLP.2021.3091817. A public version is available here.
If you use the model or code, please cite the paper:
```
@article{schulze2021phoneme,
  author={Schulze-Forster, Kilian and Doire, Clement S. J. and Richard, Gaël and Badeau, Roland},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={Phoneme Level Lyrics Alignment and Text-Informed Singing Voice Separation},
  year={2021},
  volume={29},
  number={},
  pages={2382-2395},
  doi={10.1109/TASLP.2021.3091817}
}
```
- Clone the repository:
  ```
  git clone https://github.com/schufo/lyrics-aligner.git
  ```
- Install the conda environment:
  - If you want to run the model on a CPU:
    ```
    conda env create -f environment_cpu.yml
    ```
  - If you want to run the model on a GPU:
    ```
    conda env create -f environment_gpu.yml
    ```

Remember to activate the conda environment before running the scripts below.
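For example (the environment name is defined in the respective .yml file, so the name below is only a placeholder):

```
conda activate <ENV_NAME_FROM_YML>
```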
Please prepare one directory containing all audio files. The audio files are loaded with librosa, so all formats supported by librosa can be used, for example .wav and .mp3. See the librosa documentation for more details.
Please prepare a separate directory with all lyrics files in .txt-format. Each lyrics file must have the same name as the corresponding audio file (e.g. song1.wav --> song1.txt).
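For example, the two directories could be organized like this (the directory and file names are placeholders):

```
my_dataset/
├── audio/
│   ├── song1.wav
│   └── song2.mp3
└── lyrics/
    ├── song1.txt
    └── song2.txt
```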
You can provide the lyrics as words or as phonemes.
If your lyrics are already decomposed into phonemes, please consider the following:
- We support only the 39 phonemes in ARPAbet notation listed on the website of the CMU Pronouncing Dictionary.
- The provided .txt-file should contain one phoneme per line.
- The first and the last symbol should be the space character `>`. It should also be placed between words and at positions where silence between phonemes is expected in the singing voice signal.
- In this case, only phoneme onsets (no word onsets) can be computed. See the example below.
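For example, a phoneme lyrics file for the lyrics "hello world" could look as follows (one symbol per line; the ARPAbet decomposition is taken from the CMU Pronouncing Dictionary and is only meant as an illustration):

```
>
HH
AH
L
OW
>
W
ER
L
D
>
```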
If the lyrics are provided as words, they must be processed as follows before they can be used with the alignment model:

1. Generate a .txt-file with a list of unique words:
   ```
   python make_word_list.py PATH/TO/LYRICS/DIRECTORY --dataset-name NAME
   ```
   The `--dataset-name` flag is optional. It can be used if several datasets should be aligned with this model. The output files will contain the dataset name, which defaults to 'dataset1'. This command generates the files `NAME_word_list.txt` and `NAME_word2phoneme.txt` in the `files` directory.
2. Go to http://www.speech.cs.cmu.edu/tools/lextool.html, upload `NAME_word_list.txt` as the word file, and click COMPILE.
3. Click on the link to see the list of output files. Then, click on the .dict-file. You should now see a list of all words with their corresponding phoneme decomposition.
4. Copy the whole list and paste it into `NAME_word2phoneme.txt` in the `files` directory (see the illustrative entries after this list).
5. Run the following command:
   ```
   python make_word2phoneme_dict.py --dataset-name NAME
   ```
   Use the same dataset name as in step 1. This will generate a Python dictionary that translates each word into phonemes and save it as `NAME_word2phonemes.pickle` in `files`.
6. Done!
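To give an idea of the expected content, the entries pasted into `NAME_word2phoneme.txt` from the .dict-file look roughly like this (illustrative entries; the exact spacing depends on the tool's output):

```
HELLO	HH AH L OW
WORLD	W ER L D
```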
The model has been trained on the MUSDB18 dataset using the lyrics extension. Therefore, it will probably work best with similar music. However, we also found it works well on solo singing. Some errors can be expected in challenging mixtures with long instrumental sections.
You can compute phoneme onsets and/or word onsets as follows:
```
python align.py PATH/TO/AUDIO/DIRECTORY PATH/TO/LYRICS/DIRECTORY \
    --lyrics-format w --onsets p --dataset-name dataset1 --vad-threshold 0
```
Optional flags (defaults are shown above):

- `--lyrics-format`: Must be `w` if the lyrics are provided as words (and have been processed as described above) and `p` if the lyrics are provided as phonemes.
- `--onsets`: Set to `p` to compute phoneme onsets, `w` to compute word onsets, or `pw` to compute both (the latter is only possible if the lyrics are provided as words).
- `--dataset-name`: Should be the same name as used for the data preparation above.
- `--vad-threshold`: The model also computes an estimate of the isolated singing voice, which can be used as a Voice Activity Detector (VAD). This may be useful in challenging scenarios where the singer makes long pauses while instruments are playing (e.g. intro, soli, outro). The magnitude of the vocals estimate is computed, and this threshold (a float) discriminates between active and inactive voice given the magnitude. The default is 0, which means that no VAD is used. The optimal value for a given audio signal may be difficult to determine as it depends on the loudness of the voice. In our experiments we used values between 0 and 30. You could print or plot the voice magnitude (computed in line 235) to get an intuition for an appropriate value. We recommend using this option only if large errors occur on audio files with long instrumental sections.
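As a rough illustration of how such a threshold could be inspected, here is a minimal plotting sketch. It assumes you have exported the per-frame vocals magnitude to a NumPy file yourself (e.g. by adding a `numpy.save` call at the point where it is computed); the file name `voice_magnitude.npy` and the array are assumptions, not part of the repository:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical: per-frame magnitude of the vocals estimate, saved manually from the alignment script.
voice_mag = np.load("voice_magnitude.npy")

threshold = 15.0  # candidate value for --vad-threshold

plt.plot(voice_mag, label="vocals magnitude")
plt.axhline(threshold, color="r", linestyle="--", label=f"threshold = {threshold}")
plt.xlabel("frame")
plt.ylabel("magnitude")
plt.legend()
plt.show()
```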
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
Copyright 2021 Kilian Schulze-Forster of Télécom Paris, Institut Polytechnique de Paris. All rights reserved.