VERSA (Versatile Evaluation of Speech and Audio) is a toolkit that collects evaluation metrics for speech and audio quality. Our goal is to provide comprehensive access to cutting-edge evaluation techniques. The toolkit is also tightly integrated into ESPnet.
Colab Demonstration at Interspeech2024 Tutorial
The base installation is as simple as:
git clone https://github.com/shinjiwlab/versa.git
cd versa
pip install .
VERSA is a collection toolkit: rather than re-distributing models, we align as closely as possible with the original API provided by each algorithm's developers. As a result, the toolkit has many dependencies. We include as many as possible by default, but some metrics need specific installation steps. Please refer to the list-of-metrics section to see whether a metric is installed automatically. If it is not, we provide an installation guide or an installer in tools.
# test the base installation
python versa/test/test_general.py
# test metrics with additional installation
python versa/test/test_{metric}.py
Simple usage case for a few samples.
# direct usage
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1 \
--pred test/test_samples/test2 \
--output_file test_result
# with scp-style input
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result
# with kaldi-ark style
python versa/bin/scorer.py \
--score_config egs/speech.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--io kaldi
# For text information
python versa/bin/scorer.py \
--score_config egs/separate_metrics/wer.yaml \
--gt test/test_samples/test1.scp \
--pred test/test_samples/test2.scp \
--output_file test_result \
--text test/test_samples/text
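The scp-style inputs above can be generated from a directory of audio files. The following is a minimal sketch, assuming the Kaldi convention of one `<utt_id> <path>` pair per line; the helper name `write_scp` and the choice of the file stem as utterance ID are illustrative assumptions, not part of VERSA.

```python
from pathlib import Path


def write_scp(wav_dir, scp_path, pattern="*.wav"):
    """Write one '<utt_id> <absolute_path>' line per matching audio file.

    Hypothetical helper: assumes Kaldi-style scp lines and uses the file
    name without its extension as the utterance ID.
    """
    wav_dir = Path(wav_dir)
    entries = []
    # Sort so the output order is deterministic across runs.
    for wav in sorted(wav_dir.glob(pattern)):
        entries.append(f"{wav.stem} {wav.resolve()}")
    # End the file with a newline when there is at least one entry.
    Path(scp_path).write_text("\n".join(entries) + ("\n" if entries else ""))
    return len(entries)
```

Under this assumption, the ground-truth and predicted scp files would typically use the same utterance IDs so that the pairs can be matched.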
Use the launcher with Slurm job submissions
# use the launcher
# Option1: with gt speech
./launch.sh \
<pred_speech_scp> \
<gt_speech_scp> \
<score_dir> \
<split_job_num>
# Option2: without gt speech
./launch.sh \
<pred_speech_scp> \
None \
<score_dir> \
<split_job_num>
# aggregate the results
cat <score_dir>/result/*.result.cpu.txt > <score_dir>/utt_result.cpu.txt
cat <score_dir>/result/*.result.gpu.txt > <score_dir>/utt_result.gpu.txt
# show result
python scripts/show_result.py <score_dir>/utt_result.cpu.txt
python scripts/show_result.py <score_dir>/utt_result.gpu.txt
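Where a shell is not available, the aggregation step can be done in Python. This is a minimal sketch that mirrors the `cat` commands above, assuming the same `<score_dir>/result/*.result.<device>.txt` layout; the function name `aggregate_results` is hypothetical, and the files are concatenated as plain text without interpreting their contents.

```python
from pathlib import Path


def aggregate_results(score_dir, device="cpu"):
    """Concatenate per-job result files into utt_result.<device>.txt.

    Hypothetical helper: copies each part verbatim, in sorted file-name
    order, and returns the output path and the number of parts merged.
    """
    score_dir = Path(score_dir)
    parts = sorted((score_dir / "result").glob(f"*.result.{device}.txt"))
    out_path = score_dir / f"utt_result.{device}.txt"
    with out_path.open("w") as out:
        for part in parts:
            out.write(part.read_text())
    return out_path, len(parts)
```

The resulting file can then be passed to `scripts/show_result.py` as in the commands above.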
See egs/*.yaml for configs covering different setups.
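To score the same data against several configs, one scorer command can be built per YAML file. The sketch below uses only the `scorer.py` flags shown in the usage examples; the function name `build_scorer_commands` and the `result_<config>` output naming are illustrative assumptions.

```python
from pathlib import Path


def build_scorer_commands(config_dir, gt, pred, output_dir):
    """Return one scorer.py argument list per YAML config in config_dir.

    Hypothetical helper: it only builds the commands and does not run them.
    """
    commands = []
    for config in sorted(Path(config_dir).glob("*.yaml")):
        commands.append([
            "python", "versa/bin/scorer.py",
            "--score_config", str(config),
            "--gt", gt,
            "--pred", pred,
            # One output file per config, named after the config (an assumption).
            "--output_file", str(Path(output_dir) / f"result_{config.stem}"),
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root. Some configs may need extra inputs such as `--text`, so treat this as a starting point rather than a complete driver.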
An x in the Auto-Install column marks a metric that is installed automatically with VERSA.
Number | Auto-Install | Metric Name (Auto-Install) | Key in config | Key in report | Code Source | References |
---|---|---|---|---|---|---|
1 | x | Mel Cepstral Distortion (MCD) | mcd_f0 | mcd | espnet and s3prl-vc | paper |
2 | x | F0 Correlation | mcd_f0 | f0_corr | espnet and s3prl-vc | paper |
3 | x | F0 Root Mean Square Error | mcd_f0 | f0_rmse | espnet and s3prl-vc | paper |
4 | x | Signal-to-interference Ratio (SIR) | signal_metric | sir | espnet | - |
5 | x | Signal-to-artifact Ratio (SAR) | signal_metric | sar | espnet | - |
6 | x | Signal-to-distortion Ratio (SDR) | signal_metric | sdr | espnet | - |
7 | x | Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) | signal_metric | ci-sdr | ci_sdr | paper |
8 | x | Scale-invariant signal-to-noise ratio (SI-SNR) | signal_metric | si-snr | espnet | paper |
9 | x | Perceptual Evaluation of Speech Quality (PESQ) | pesq | pesq | pesq | paper |
10 | x | Short-Time Objective Intelligibility (STOI) | stoi | stoi | pystoi | paper |
11 | x | Speech BERT Score | discrete_speech | speech_bert | discrete speech metric | paper |
12 | x | Discrete Speech BLEU Score | discrete_speech | speech_belu | discrete speech metric | paper |
13 | x | Discrete Speech Token Edit Distance | discrete_speech | speech_token_distance | discrete speech metric | paper |
14 | x | UTokyo-SaruLab System for VoiceMOS Challenge 2022 (UTMOS) | pseudo_mos | utmos | speechmos | paper |
15 | x | Deep Noise Suppression MOS Score of P.835 (DNSMOS) | pseudo_mos | dnsmos_overall | speechmos (MS) | paper |
16 | x | Deep Noise Suppression MOS Score of P.808 (DNSMOS) | pseudo_mos | dnsmos_p808 | speechmos (MS) | paper |
17 | x | Packet Loss Concealment-related MOS Score (PLCMOS) | pseudo_mos | plcmos | speechmos (MS) | paper |
18 | | Virtual Speech Quality Objective Listener (VISQOL) | visqol | visqol | google-visqol | paper |
19 | x | Speaker Embedding Similarity | speaker | spk_similarity | espnet | paper |
20 | x | PESQ in TorchAudio-Squim | squim_no_ref | torch_squim_pesq | torch_squim | paper |
21 | x | STOI in TorchAudio-Squim | squim_no_ref | torch_squim_stoi | torch_squim | paper |
22 | x | SI-SDR in TorchAudio-Squim | squim_no_ref | torch_squim_si_sdr | torch_squim | paper |
23 | x | MOS in TorchAudio-Squim | squim_ref | torch_squim_mos | torch_squim | paper |
24 | x | Singing voice MOS | singmos | singmos | singmos | paper |
25 | x | Log-Weighted Mean Square Error | log_wmse | log_wmse | log_wmse | |
26 | | Dynamic Time Warping Cost Metric | warpq | warpq | WARP-Q | paper |
27 | x | Sheet SSQA MOS Models | sheet_ssqa | sheet_ssqa | Sheet | paper |
28 | x | ESPnet Speech Recognition-based Error Rate | espnet_wer | espnet_wer | ESPnet | paper |
29 | x | ESPnet-OWSM Speech Recognition-based Error Rate | owsm_wer | owsm_wer | ESPnet | paper |
30 | x | OpenAI-Whisper Speech Recognition-based Error Rate | whisper_wer | whisper_wer | Whisper | paper |
31 | | UTMOSv2: UTokyo-SaruLab MOS Prediction System | utmosv2 | utmosv2 | UTMOSv2 | paper |
32 | | Speech Contrastive Regression for Quality Assessment with reference (ScoreQ) | scoreq_ref | scoreq_ref | ScoreQ | paper |
33 | | Speech Contrastive Regression for Quality Assessment without reference (ScoreQ) | scoreq_nr | scoreq_nr | ScoreQ | paper |
34 | | Emotion2vec similarity (emo2vec) | emo2vec_similarity | emotion_similarity | emo2vec | paper |
35 | x | Speech enhancement-based SI-SNR | se_snr | se_si_snr | ESPnet | |
36 | x | Speech enhancement-based CI-SDR | se_snr | se_ci_sdr | ESPnet | |
37 | x | Speech enhancement-based SAR | se_snr | se_sar | ESPnet | |
38 | x | Speech enhancement-based SDR | se_snr | se_sdr | ESPnet | |
39 | | NOMAD: Unsupervised Learning of Perceptual Embeddings For Speech Enhancement and Non-Matching Reference Audio Quality Assessment | nomad | nomad | Nomad | paper |
40 | | Frechet Audio Distance (FAD) | fad | fad | fadtk | paper |
41 | | Contrastive Language-Audio Pretraining Score (CLAP Score) | clap_score | clap_score | fadtk | paper |
42 | | Audio Density and Coverage Score | audio_density_coverage | audio_density_coverage | Sony-audio-metrics | paper |
43 | | Accompaniment Prompt Adherence (APA) | apa | apa | Sony-audio-metrics | paper |
44 | | Kullback-Leibler Divergence on Embedding Distribution | kl_embedding | kl_embedding | Stability-AI | |
45 | x | PAM: Prompting Audio-Language Models for Audio Quality Assessment | pam | pam | PAM | Paper |
46 | | Frequency-Weighted SEGmental SNR (FWSEGSNR) | pysepm | pysepm_fwsegsnr | pysepm | Paper |
47 | | Log Likelihood Ratio (LLR) | pysepm | pysepm_llr | pysepm | Paper |
48 | | Weighted Spectral Slope (WSS) | pysepm | pysepm_wss | pysepm | Paper |
49 | | Cepstrum Distance Objective Speech Quality Measure (CD) | pysepm | pysepm_cd | pysepm | Paper |
50 | | Composite Objective Speech Quality (composite) | pysepm | pysepm_Csig, pysepm_Cbak, pysepm_Covl | pysepm | Paper |
51 | | Coherence and Speech Intelligibility Index (CSII) | pysepm | pysepm_csii_high, pysepm_csii_mid, pysepm_csii_low | pysepm | Paper |
52 | | Normalized-Covariance Measure (NCM) | pysepm | pysepm_ncm | pysepm | Paper |
53 | | Speech-to-Reverberation Modulation energy Ratio (SRMR) | srmr | srmr | SRMRpy | Paper |
54 | | Voice Activity Detection (VAD) | vad | vad_info | SileroVAD | |
55 | x | AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks | asvspoof_score | asvspoof_score | AASIST | Paper |
56 | | NORESQA: A Framework for Speech Quality Assessment using Non-Matching References | noresqa | noresqa | Noresqa | Paper |
57 | | KID: Kernel Distance Metric for Audio/Music Quality | kid | kid | KID | Paper |
A few more metrics are being verified or are in progress.
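As the table shows, one config key can produce several report keys (for example, `mcd_f0` yields `mcd`, `f0_corr`, and `f0_rmse`). The sketch below encodes an excerpt of the table as a lookup and finds which config keys are needed for a set of report keys. The dictionary covers only a few rows, and the names `CONFIG_TO_REPORT_KEYS` and `configs_for` are hypothetical, not part of VERSA.

```python
# Excerpt of the metric table: config key -> report keys it produces.
CONFIG_TO_REPORT_KEYS = {
    "mcd_f0": ["mcd", "f0_corr", "f0_rmse"],
    "signal_metric": ["sir", "sar", "sdr", "ci-sdr", "si-snr"],
    "pesq": ["pesq"],
    "stoi": ["stoi"],
    "pseudo_mos": ["utmos", "dnsmos_overall", "dnsmos_p808", "plcmos"],
    "se_snr": ["se_si_snr", "se_ci_sdr", "se_sar", "se_sdr"],
}

# Reverse lookup: report key -> config key that produces it.
REPORT_TO_CONFIG = {
    report: config
    for config, reports in CONFIG_TO_REPORT_KEYS.items()
    for report in reports
}


def configs_for(report_keys):
    """Return the config keys needed for the requested report keys.

    Preserves first-seen order and skips duplicates. Raises KeyError for
    report keys that are not in this excerpt of the table.
    """
    needed = []
    for key in report_keys:
        config_key = REPORT_TO_CONFIG[key]
        if config_key not in needed:
            needed.append(config_key)
    return needed
```

For instance, requesting `f0_corr`, `mcd`, and `sdr` only requires the `mcd_f0` and `signal_metric` entries.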
We sincerely thank the authors of all the open-source implementations listed in https://github.com/shinjiwlab/versa/tree/main#list-of-metrics