# VERSA

VERSA (Versatile Evaluation of Speech and Audio) is a toolkit for collecting evaluation metrics of speech and audio quality. Our goal is to provide a comprehensive connection to the cutting-edge techniques developed for evaluation. The toolkit is also tightly integrated with ESPnet.

## Colab Demonstration

Colab Demonstration at the Interspeech 2024 Tutorial

## Install

The base installation is straightforward:

```sh
git clone https://github.com/shinjiwlab/versa.git
cd versa
pip install .
```

For metric collection, rather than re-distributing models, VERSA aims to stay as close as possible to the original APIs provided by each algorithm's developers. As a result, the toolkit has many dependencies. We include as many as possible by default, but some metrics have specific installation requirements. Please refer to the List of Metrics section below for details on whether a metric is automatically included. If it is not, we provide an installation guide or installers in `tools`.
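As a concrete illustration, installing an optional metric typically means running its installer from `tools` and then re-running the corresponding test. This is a minimal sketch; `install_{metric}.sh` is a placeholder, so substitute the actual installer name listed for the metric you need:

```sh
# hypothetical example: install an optional metric's dependencies via its
# installer in tools/ (replace {metric} with the actual metric name)
cd tools
bash install_{metric}.sh

# then verify the metric-specific test passes (see Quick test below)
cd ..
python versa/test/test_{metric}.py
```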

## Quick test

```sh
python versa/test/test_general.py

# test metrics that require additional installation
python versa/test/test_{metric}.py
```

## Usage

A simple usage example for a few samples:

```sh
# direct usage
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt test/test_samples/test1 \
    --pred test/test_samples/test2 \
    --output_file test_result

# with scp-style input
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt test/test_samples/test1.scp \
    --pred test/test_samples/test2.scp \
    --output_file test_result

# with kaldi-ark style input
python versa/bin/scorer.py \
    --score_config egs/speech.yaml \
    --gt test/test_samples/test1.scp \
    --pred test/test_samples/test2.scp \
    --output_file test_result \
    --io kaldi

# for text information
python versa/bin/scorer.py \
    --score_config egs/separate_metrics/wer.yaml \
    --gt test/test_samples/test1.scp \
    --pred test/test_samples/test2.scp \
    --output_file test_result \
    --text test/test_samples/text
```
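The `.scp` inputs follow the standard Kaldi script-file convention: each line maps an utterance ID to an audio path, and ground-truth and predicted utterances are paired by these IDs. A minimal sketch with illustrative paths:

```
# test2.scp (illustrative): <utterance-id> <path-to-audio>
utt1 /path/to/pred/utt1.wav
utt2 /path/to/pred/utt2.wav
```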

Use the launcher with Slurm job submission:

```sh
# use the launcher
# Option 1: with gt speech
./launch.sh \
  <pred_speech_scp> \
  <gt_speech_scp> \
  <score_dir> \
  <split_job_num>

# Option 2: without gt speech
./launch.sh \
  <pred_speech_scp> \
  None \
  <score_dir> \
  <split_job_num>

# aggregate the results
cat <score_dir>/result/*.result.cpu.txt > <score_dir>/utt_result.cpu.txt
cat <score_dir>/result/*.result.gpu.txt > <score_dir>/utt_result.gpu.txt

# show results
python scripts/show_result.py <score_dir>/utt_result.cpu.txt
python scripts/show_result.py <score_dir>/utt_result.gpu.txt
```

See `egs/*.yaml` for configurations covering different setups.
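If you need a custom metric set, a scoring config can be assembled from the "Key in Config" column in the List of Metrics below. The following is a minimal sketch under the assumption that, as in the bundled `egs/*.yaml` files, a config is a YAML list of metric entries keyed by `name`; check `egs/speech.yaml` for the authoritative format and per-metric options:

```sh
# a minimal sketch (config format assumed; verify against egs/speech.yaml)
cat > my_metrics.yaml <<'EOF'
- name: pesq
- name: stoi
- name: signal_metric
EOF

python versa/bin/scorer.py \
    --score_config my_metrics.yaml \
    --gt test/test_samples/test1 \
    --pred test/test_samples/test2 \
    --output_file test_result
```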

## List of Metrics

An "x" in the Auto-Install column marks metrics that are installed automatically with VERSA.

| Number | Auto-Install | Metric Name | Key in Config | Key in Report | Code Source | References |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | x | Mel Cepstral Distortion (MCD) | mcd_f0 | mcd | espnet and s3prl-vc | paper |
| 2 | x | F0 Correlation | mcd_f0 | f0_corr | espnet and s3prl-vc | paper |
| 3 | x | F0 Root Mean Square Error | mcd_f0 | f0_rmse | espnet and s3prl-vc | paper |
| 4 | x | Signal-to-interference Ratio (SIR) | signal_metric | sir | espnet | - |
| 5 | x | Signal-to-artifact Ratio (SAR) | signal_metric | sar | espnet | - |
| 6 | x | Signal-to-distortion Ratio (SDR) | signal_metric | sdr | espnet | - |
| 7 | x | Convolutional scale-invariant signal-to-distortion ratio (CI-SDR) | signal_metric | ci-sdr | ci_sdr | paper |
| 8 | x | Scale-invariant signal-to-noise ratio (SI-SNR) | signal_metric | si-snr | espnet | paper |
| 9 | x | Perceptual Evaluation of Speech Quality (PESQ) | pesq | pesq | pesq | paper |
| 10 | x | Short-Time Objective Intelligibility (STOI) | stoi | stoi | pystoi | paper |
| 11 | x | Speech BERT Score | discrete_speech | speech_bert | discrete speech metric | paper |
| 12 | x | Discrete Speech BLEU Score | discrete_speech | speech_belu | discrete speech metric | paper |
| 13 | x | Discrete Speech Token Edit Distance | discrete_speech | speech_token_distance | discrete speech metric | paper |
| 14 | x | UTokyo-SaruLab System for VoiceMOS Challenge 2022 (UTMOS) | pseudo_mos | utmos | speechmos | paper |
| 15 | x | Deep Noise Suppression MOS Score of P.835 (DNSMOS) | pseudo_mos | dnsmos_overall | speechmos (MS) | paper |
| 16 | x | Deep Noise Suppression MOS Score of P.808 (DNSMOS) | pseudo_mos | dnsmos_p808 | speechmos (MS) | paper |
| 17 | x | Packet Loss Concealment-related MOS Score (PLCMOS) | pseudo_mos | plcmos | speechmos (MS) | paper |
| 18 |  | Virtual Speech Quality Objective Listener (VISQOL) | visqol | visqol | google-visqol | paper |
| 19 | x | Speaker Embedding Similarity | speaker | spk_similarity | espnet | paper |
| 20 | x | PESQ in TorchAudio-Squim | squim_no_ref | torch_squim_pesq | torch_squim | paper |
| 21 | x | STOI in TorchAudio-Squim | squim_no_ref | torch_squim_stoi | torch_squim | paper |
| 22 | x | SI-SDR in TorchAudio-Squim | squim_no_ref | torch_squim_si_sdr | torch_squim | paper |
| 23 | x | MOS in TorchAudio-Squim | squim_ref | torch_squim_mos | torch_squim | paper |
| 24 | x | Singing voice MOS | singmos | singmos | singmos | paper |
| 25 | x | Log-Weighted Mean Square Error | log_wmse | log_wmse | log_wmse | - |
| 26 |  | Dynamic Time Warping Cost Metric | warpq | warpq | WARP-Q | paper |
| 27 | x | Sheet SSQA MOS Models | sheet_ssqa | sheet_ssqa | Sheet | paper |
| 28 | x | ESPnet Speech Recognition-based Error Rate | espnet_wer | espnet_wer | ESPnet | paper |
| 29 | x | ESPnet-OWSM Speech Recognition-based Error Rate | owsm_wer | owsm_wer | ESPnet | paper |
| 30 | x | OpenAI-Whisper Speech Recognition-based Error Rate | whisper_wer | whisper_wer | Whisper | paper |
| 31 |  | UTMOSv2: UTokyo-SaruLab MOS Prediction System | utmosv2 | utmosv2 | UTMOSv2 | paper |
| 32 |  | Speech Contrastive Regression for Quality Assessment with reference (ScoreQ) | scoreq_ref | scoreq_ref | ScoreQ | paper |
| 33 |  | Speech Contrastive Regression for Quality Assessment without reference (ScoreQ) | scoreq_nr | scoreq_nr | ScoreQ | paper |
| 34 |  | Emotion2vec similarity (emo2vec) | emo2vec_similarity | emotion_similarity | emo2vec | paper |
| 35 | x | Speech enhancement-based SI-SNR | se_snr | se_si_snr | ESPnet | - |
| 36 | x | Speech enhancement-based CI-SDR | se_snr | se_ci_sdr | ESPnet | - |
| 37 | x | Speech enhancement-based SAR | se_snr | se_sar | ESPnet | - |
| 38 | x | Speech enhancement-based SDR | se_snr | se_sdr | ESPnet | - |
| 39 |  | NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-Matching Reference Audio Quality Assessment | nomad | nomad | Nomad | paper |
| 40 |  | Frechet Audio Distance (FAD) | fad | fad | fadtk | paper |
| 41 |  | Contrastive Language-Audio Pretraining Score (CLAP Score) | clap_score | clap_score | fadtk | paper |
| 42 |  | Audio Density and Coverage Score | audio_density_coverage | audio_density_coverage | Sony-audio-metrics | paper |
| 43 |  | Accompaniment Prompt Adherence (APA) | apa | apa | Sony-audio-metrics | paper |
| 44 |  | Kullback-Leibler Divergence on Embedding Distribution | kl_embedding | kl_embedding | Stability-AI | - |
| 45 | x | PAM: Prompting Audio-Language Models for Audio Quality Assessment | pam | pam | PAM | paper |
| 46 |  | Frequency-Weighted SEGmental SNR (FWSEGSNR) | pysepm | pysepm_fwsegsnr | pysepm | paper |
| 47 |  | Log Likelihood Ratio (LLR) | pysepm | pysepm_llr | pysepm | paper |
| 48 |  | Weighted Spectral Slope (WSS) | pysepm | pysepm_wss | pysepm | paper |
| 49 |  | Cepstrum Distance Objective Speech Quality Measure (CD) | pysepm | pysepm_cd | pysepm | paper |
| 50 |  | Composite Objective Speech Quality (composite) | pysepm | pysepm_Csig, pysepm_Cbak, pysepm_Covl | pysepm | paper |
| 51 |  | Coherence and Speech Intelligibility Index (CSII) | pysepm | pysepm_csii_high, pysepm_csii_mid, pysepm_csii_low | pysepm | paper |
| 52 |  | Normalized-Covariance Measure (NCM) | pysepm | pysepm_ncm | pysepm | paper |
| 53 |  | Speech-to-Reverberation Modulation energy Ratio (SRMR) | srmr | srmr | SRMRpy | paper |
| 54 |  | Voice Activity Detection (VAD) | vad | vad_info | SileroVAD | - |
| 55 | x | AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks | asvspoof_score | asvspoof_score | AASIST | paper |
| 56 |  | NORESQA: A Framework for Speech Quality Assessment using Non-Matching References | noresqa | noresqa | Noresqa | paper |
| 57 |  | KID: Kernel Distance Metric for Audio/Music Quality | kid | kid | KID | paper |
A few more metrics are under verification and in progress.
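Note that a single "Key in Config" can produce several "Key in Report" entries: per the table above, for example, `pseudo_mos` covers `utmos`, `dnsmos_overall`, `dnsmos_p808`, and `plcmos`, and `se_snr` covers the four enhancement-based signal ratios. A minimal sketch, with the config format assumed from the bundled `egs/*.yaml` files as above:

```sh
# one config entry ("pseudo_mos"), several report keys
# (utmos, dnsmos_overall, dnsmos_p808, plcmos)
cat > mos_only.yaml <<'EOF'
- name: pseudo_mos
EOF

python versa/bin/scorer.py \
    --score_config mos_only.yaml \
    --gt test/test_samples/test1 \
    --pred test/test_samples/test2 \
    --output_file mos_result
```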

## Acknowledgement

We sincerely thank the developers of all the open-source implementations listed in https://github.com/shinjiwlab/versa/tree/main#list-of-metrics