diff --git a/joss.05403/10.21105.joss.05403.crossref.xml b/joss.05403/10.21105.joss.05403.crossref.xml new file mode 100644 index 0000000000..d63d5846f3 --- /dev/null +++ b/joss.05403/10.21105.joss.05403.crossref.xml @@ -0,0 +1,572 @@ + + + + 20231120T173622-f5cd4f87feb07d150e64a08e4dfe057a8a799847 + 20231120173622 + + JOSS Admin + admin@theoj.org + + The Open Journal + + + + + Journal of Open Source Software + JOSS + 2475-9066 + + 10.21105/joss + https://joss.theoj.org + + + + + 11 + 2023 + + + 8 + + 91 + + + + Software Design and User Interface of ESPnet-SE++: +Speech Enhancement for Robust Speech Processing + + + + Yen-Ju + Lu + https://orcid.org/0000-0001-8400-4188 + + + Xuankai + Chang + https://orcid.org/0000-0002-5221-5412 + + + Chenda + Li + https://orcid.org/0000-0003-0299-9914 + + + Wangyou + Zhang + https://orcid.org/0000-0003-4500-3515 + + + Samuele + Cornell + https://orcid.org/0000-0002-5358-1844 + + + Zhaoheng + Ni + + + Yoshiki + Masuyama + + + Brian + Yan + + + Robin + Scheibler + https://orcid.org/0000-0002-5205-8365 + + + Zhong-Qiu + Wang + https://orcid.org/0000-0002-4204-9430 + + + Yu + Tsao + https://orcid.org/0000-0001-6956-0418 + + + Yanmin + Qian + https://orcid.org/0000-0002-0314-3790 + + + Shinji + Watanabe + https://orcid.org/0000-0002-5970-8631 + + + + 11 + 20 + 2023 + + + 5403 + + + 10.21105/joss.05403 + + + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + http://creativecommons.org/licenses/by/4.0/ + + + + Software archive + 10.5281/zenodo.10048174 + + + GitHub review issue + https://github.com/openjournals/joss-reviews/issues/5403 + + + + 10.21105/joss.05403 + https://joss.theoj.org/papers/10.21105/joss.05403 + + + https://joss.theoj.org/papers/10.21105/joss.05403.pdf + + + + + + ESPnet-SE: End-to-end speech enhancement and +separation toolkit designed for ASR integration + Li + 2021 IEEE spoken language technology workshop +(SLT) + 10.1109/slt48900.2021.9383615 + 2021 + Li, C., Shi, J., Zhang, W., +Subramanian, A. S., Chang, X., Kamo, N., Hira, M., Hayashi, T., +Boeddeker, C., & Chen, S., Z. Watanabe. (2021). ESPnet-SE: +End-to-end speech enhancement and separation toolkit designed for ASR +integration. 2021 IEEE Spoken Language Technology Workshop (SLT), +785–792. +https://doi.org/10.1109/slt48900.2021.9383615 + + + Deep clustering: Discriminative embeddings +for segmentation and separation + Hershey + 2016 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp.2016.7471631 + 2016 + Hershey, J. R., Chen, Z., Le Roux, +J., & Watanabe, S. (2016). Deep clustering: Discriminative +embeddings for segmentation and separation. 2016 IEEE International +Conference on Acoustics, Speech and Signal Processing (ICASSP), 31–35. +https://doi.org/10.1109/icassp.2016.7471631 + + + Deep attractor network for single-microphone +speaker separation + Chen + 2017 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp.2017.7952155 + 2017 + Chen, Z., Luo, Y., & Mesgarani, +N. (2017). Deep attractor network for single-microphone speaker +separation. 2017 IEEE International Conference on Acoustics, Speech and +Signal Processing (ICASSP), 246–250. +https://doi.org/10.1109/icassp.2017.7952155 + + + DCCRN: Deep complex convolution recurrent +network for phase-aware speech enhancement + Hu + Proceedings of interspeech + 10.21437/interspeech.2020-2537 + 2020 + Hu, Y., Liu, Y., Lv, S., Xing, M., +Zhang, S., Fu, Y., Wu, J., Zhang, B., & Xie, L. (2020). 
DCCRN: Deep +complex convolution recurrent network for phase-aware speech +enhancement. Proceedings of Interspeech, 2472–2476. +https://doi.org/10.21437/interspeech.2020-2537 + + + Deep learning based real-time speech +enhancement for dual-microphone mobile phones + Tan + IEEE/ACM Transactions on Audio, Speech, and +Language Processing + 29 + 10.1109/taslp.2021.3082318 + 2021 + Tan, K., Zhang, X., & Wang, D. +(2021). Deep learning based real-time speech enhancement for +dual-microphone mobile phones. IEEE/ACM Transactions on Audio, Speech, +and Language Processing, 29, 1853–1863. +https://doi.org/10.1109/taslp.2021.3082318 + + + SkiM: Skipping memory lstm for low-latency +real-time continuous speech separation + Li + 2022 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp43922.2022.9746372 + 2022 + Li, C., Yang, L., Wang, W., & +Qian, Y. (2022). SkiM: Skipping memory lstm for low-latency real-time +continuous speech separation. 2022 IEEE International Conference on +Acoustics, Speech and Signal Processing (ICASSP), 681–685. +https://doi.org/10.1109/icassp43922.2022.9746372 + + + DPT-FSNet: Dual-path transformer based +full-band and sub-band fusion network for speech +enhancement + Dang + 2022 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp43922.2022.9746171 + 2022 + Dang, F., Chen, H., & Zhang, P. +(2022). DPT-FSNet: Dual-path transformer based full-band and sub-band +fusion network for speech enhancement. 2022 IEEE International +Conference on Acoustics, Speech and Signal Processing (ICASSP), +6857–6861. +https://doi.org/10.1109/icassp43922.2022.9746171 + + + Recursive speech separation for unknown +number of speakers + Takahashi + Interspeech 2019 + 10.21437/interspeech.2019-1550 + 2019 + Takahashi, N., Parthasaarathy, S., +Goswami, N., & Mitsufuji, Y. (2019). Recursive speech separation for +unknown number of speakers. Interspeech 2019, 1348–1352. +https://doi.org/10.21437/interspeech.2019-1550 + + + FaSNet: Low-latency adaptive beamforming for +multi-microphone audio processing + Luo + 2019 IEEE automatic speech recognition and +understanding workshop (ASRU) + 10.1109/asru46091.2019.9003849 + 2019 + Luo, Y., Han, C., Mesgarani, N., +Ceolini, E., & Liu, S. (2019). FaSNet: Low-latency adaptive +beamforming for multi-microphone audio processing. 2019 IEEE Automatic +Speech Recognition and Understanding Workshop (ASRU), 260–267. +https://doi.org/10.1109/asru46091.2019.9003849 + + + Towards low-distortion multi-channel speech +enhancement: The ESPNET-se submission to the L3DAS22 +challenge + Lu + 2022 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp43922.2022.9747146 + 2022 + Lu, Y. J., Cornell, S., Chang, X., +Zhang, W., Li, C., Ni, Z., Wang, Z., & Watanabe, S. (2022). Towards +low-distortion multi-channel speech enhancement: The ESPNET-se +submission to the L3DAS22 challenge. 2022 IEEE International Conference +on Acoustics, Speech and Signal Processing (ICASSP), 9201–9205. +https://doi.org/10.1109/icassp43922.2022.9747146 + + + TaSNet: Time-domain audio separation network +for real-time, single-channel speech separation + Luo + 2018 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp.2018.8462116 + 2018 + Luo, Y., & Mesgarani, N. (2018). +TaSNet: Time-domain audio separation network for real-time, +single-channel speech separation. 
2018 IEEE International Conference on +Acoustics, Speech and Signal Processing (ICASSP), 696–700. +https://doi.org/10.1109/icassp.2018.8462116 + + + SDR half-baked or well done? + Le Roux + 2019 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp.2019.8683855 + 2019 + Le Roux, J., Wisdom, S., Erdogan, H., +& Hershey, J. R. (2019). SDR half-baked or well done? 2019 IEEE +International Conference on Acoustics, Speech and Signal Processing +(ICASSP), 626–630. +https://doi.org/10.1109/icassp.2019.8683855 + + + Convolutive transfer function invariant SDR +training criteria for multi-channel reverberant speech +separation + Boeddeker + 2021 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp39728.2021.9414661 + 2021 + Boeddeker, C., Zhang, W., Nakatani, +T., Kinoshita, K., Ochiai, T., Delcroix, M., Kamo, N., Qian, Y., & +Haeb-Umbach, R. (2021). Convolutive transfer function invariant SDR +training criteria for multi-channel reverberant speech separation. 2021 +IEEE International Conference on Acoustics, Speech and Signal Processing +(ICASSP), 8428–8432. +https://doi.org/10.1109/icassp39728.2021.9414661 + + + SDR medium rare with fast +computations + Scheibler + 2022 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp43922.2022.9747473 + 2022 + Scheibler, R. (2022). SDR medium rare +with fast computations. 2022 IEEE International Conference on Acoustics, +Speech and Signal Processing (ICASSP), 701–705. +https://doi.org/10.1109/icassp43922.2022.9747473 + + + ESPnet-SE++: Speech enhancement for robust +speech recognition, translation, and understanding + Lu + Proceedings of interspeech + 10.21437/interspeech.2022-10727 + 2022 + Lu, Y. J., Chang, X., Li, C., Zhang, +W., Cornell, S., Ni, Z., Masuyama, Y., Yan, B., Scheibler, R., Wang, Z. +Q., Tsao, Y., & Qian Y. Watanabe, S. (2022). ESPnet-SE++: Speech +enhancement for robust speech recognition, translation, and +understanding. Proceedings of Interspeech, 5458–5462. +https://doi.org/10.21437/interspeech.2022-10727 + + + ESPnet-TTS: Unified, reproducible, and +integratable open source end-to-end text-to-speech +toolkit + Hayashi + 2020 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp40776.2020.9053512 + 2020 + Hayashi, T., Yamamoto, R., Inoue, K., +Yoshimura, T., Watanabe, S., Toda, T., Takeda, K., & Zhang, X., Y. +Tan. (2020). ESPnet-TTS: Unified, reproducible, and integratable open +source end-to-end text-to-speech toolkit. 2020 IEEE International +Conference on Acoustics, Speech and Signal Processing (ICASSP), +7654–7658. +https://doi.org/10.1109/icassp40776.2020.9053512 + + + ESPnet-ST: All-in-one speech translation +toolkit + Inaguma + Proceedings of the 58th annual meeting of the +association for computational linguistics: System +demonstrations + 10.18653/v1/2020.acl-demos.34 + 2020 + Inaguma, H., Kiyono, S., Duh, K., +Karita, S., Soplin, N. E. Y., Hayashi, T., & Watanabe, S. (2020). +ESPnet-ST: All-in-one speech translation toolkit. Proceedings of the +58th Annual Meeting of the Association for Computational Linguistics: +System Demonstrations, 302–311. 
+https://doi.org/10.18653/v1/2020.acl-demos.34 + + + ESPnet-SLU: Advancing spoken language +understanding through ESPnet + Arora + 2022 IEEE international conference on +acoustics, speech and signal processing (ICASSP) + 10.1109/icassp43922.2022.9747674 + 2022 + Arora, S., Dalmia, S., Denisov, P., +Chang, X., Ueda, Y., Peng, Y., Zhang, Y., Kumar, S., Ganesan, K., & +Yan, W., B. (2022). ESPnet-SLU: Advancing spoken language understanding +through ESPnet. 2022 IEEE International Conference on Acoustics, Speech +and Signal Processing (ICASSP), 7167–7171. +https://doi.org/10.1109/icassp43922.2022.9747674 + + + ESPnet: End-to-end speech processing +toolkit + Watanabe + Proceedings of interspeech + 10.21437/interspeech.2018-1456 + 2018 + Watanabe, S., Hori, T., Karita, S., +Hayashi, T., Nishitoba, J., Unno, Y., Soplin, N. E. Y., Heymann, J., +Wiesner, M., Chen, N., Renduchintala, A., & Ochiai, T. (2018). +ESPnet: End-to-end speech processing toolkit. Proceedings of +Interspeech, 2207–2211. +https://doi.org/10.21437/interspeech.2018-1456 + + + The northwestern university source separation +library. + Manilow + International society for music information +retrieval (ISMIR) + 10.1163/1872-9037_afco_asc_1322 + 2018 + Manilow, E., Seetharaman, P., & +Pardo, B. (2018). The northwestern university source separation library. +International Society for Music Information Retrieval (ISMIR), 297–305. +https://doi.org/10.1163/1872-9037_afco_asc_1322 + + + ONSSEN: An open-source speech separation and +enhancement library + Ni + arXiv preprint +arXiv:1911.00982 + 2019 + Ni, M. I., Zhaoheng Mandel. (2019). +ONSSEN: An open-source speech separation and enhancement library. arXiv +Preprint arXiv:1911.00982. + + + Asteroid: The PyTorch-based audio source +separation toolkit for researchers + Pariente + Proceedings of interspeech + 10.21437/interspeech.2020-1673 + 2020 + Pariente, M., Cornell, S., Cosentino, +J., Sivasankaran, S., Tzinis, E., Heitkaemper, J., Olvera, M., Stöter, +F. R., Hu, M., Martı́n-Doñas, J. M., Ditter, D., Frank, A., Deleforge, +A., & Vincent, E. (2020). Asteroid: The PyTorch-based audio source +separation toolkit for researchers. Proceedings of Interspeech, +2637–2641. +https://doi.org/10.21437/interspeech.2020-1673 + + + SpeechBrain: A general-purpose speech +toolkit + Ravanelli + arXiv preprint +arXiv:2106.04624 + 2021 + Ravanelli, M., Parcollet, T., +Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., Subakan, C., +Dawalatabad, N., Heba, A., Zhong, J., Chou, J. C., Yeh, S. L., Fu, S. +W., Liao, C. F., Rastorgueva, E., Grondin, F., Aris, W., Na, H., Gao, +Y., & Mori R. D. Bengio, Y. (2021). SpeechBrain: A general-purpose +speech toolkit. arXiv Preprint arXiv:2106.04624. + + + The Kaldi speech recognition +toolkit + Povey + IEEE 2011 workshop on automatic speech +recognition and understanding + 10.15199/48.2016.11.70 + 2011 + Povey, D., Ghoshal, A., Boulianne, +G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., +Qian, Y., Schwarz, P., Silovsky´, J., & Stemmer, K., G. Vesely. +(2011). The Kaldi speech recognition toolkit. IEEE 2011 Workshop on +Automatic Speech Recognition and Understanding. +https://doi.org/10.15199/48.2016.11.70 + + + An algorithm for intelligibility prediction +of time–frequency weighted noisy speech + Taal + IEEE Transactions on Audio, Speech, and +Language Processing + 7 + 19 + 10.1109/tasl.2011.2114881 + 2011 + Taal, C. H., Hendriks, R. C., +Heusdens, R., & Jensen, J. (2011). 
An algorithm for intelligibility +prediction of time–frequency weighted noisy speech. IEEE Transactions on +Audio, Speech, and Language Processing, 19(7), 2125–2136. +https://doi.org/10.1109/tasl.2011.2114881 + + + Perceptual evaluation of speech quality +(PESQ)-a new method for speech quality assessment of telephone networks +and codecs + Rix + 2001 IEEE international conference on +acoustics, speech, and signal processing. Proceedings (cat. No. +01CH37221) + 2 + 10.1109/icassp.2001.941023 + 2001 + Rix, A. W., Beerends, J. G., Hollier, +M. P., & Hekstra, A. P. (2001). Perceptual evaluation of speech +quality (PESQ)-a new method for speech quality assessment of telephone +networks and codecs. 2001 IEEE International Conference on Acoustics, +Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221), 2, +749–752. +https://doi.org/10.1109/icassp.2001.941023 + + + XSEDE: Accelerating scientific +discovery + Towns + Computing in Science & +Engineering + 5 + 16 + 10.1109/mcse.2014.80 + 2014 + Towns, J., Cockerill, T., Dahan, M., +Foster, I., Gaither, K., Grimshaw, A., Hazlewood, V., Lathrop, S., +Lifka, D., Peterson, G. D., Roskies, R., Scott, J. R., & +Wilkins-Diehr, N. (2014). XSEDE: Accelerating scientific discovery. +Computing in Science & Engineering, 16(5), 62–74. +https://doi.org/10.1109/mcse.2014.80 + + + Bridges: A uniquely flexible HPC resource for +new communities and data analytics + Nystrom + Proceedings of the 2015 XSEDE conference: +Scientific advancements enabled by enhanced +cyberinfrastructure + 10.1145/2792745.2792775 + 2015 + Nystrom, N. A., Levine, M. J., +Roskies, R. Z., & Scott, J. R. (2015). Bridges: A uniquely flexible +HPC resource for new communities and data analytics. Proceedings of the +2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced +Cyberinfrastructure, 1–8. +https://doi.org/10.1145/2792745.2792775 + + + + + + diff --git a/joss.05403/10.21105.joss.05403.jats b/joss.05403/10.21105.joss.05403.jats new file mode 100644 index 0000000000..264d646e2f --- /dev/null +++ b/joss.05403/10.21105.joss.05403.jats @@ -0,0 +1,1139 @@ + + +
+ + + + +Journal of Open Source Software +JOSS + +2475-9066 + +Open Journals + + + +5403 +10.21105/joss.05403 + +Software Design and User Interface of ESPnet-SE++: Speech +Enhancement for Robust Speech Processing + + + +https://orcid.org/0000-0001-8400-4188 + +Lu +Yen-Ju + + + + +https://orcid.org/0000-0002-5221-5412 + +Chang +Xuankai + + + + +https://orcid.org/0000-0003-0299-9914 + +Li +Chenda + + + + +https://orcid.org/0000-0003-4500-3515 + +Zhang +Wangyou + + + + +https://orcid.org/0000-0002-5358-1844 + +Cornell +Samuele + + + + + + +Ni +Zhaoheng + + + + + +Masuyama +Yoshiki + + + + + + +Yan +Brian + + + + +https://orcid.org/0000-0002-5205-8365 + +Scheibler +Robin + + + + +https://orcid.org/0000-0002-4204-9430 + +Wang +Zhong-Qiu + + + + +https://orcid.org/0000-0001-6956-0418 + +Tsao +Yu + + + + +https://orcid.org/0000-0002-0314-3790 + +Qian +Yanmin + + + + +https://orcid.org/0000-0002-5970-8631 + +Watanabe +Shinji + + +* + + + +Johns Hopkins University, USA + + + + +Carnegie Mellon University, USA + + + + +Shanghai Jiao Tong University, Shanghai + + + + +Universita` Politecnica delle Marche, Italy + + + + +Meta AI, USA + + + + +Tokyo Metropolitan University, Japan + + + + +LINE Corporation, Japan + + + + +Academia Sinica, Taipei + + + + +* E-mail: + + +22 +8 +2022 + +8 +91 +5403 + +Authors of papers retain copyright and release the +work under a Creative Commons Attribution 4.0 International License (CC +BY 4.0) +2022 +The article authors + +Authors of papers retain copyright and release the work under +a Creative Commons Attribution 4.0 International License (CC BY +4.0) + + + +Python +ESPnet +speech processing +speech enhancement + + + + + +

The Joint-task Systems of SSE with ASR, ST, and SLU in + ESPnet-SE++.

+ +
+ + Summary +

This paper presents the software design and user interface of ESPnet-SE++, a new speech separation and enhancement (SSE) module of the ESPnet toolkit. ESPnet-SE++ significantly expands the functionality of ESPnet-SE (Li et al., 2021) with several new models (Chen et al., 2017; Dang et al., 2022; Hershey et al., 2016; Hu et al., 2020; Li et al., 2022; Lu, Cornell, et al., 2022; Luo et al., 2019; Takahashi et al., 2019; Tan et al., 2021), loss functions (Boeddeker et al., 2021; Le Roux et al., 2019; Luo & Mesgarani, 2018; Scheibler, 2022), and training recipes, as described in Lu, Chang, et al. (2022). Crucially, it features a new, redesigned interface that allows for a flexible combination of SSE front-ends with many downstream tasks, including automatic speech recognition (ASR), speaker diarization (SD), speech translation (ST), and spoken language understanding (SLU).

+
+ + Statement of need +

ESPnet (Watanabe et al., 2018) is an open-source toolkit for speech processing that includes recipes for several tasks: ASR, text-to-speech (TTS) (Hayashi et al., 2020), ST (Inaguma et al., 2020), machine translation (MT), SLU (Arora et al., 2022), and SSE. Compared with other open-source SSE toolkits, such as Nussl (Manilow et al., 2018), Onssen (Ni, 2019), Asteroid (Pariente et al., 2020), and SpeechBrain (Ravanelli et al., 2021), the modularized design of ESPnet-SE++ allows for the joint training of SSE modules with other tasks. Currently, ESPnet-SE++ supports 20 SSE recipes with 24 different enhancement/separation models.

+
ESPnet-SE++ Recipes and Software Structure

ESPnet-SE++ Recipes for SSE and Joint-Task

For each task, ESPnet-SE++ follows the ESPnet2 style and provides common scripts that are carefully designed to work out-of-the-box with a wide variety of corpora. The recipes for different corpora are under the egs2/ folder. Under the egs2/TEMPLATE folder, the common scripts enh1/enh.sh and enh_asr1/enh_asr.sh are shared by all the SSE and joint-task recipes. The directory structure can be found in TEMPLATE/enh_asr1/README.md.

+ + Common Scripts +

enh.sh contains 13 stages, and the + details for the scripts can be found in + TEMPLATE/enh1/README.md.

+ +

enh_asr.sh contains 17 stages, and the details for the scripts can be found in TEMPLATE/enh_asr1/README.md. The enh_diar.sh and enh_st.sh scripts follow a similar structure.

+ +
+ + Training Configuration + + SSE Task Training Configuration +

An example configuration for the enhancement task in the CHiME-4 enh1 recipe is conf/tuning/train_enh_dprnn_tasnet.yaml. This file specifies the types of the encoder, decoder, and separator, along with their respective settings. It also defines the training setup and the loss criterions.
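A hypothetical excerpt in the style of such a file (field names follow the ESPnet2 configuration schema, but the values here are illustrative rather than the shipped recipe settings):

encoder: conv
encoder_conf:
    channel: 64
    kernel_size: 2
    stride: 1
separator: dprnn
separator_conf:
    num_spk: 1
decoder: conv
decoder_conf:
    channel: 64
    kernel_size: 2
    stride: 1
criterions:
  # each criterion is paired with a wrapper that handles permutation
  - name: si_snr
    conf: {}
    wrapper: fixed_order
    wrapper_conf: {}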

+
+ + Joint-Task Training Configuration +

An example of a joint-task training configuration is the CHiME-4 enh_asr1 recipe, configured in conf/tuning/train_enh_asr_convtasnet.yaml. This joint task comprises a front-end SSE model and a back-end ASR model. The configuration file includes specifications for the encoder, decoder, separator, and criterions of both the SSE and ASR models, using the prefixes enh_ and asr_.
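For illustration, the prefixed layout might look like the following hypothetical excerpt (not the shipped file; only the key naming convention is the point here):

enh_encoder: conv
enh_separator: tcn
enh_decoder: conv
asr_encoder: transformer
asr_decoder: transformer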

+
+
+
ESPnet-SE++ Software Structure for SSE Task

The directory structure for the SSE Python files can be found in TEMPLATE/enh1/README.md. Additionally, the UML diagram for the enhancement-only task in ESPnet-SE++ is provided below.

+ +

UML Diagram for Speech Separation and Enhancement in + ESPnet-SE++

+ +
SSE Executable Code bin/*

bin/enh_train.py

As the main interface for the SSE training stage of + enh.sh, + enh_train.py takes the training + parameters and model configurations from the arguments and + calls

+ EnhancementTask.main(...) +

to build an SSE object + ESPnetEnhancementModel for training the + SSE model according to the model configuration.
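In fact, the executable is essentially a thin wrapper around this call; a minimal sketch (following the standard ESPnet2 AbsTask entry-point convention):

# sketch of bin/enh_train.py: delegate everything to the task class
from espnet2.tasks.enh import EnhancementTask

def main(cmd=None):
    # parses command-line arguments, builds ESPnetEnhancementModel,
    # and runs the ESPnet2 Trainer loop
    EnhancementTask.main(cmd=cmd)

if __name__ == "__main__":
    main()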

+
+ + bin/enh_inference.py +

The inference function in + enh_inference.py creates a

+ class SeparateSpeech +

object with the data-iterator for testing and validation. During its initialization, this class instantiates an SSE object ESPnetEnhancementModel based on a configuration file and a pre-trained SSE model.
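For example (a sketch; the constructor argument names vary slightly across ESPnet versions, and the paths are placeholders):

import soundfile as sf
from espnet2.bin.enh_inference import SeparateSpeech

separate_speech = SeparateSpeech(
    train_config="exp/enh_train/config.yaml",
    model_file="exp/enh_train/valid.loss.best.pth",
)
mixwav, fs = sf.read("mixture.wav")
# input shape (batch, n_samples); returns one waveform per separated source
waves = separate_speech(mixwav[None, :], fs=fs)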

+
+ + bin/enh_scoring.py + def scoring(..., ref_scp, inf_scp, ...) +

The SSE scoring function calculates several popular objective measures, such as SI-SDR (Le Roux et al., 2019), STOI (Taal et al., 2011), SDR, and PESQ (Rix et al., 2001), from pairs of reference signals and processed speech.
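As an illustration of one of these measures, SI-SDR can be computed from a reference/estimate pair as follows (a self-contained NumPy sketch, not the toolkit's implementation):

import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    # zero-mean both signals, then project the estimate onto the reference
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference   # scaled target component
    noise = estimate - target    # residual distortion
    return 10 * np.log10(np.sum(target**2) / np.sum(noise**2))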

+
+
SSE Control Class

tasks/enh.py
class EnhancementTask(AbsTask)

EnhancementTask is a control class designed for SSE tasks. It contains class methods for building and training an SSE model. The class method build_model creates and returns an SSE object ESPnetEnhancementModel.
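Schematically (a sketch using the generic ESPnet2 AbsTask interface; the --config path is a placeholder and further arguments may be required in practice):

from espnet2.tasks.enh import EnhancementTask

# the task parser loads the YAML configuration into an argparse namespace
parser = EnhancementTask.get_parser()
args = parser.parse_args(["--config", "conf/tuning/train_enh_dprnn_tasnet.yaml"])
model = EnhancementTask.build_model(args)  # -> ESPnetEnhancementModel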

+
SSE Modules

enh/espnet_model.py
class ESPnetEnhancementModel(AbsESPnetModel)

ESPnetEnhancementModel is the base class for any ESPnet-SE++ SSE model. Since it inherits from the same abstract base class, AbsESPnetModel, it is well-aligned with other tasks such as ASR, TTS, ST, and SLU, bringing the benefits of cross-task combination.

+ def forward(self, speech_mix, speech_ref, ...) +

The forward function of ESPnetEnhancementModel follows the general design of the ESPnet single-task modules: it processes speech and returns only the losses used by Trainer to update the model.

+ def forward_enhance(self, speech_mix, ...) + def forward_loss(self, speech_pre, speech_ref, ...) +

For more flexible combinations, the forward_enhance function returns the enhanced speech, and the forward_loss function returns the loss. The joint-training methods take the enhanced speech as the input to the downstream task and use the SSE loss as a part of the joint-training loss.
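Schematically, a joint model can therefore combine the two methods as follows (a sketch with simplified signatures; the actual methods also take and return sequence lengths, statistics, and loss weights):

import torch

def joint_forward(enh_model, s2t_model, speech_mix, speech_ref, text_ref,
                  enh_weight: float = 0.3) -> torch.Tensor:
    # front-end: enhance the mixture
    speech_pre = enh_model.forward_enhance(speech_mix)
    # SSE loss against the reference signal
    loss_enh = enh_model.forward_loss(speech_pre, speech_ref)
    # downstream loss (e.g., ASR) computed on the enhanced speech
    loss_s2t = s2t_model(speech_pre, text_ref)
    return enh_weight * loss_enh + loss_s2t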

+
+
ESPnet-SE++ Software Structure for Joint-Task

The directory structure for the joint-task Python files can be found in TEMPLATE/enh_asr1/README.md. Furthermore, the UML diagram for the joint task in ESPnet-SE++ is displayed below.

+ +

UML Diagram for Joint-Task in + ESPnet-SE++

+ +
Joint-Task Executable Code bin/*

bin/enh_s2t_train.py

Similar to the SSE training interface enh_train.py, enh_s2t_train.py takes the training parameters and model configurations from the arguments and calls

+ tasks.enh_s2t.EnhS2TTask.main(...) +

to build a joint-task object for training the joint model, based on a configuration that specifies both the SSE and S2T models, with or without pre-trained checkpoints.

+
+ + bin/asr_inference.py, bin/diar_inference.py, and + bin/st_inference.py +

The inference functions in asr_inference.py, diar_inference.py, and st_inference.py build and call a

+ class Speech2Text + class DiarizeSpeech +

object with the data-iterator for testing and validation. During their initialization, these classes build a joint-task object ESPnetEnhS2TModel from pre-trained joint-task models and their configurations.

+
+
Joint-task Control Class

tasks/enh_s2t.py
class EnhS2TTask(AbsTask)

EnhS2TTask is a control class designed for joint-task models. The subtask models are created and passed into ESPnetEnhS2TModel to create a joint-task object.

+
Joint-Task Modules

enh/espnet_enh_s2t_model.py
class ESPnetEnhS2TModel(AbsESPnetModel)

The ESPnetEnhS2TModel takes a front-end enh_model and a back-end s2t_model (such as an ASR, SLU, ST, or SD model) as inputs to build a joint model.
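Conceptually, building the joint model amounts to the following (a sketch; the real constructor accepts additional options):

from espnet2.enh.espnet_enh_s2t_model import ESPnetEnhS2TModel

joint_model = ESPnetEnhS2TModel(
    enh_model=enh_model,  # front-end ESPnetEnhancementModel
    s2t_model=asr_model,  # back-end model, e.g., an ESPnet ASR model
)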

+ +

The forward function of the class + follows the general design in ESPnet2:

+ def forward(self, speech_mix, speech_ref, ...) +

which processes speech and returns only the losses used by Trainer to update the model.

+
+
+
+ + ESPnet-SE++ User Interface + + Building a New Recipe from Scratch +

Since ESPnet2 provides common scripts such as + enh.sh and enh_asr.sh + for each task, users only need to create + local/data.sh for the data preparation of a + new corpus. The generated data follows the Kaldi-style structure + (Povey + et al., 2011):
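For instance, an SSE corpus prepared by local/data.sh typically contains entries like the following (hypothetical paths):

data/train/
├── wav.scp   # utt_id  /path/to/mixture.wav
├── spk1.scp  # utt_id  /path/to/reference_speaker1.wav
├── utt2spk   # utt_id  speaker_id
└── text      # utt_id  transcription (needed for joint tasks)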

+ +

The detailed instructions for data preparation and building new + recipes in espnet2 are described in the + link.

+
+ + Inference with Pre-trained Models +

Pretrained models from ESPnet are provided on HuggingFace and Zenodo, and users can download them and run inference directly. The model_name in the following section should be a huggingface_id or one of the tags in the table.csv in espnet_model_zoo. Users can also directly provide a Zenodo URL or a HuggingFace URL.
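For example, a model can be fetched programmatically with the espnet_model_zoo downloader (the tag below is a placeholder):

from espnet_model_zoo.downloader import ModelDownloader

d = ModelDownloader()
# accepts a huggingface_id, a tag from table.csv, or a direct URL
paths = d.download_and_unpack("model_name")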

+ + Inference API +

The inference functions come from enh_inference and enh_asr_inference in the executable code under bin/:

+ +

Calling SeparateSpeech and Speech2Text with unprocessed audio returns the separated speech and its recognition results.
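A minimal end-to-end sketch (argument names may differ across ESPnet versions; all paths are placeholders):

import soundfile as sf
from espnet2.bin.enh_inference import SeparateSpeech
from espnet2.bin.asr_inference import Speech2Text

separate_speech = SeparateSpeech(
    train_config="enh/config.yaml", model_file="enh/model.pth")
speech2text = Speech2Text(
    asr_train_config="asr/config.yaml", asr_model_file="asr/model.pth")

mixwav, fs = sf.read("mixture.wav")
waves = separate_speech(mixwav[None, :], fs=fs)
for wav in waves:  # one stream per separated speaker
    text, *_ = speech2text(wav.squeeze(0))[0]
    print(text)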

+ + SSE + + + + Joint-Task + +

The details for downloading models and inference are + described in + espnet_model_zoo.

+
+
+
+
+ + Demonstrations +

The demonstrations of ESPnet-SE++ can be found in the following Google Colab notebooks:

+ + +

ESPnet + SSE Demonstration: CHiME-4 and WSJ0-2mix

+
+ +

ESPnet-SE++ + Joint-Task Demonstration: L3DAS22 Challenge and + SLURP-Spatialized

+
+
+
+ + Development plan +

The development plan of ESPnet-SE++ can be found in the Development plan for ESPnet2 speech enhancement. In addition, we will explore combinations with other front-end tasks, such as using ASR as a front-end model and TTS as a back-end model for speech-to-speech conversion.

+
+ + Conclusions +

In this paper, we introduce the software structure and the user interface of ESPnet-SE++, including the SSE task and joint-task models. ESPnet-SE++ provides general recipes for training models on different corpora and a simple way to add new recipes. The joint-task implementation further shows that the modularized design improves the flexibility of ESPnet.

+
+ + Acknowledgement +

This work used the Extreme Science and Engineering Discovery + Environment (XSEDE) + (Towns + et al., 2014), which is supported by NSF grant number + ACI-1548562. Specifically, it used the Bridges system + (Nystrom + et al., 2015), which is supported by NSF award number + ACI-1445606, at the Pittsburgh Supercomputing Center (PSC).

+
+ + + + + + + LiC. + ShiJ. + ZhangW. + SubramanianA. S. + ChangX. + KamoN. + HiraM. + HayashiT. + BoeddekerC. + ChenS.Z. Watanabe + + ESPnet-SE: End-to-end speech enhancement and separation toolkit designed for ASR integration + 2021 IEEE spoken language technology workshop (SLT) + IEEE + 2021 + 10.1109/slt48900.2021.9383615 + 785 + 792 + + + + + + HersheyJ. R. + ChenZ. + Le RouxJ. + WatanabeS. + + Deep clustering: Discriminative embeddings for segmentation and separation + 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 2016 + 10.1109/icassp.2016.7471631 + 31 + 35 + + + + + + ChenZ. + LuoY. + MesgaraniN. + + Deep attractor network for single-microphone speaker separation + 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 2017 + 10.1109/icassp.2017.7952155 + 246 + 250 + + + + + + HuY. + LiuY. + LvS. + XingM. + ZhangS. + FuY. + WuJ. + ZhangB. + XieL. + + DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement + Proceedings of interspeech + 2020 + 10.21437/interspeech.2020-2537 + 2472 + 2476 + + + + + + TanK. + ZhangX. + WangD. + + Deep learning based real-time speech enhancement for dual-microphone mobile phones + IEEE/ACM Transactions on Audio, Speech, and Language Processing + IEEE + 2021 + 29 + 10.1109/taslp.2021.3082318 + 1853 + 1863 + + + + + + LiC. + YangL. + WangW. + QianY. + + SkiM: Skipping memory lstm for low-latency real-time continuous speech separation + 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 202205 + https://doi.org/10.1109%2Ficassp43922.2022.9746372 + 10.1109/icassp43922.2022.9746372 + 681 + 685 + + + + + + DangF. + ChenH. + ZhangP. + + DPT-FSNet: Dual-path transformer based full-band and sub-band fusion network for speech enhancement + 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 202205 + https://doi.org/10.1109%2Ficassp43922.2022.9746171 + 10.1109/icassp43922.2022.9746171 + 6857 + 6861 + + + + + + TakahashiN. + ParthasaarathyS. + GoswamiN. + MitsufujiY. + + Recursive speech separation for unknown number of speakers + Interspeech 2019 + ISCA + 201909 + https://doi.org/10.21437%2Finterspeech.2019-1550 + 10.21437/interspeech.2019-1550 + 1348 + 1352 + + + + + + LuoY. + HanC. + MesgaraniN. + CeoliniE. + LiuS. + + FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing + 2019 IEEE automatic speech recognition and understanding workshop (ASRU) + IEEE + 201912 + https://doi.org/10.1109%2Fasru46091.2019.9003849 + 10.1109/asru46091.2019.9003849 + 260 + 267 + + + + + + LuY. J. + CornellS. + ChangX. + ZhangW. + LiC. + NiZ. + WangZ. + WatanabeS. + + Towards low-distortion multi-channel speech enhancement: The ESPNET-se submission to the L3DAS22 challenge + 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 202205 + https://doi.org/10.1109%2Ficassp43922.2022.9747146 + 10.1109/icassp43922.2022.9747146 + 9201 + 9205 + + + + + + LuoY. + MesgaraniN. + + TaSNet: Time-domain audio separation network for real-time, single-channel speech separation + 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 201804 + https://doi.org/10.1109%2Ficassp.2018.8462116 + 10.1109/icassp.2018.8462116 + 696 + 700 + + + + + + Le RouxJ. + WisdomS. + ErdoganH. + HersheyJ. R. + + SDR half-baked or well done? 
+ 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 201905 + https://doi.org/10.1109%2Ficassp.2019.8683855 + 10.1109/icassp.2019.8683855 + 626 + 630 + + + + + + BoeddekerC. + ZhangW. + NakataniT. + KinoshitaK. + OchiaiT. + DelcroixM. + KamoN. + QianY. + Haeb-UmbachR. + + Convolutive transfer function invariant SDR training criteria for multi-channel reverberant speech separation + 2021 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 202106 + https://doi.org/10.1109%2Ficassp39728.2021.9414661 + 10.1109/icassp39728.2021.9414661 + 8428 + 8432 + + + + + + ScheiblerR. + + SDR medium rare with fast computations + 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 202205 + https://doi.org/10.1109%2Ficassp43922.2022.9747473 + 10.1109/icassp43922.2022.9747473 + 701 + 705 + + + + + + LuY. J. + ChangX. + LiC. + ZhangW. + CornellS. + NiZ. + MasuyamaY. + YanB. + ScheiblerR. + WangZ. Q. + TsaoY. + Qian Y. WatanabeS. + + ESPnet-SE++: Speech enhancement for robust speech recognition, translation, and understanding + Proceedings of interspeech + 2022 + 10.21437/interspeech.2022-10727 + 5458 + 5462 + + + + + + HayashiT. + YamamotoR. + InoueK. + YoshimuraT. + WatanabeS. + TodaT. + TakedaK. + ZhangX.Y. Tan + + ESPnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit + 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 2020 + 10.1109/icassp40776.2020.9053512 + 7654 + 7658 + + + + + + InagumaH. + KiyonoS. + DuhK. + KaritaS. + SoplinN. E. Y. + HayashiT. + WatanabeS. + + ESPnet-ST: All-in-one speech translation toolkit + Proceedings of the 58th annual meeting of the association for computational linguistics: System demonstrations + Association for Computational Linguistics + 2020 + 10.18653/v1/2020.acl-demos.34 + 302 + 311 + + + + + + AroraS. + DalmiaS. + DenisovP. + ChangX. + UedaY. + PengY. + ZhangY. + KumarS. + GanesanK. + YanWatanabeB + + ESPnet-SLU: Advancing spoken language understanding through ESPnet + 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP) + IEEE + 2022 + 10.1109/icassp43922.2022.9747674 + 7167 + 7171 + + + + + + WatanabeS. + HoriT. + KaritaS. + HayashiT. + NishitobaJ. + UnnoY. + SoplinN. E. Y. + HeymannJ. + WiesnerM. + ChenN. + RenduchintalaA. + OchiaiT. + + ESPnet: End-to-end speech processing toolkit + Proceedings of interspeech + 2018 + 10.21437/interspeech.2018-1456 + 2207 + 2211 + + + + + + ManilowE. + SeetharamanP. + PardoB. + + The northwestern university source separation library. + International society for music information retrieval (ISMIR) + 2018 + 10.1163/1872-9037_afco_asc_1322 + 297 + 305 + + + + + + NiMichael IZhaoheng Mandel + + ONSSEN: An open-source speech separation and enhancement library + arXiv preprint arXiv:1911.00982 + 2019 + + + + + + ParienteM. + CornellS. + CosentinoJ. + SivasankaranS. + TzinisE. + HeitkaemperJ. + OlveraM. + StöterF. R. + HuM. + Martı́n-DoñasJ. M. + DitterD. + FrankA. + DeleforgeA. + VincentE. + + Asteroid: The PyTorch-based audio source separation toolkit for researchers + Proceedings of interspeech + 2020 + 10.21437/interspeech.2020-1673 + 2637 + 2641 + + + + + + RavanelliM. + ParcolletT. + PlantingaP. + RouheA. + CornellS. + LugoschL. + SubakanC. + DawalatabadN. + HebaA. + ZhongJ. + ChouJ. C. + YehS. L. + FuS. W. + LiaoC. F. + RastorguevaE. + GrondinF. + ArisW. + NaH. + GaoY. + Mori R. D. 
BengioY. + + SpeechBrain: A general-purpose speech toolkit + arXiv preprint arXiv:2106.04624 + 2021 + + + + + + PoveyD. + GhoshalA. + BoulianneG. + BurgetL. + GlembekO. + GoelN. + HannemannM. + MotlicekP. + QianY. + SchwarzP. + Silovsky´J. + StemmerK.G. Vesely + + The Kaldi speech recognition toolkit + IEEE 2011 workshop on automatic speech recognition and understanding + IEEE Signal Processing Society + 2011 + 10.15199/48.2016.11.70 + + + + + + TaalC. H. + HendriksR. C. + HeusdensR. + JensenJ. + + An algorithm for intelligibility prediction of time–frequency weighted noisy speech + IEEE Transactions on Audio, Speech, and Language Processing + IEEE + 2011 + 19 + 7 + 10.1109/tasl.2011.2114881 + 2125 + 2136 + + + + + + RixA. W. + BeerendsJ. G. + HollierM. P. + HekstraA. P. + + Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs + 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (cat. No. 01CH37221) + IEEE + 2001 + 2 + 10.1109/icassp.2001.941023 + 749 + 752 + + + + + + TownsJ. + CockerillT. + DahanM. + FosterI. + GaitherK. + GrimshawA. + HazlewoodV. + LathropS. + LifkaD. + PetersonG. D. + RoskiesR. + ScottJ. R. + Wilkins-DiehrN. + + XSEDE: Accelerating scientific discovery + Computing in Science & Engineering + IEEE + 2014 + 16 + 5 + 10.1109/mcse.2014.80 + 62 + 74 + + + + + + NystromN. A. + LevineM. J. + RoskiesR. Z. + ScottJ. R. + + Bridges: A uniquely flexible HPC resource for new communities and data analytics + Proceedings of the 2015 XSEDE conference: Scientific advancements enabled by enhanced cyberinfrastructure + 2015 + 10.1145/2792745.2792775 + 1 + 8 + + + + +
diff --git a/joss.05403/10.21105.joss.05403.pdf b/joss.05403/10.21105.joss.05403.pdf new file mode 100644 index 0000000000..68cd0f086e Binary files /dev/null and b/joss.05403/10.21105.joss.05403.pdf differ diff --git a/joss.05403/media/graphics/UML_Joint.png b/joss.05403/media/graphics/UML_Joint.png new file mode 100644 index 0000000000..7ff8f1761d Binary files /dev/null and b/joss.05403/media/graphics/UML_Joint.png differ diff --git a/joss.05403/media/graphics/UML_SSE.png b/joss.05403/media/graphics/UML_SSE.png new file mode 100644 index 0000000000..d6c281aebb Binary files /dev/null and b/joss.05403/media/graphics/UML_SSE.png differ diff --git a/joss.05403/media/graphics/data_structure.png b/joss.05403/media/graphics/data_structure.png new file mode 100644 index 0000000000..7f0a8e8d0c Binary files /dev/null and b/joss.05403/media/graphics/data_structure.png differ diff --git a/joss.05403/media/graphics/enh_asr_script.png b/joss.05403/media/graphics/enh_asr_script.png new file mode 100644 index 0000000000..27a07f91ed Binary files /dev/null and b/joss.05403/media/graphics/enh_asr_script.png differ diff --git a/joss.05403/media/graphics/enh_script.png b/joss.05403/media/graphics/enh_script.png new file mode 100644 index 0000000000..08c18246d7 Binary files /dev/null and b/joss.05403/media/graphics/enh_script.png differ diff --git a/joss.05403/media/graphics/espnet-SE++.png b/joss.05403/media/graphics/espnet-SE++.png new file mode 100644 index 0000000000..a261bf9577 Binary files /dev/null and b/joss.05403/media/graphics/espnet-SE++.png differ diff --git a/joss.05403/media/graphics/inference.png b/joss.05403/media/graphics/inference.png new file mode 100644 index 0000000000..c7ba59cf88 Binary files /dev/null and b/joss.05403/media/graphics/inference.png differ diff --git a/joss.05403/media/graphics/inference_SSE.png b/joss.05403/media/graphics/inference_SSE.png new file mode 100644 index 0000000000..33cd7a96ba Binary files /dev/null and b/joss.05403/media/graphics/inference_SSE.png differ diff --git a/joss.05403/media/graphics/inference_joint.png b/joss.05403/media/graphics/inference_joint.png new file mode 100644 index 0000000000..7f28d45f2b Binary files /dev/null and b/joss.05403/media/graphics/inference_joint.png differ diff --git a/joss.05403/media/graphics/joint_init.png b/joss.05403/media/graphics/joint_init.png new file mode 100644 index 0000000000..ede56be1dd Binary files /dev/null and b/joss.05403/media/graphics/joint_init.png differ