From 35b41205ab6f3013ed76149c0a3a9965e67aa0a4 Mon Sep 17 00:00:00 2001 From: spapi Date: Mon, 17 Jun 2024 19:35:25 +0200 Subject: [PATCH] [!202][RELEASE] StreamAtt # Which work do we release? "StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection" published at ACL 2024. # What changes does this release refer to? 800277d6dc8295145a504f1d56c3fc354e8381a6 0396faf27244901498147f04b3f73fa2c7d52951 af71dc0030884da92a93d45adb55020443d15381 77ea833521a1fd54159a12fab8ef67d13bbecced --- README.md | 3 + fbk_works/STREAMATT_STREAMLAAL.md | 145 ++++++++++++++++++++++++++++++ 2 files changed, 148 insertions(+) create mode 100644 fbk_works/STREAMATT_STREAMLAAL.md diff --git a/README.md b/README.md index 872f5fa6..0b16f88d 100644 --- a/README.md +++ b/README.md @@ -5,6 +5,7 @@ Dedicated README for each work can be found in the `fbk_works` directory. ### 2024 + - [[ACL 2024] **StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection**](fbk_works/STREAMATT_STREAMLAAL.md) - [[ACL 2024] **SBAAM! Eliminating Transcript Dependency in Automatic Subtitling**](fbk_works/SBAAM.md) - [[ACL 2024] **When Good and Reproducible Results are a Giant with Feet of Clay: The Importance of Software Quality in NLP**](fbk_works/BUGFREE_CONFORMER.md) - [[LREC-COLING 2024] **How do Hyenas deal with Human Speech? Speech Recognition and Translation with ConfHyena**](fbk_works/HYENA_COLING2024.md) @@ -37,6 +38,8 @@ Dedicated README for each work can be found in the `fbk_works` directory. If using this repository, please acknowledge the related paper(s) citing them. Bibtex citations are available for each work in the dedicated README file. +## Installation + To install the repository, do: ``` diff --git a/fbk_works/STREAMATT_STREAMLAAL.md b/fbk_works/STREAMATT_STREAMLAAL.md new file mode 100644 index 00000000..9fe3cc89 --- /dev/null +++ b/fbk_works/STREAMATT_STREAMLAAL.md @@ -0,0 +1,145 @@ +# StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection (ACL 2024) +![ACL Anthology](https://img.shields.io/badge/anthology-brightgreen?logo=data%3Aimage%2Fsvg%2Bxml%3Bbase64%2CPD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0iVVRGLTgiIHN0YW5kYWxvbmU9Im5vIj8%2BCjwhLS0gQ3JlYXRlZCB3aXRoIElua3NjYXBlIChodHRwOi8vd3d3Lmlua3NjYXBlLm9yZy8pIC0tPgo8c3ZnCiAgIHhtbG5zOnN2Zz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciCiAgIHhtbG5zPSJodHRwOi8vd3d3LnczLm9yZy8yMDAwL3N2ZyIKICAgdmVyc2lvbj0iMS4wIgogICB3aWR0aD0iNjgiCiAgIGhlaWdodD0iNjgiCiAgIGlkPSJzdmcyIj4KICA8ZGVmcwogICAgIGlkPSJkZWZzNCIgLz4KICA8cGF0aAogICAgIGQ9Ik0gNDEuOTc3NTUzLC0yLjg0MjE3MDllLTAxNCBDIDQxLjk3NzU1MywxLjc2MTc4IDQxLjk3NzU1MywxLjQ0MjExIDQxLjk3NzU1MywzLjAxNTggTCA3LjQ4NjkwNTQsMy4wMTU4IEwgMCwzLjAxNTggTCAwLDEwLjUwMDc5IEwgMCwzOC40Nzg2NyBMIDAsNDYgTCA3LjQ4NjkwNTQsNDYgTCA0OS41MDA4MDIsNDYgTCA1Ni45ODc3MDgsNDYgTCA2OCw0NiBMIDY4LDMwLjk5MzY4IEwgNTYuOTg3NzA4LDMwLjk5MzY4IEwgNTYuOTg3NzA4LDEwLjUwMDc5IEwgNTYuOTg3NzA4LDMuMDE1OCBDIDU2Ljk4NzcwOCwxLjQ0MjExIDU2Ljk4NzcwOCwxLjc2MTc4IDU2Ljk4NzcwOCwtMi44NDIxNzA5ZS0wMTQgTCA0MS45Nzc1NTMsLTIuODQyMTcwOWUtMDE0IHogTSAxNS4wMTAxNTUsMTcuOTg1NzggTCA0MS45Nzc1NTMsMTcuOTg1NzggTCA0MS45Nzc1NTMsMzAuOTkzNjggTCAxNS4wMTAxNTUsMzAuOTkzNjggTCAxNS4wMTAxNTUsMTcuOTg1NzggeiAiCiAgICAgc3R5bGU9ImZpbGw6I2VkMWMyNDtmaWxsLW9wYWNpdHk6MTtmaWxsLXJ1bGU6ZXZlbm9kZDtzdHJva2U6bm9uZTtzdHJva2Utd2lkdGg6MTIuODk1NDExNDk7c3Ryb2tlLWxpbmVjYXA6YnV0dDtzdHJva2UtbGluZWpvaW46bWl0ZXI7c3Ryb2tlLW1pdGVybGltaXQ6NDtzdHJva2UtZGFzaGFycmF5Om5vbmU7c3Ryb2tlLWRhc2hvZmZzZXQ6MDtzdHJva2Utb3BhY2l0eToxIgogICAgIHRyYW5zZm9ybT0idHJhbnNsYXRlKDAsIDExKSIKICAgICBpZD0icmVjdDIxNzgiIC8%2BCjwvc3ZnPgo%3D&label=ACL&labelColor=white&color=red) + +Code for the paper: ["StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection"](https://arxiv.org/abs/2406.06097) published at the ACL 2024 main conference. + +## ๐Ÿ“Ž Requirements +To run the agent, please make sure that [this repository](../README.md#installation) and +[SimulEval v1.1.0](https://github.com/facebookresearch/SimulEval/commit/ec759d124307096dbbf6c3269d2ed652cc15fbdd) +are installed. + +Create a textual file (e.g., `src_audiopath_list.txt`) containing the list of paths to the audio +files (one path per line for each file), which, differently from SimulST, are __not__ split into +segments but are the entire speeches. +Specifically, in the case of the MuST-C dataset used in the paper, the file contains the paths to +the entire TED talk files, similar to the following: +```txt +${AUDIO_DIR}/ted_1096.wav +${AUDIO_DIR}/ted_1102.wav +${AUDIO_DIR}/ted_1104.wav +${AUDIO_DIR}/ted_1114.wav +${AUDIO_DIR}/ted_1115.wav +... +``` +Instead, as target file `translations.txt`, it can either be used a dummy file or the sentences +concatenation, one line for each talk. +However, for the evaluation of already segmented test sets, such as in MuST-C, we will not need +these references, and we will evaluate directly from the segmented translations provided with the +dataset, as described in [Evaluation with StreamLAAL](#-evaluation-streamlaal). + +## ๐Ÿ“Œ Pre-trained Offline models +โš ๏ธ The offline ST models used for the baseline, AlignAtt, and StreamAtt are the same and already available at +the [AlignAtt release webpage](ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md#-pre-trained-offline-models)โ— + +## ๐Ÿค– Streaming Inference: *StreamAtt* +For the streaming inference, set `--config` and `--model-path` as, respectively, the config file +and the model checkpoint downloaded in the +[Pre-trained Offline models](#-pre-trained-offline-models) step. +As `--source` and `--target`, please use the files `src_audiopath_list.txt` and `translations.txt` +created in the [Requirements](#-requirements) step. + +The output will be saved in `--output`. + +### โญ StreamAtt +For the ***Hypothesis Selection*** (based on AlignAtt), please set `--frame-num` as the value of +*f* used for the inference (`f=[2, 4, 6, 8]`, in the paper). + +Depending on the ***Textual History Selection*** ([Fixed Words](#fixed-words) or [Punctuation](#punctuation)), run the following command: + +#### Fixed Words +```bash +simuleval \ + --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \ + --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \ + --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedWordsHistorySelection \ + --source ${SRC_LIST_OF_AUDIO} \ + --target ${TGT_FILE} \ + --data-bin ${DATA_ROOT} \ + --config config.yaml \ + --model-path checkpoint.pt \ + --source-segment-size 1000 \ + --extract-attn-from-layer 3 \ + --frame-num ${FRAME} \ + --history-words 20 \ + --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \ + --device cuda:0 +``` + +#### Punctuation +```bash +simuleval \ + --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \ + --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \ + --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.PunctuationHistorySelection \ + --source ${SRC_LIST_OF_AUDIO} \ + --target ${TGT_FILE} \ + --data-bin ${DATA_ROOT} \ + --config config.yaml \ + --model-path checkpoint.pt \ + --source-segment-size 1000 \ + --extract-attn-from-layer 3 \ + --frame-num ${FRAME} \ + --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \ + --device cuda:0 +``` + +### โญ Baseline and Upperbound + +To run the baseline, execute the following command: +```bash +simuleval \ + --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.streaming_st_agent.StreamingSTAgent \ + --simulst-agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_offline_alignatt.AlignAttSTAgent \ + --history-selection-method examples.speech_to_text.simultaneous_translation.agents.v1_1.streaming.text_first_history_selection.FixedAudioHistorySelection \ + --source ${SRC_LIST_OF_AUDIO} \ + --target ${TGT_FILE} \ + --data-bin ${DATA_ROOT} \ + --config config.yaml \ + --model-path checkpoint.pt \ + --source-segment-size 1000 \ + --extract-attn-from-layer 3 \ + --frame-num ${FRAME} \ + --history-words 20 \ + --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware --output ${OUT_DIR} \ + --device cuda:0 +``` + +For the simultaneous inference with AlignAtt (the upperbound presented in the paper), please refer +to the [AlignAtt README](ALIGNATT_SIMULST_AGENT_INTERSPEECH2023.md#-inference). + +## ๐Ÿ’ฌ Evaluation: *StreamLAAL* +To evaluate the streaming outputs, download and extract the +[mwerSegmenter](https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz) in the +`${MWERSEGMENTER_DIR}` folder, and run the following command: +```bash +export MWERSEGMENTER_ROOT=${MWERSEGMENTER_DIR} + +streamLAAL --simuleval-instances ${SIMULEVAL_INSTANCES} \ + --reference ${REFERENCE_TEXTS} \ + --audio-yaml ${AUDIO_YAML} \ + --sacrebleu-tokenizer ${SACREBLEU_TOKENIZER} \ + --latency-unit ${LATENCY_UNIT} +``` +where `${SIMULEVAL_INSTANCES}` is the output `instances.log` produced by the agent in the previous +step, `${REFERENCE_TEXTS}` are the textual references in the target language (one line for each +segment), `${AUDIO_YAML}` is the yaml file containing the original audio segmentation, +`${SACREBLEU_TOKENIZER}` is the [sacreBLEU](https://github.com/mjpost/sacrebleu) tokenizer used for +the quality evaluation (defaults to `13a`), and `${LATENCY_UNIT}` is the unit used for the latency +computation (either `word` or `char`, defaults to `word`, the unit used in the paper). + +If invoking `streamLAAL` does not work, please include the FBK-fairseq directory +(`${FBK_FAIRSEQ_DIR}`) in the `PYTHONPATH` (`export PYTHONPATH=${FBK_FAIRSEQ_DIR}:$PYTHONPATH`) or +call it explicitly by running +`python ${FBK_FAIRSEQ_DIR}/examples/speech_to_text/simultaneous_translation/scripts/stream_laal.py`. + + +## ๐Ÿ“Citation +```bibtex +@inproceedings{papi-et-al-2024-streamatt, +title = {{StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection}}, +author = {Papi, Sara and Gaido, Marco and Negri, Matteo and Bentivogli, Luisa}, +booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)", +year = {2024}, +address = "Bangkok, Thailand", +} +```