Code for the paper: "SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation" published at IWSLT 2024.
To run the agent, please make sure that SimulEval v1.1.0 and HuggingFace Transformers are installed.
In the case of 💬 Inference using docker, use commit `f1f5b9a69a47496630aa43605f1bd46e5484a2f4` for SimulEval.
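As an illustration, one possible environment setup is sketched below, assuming installation of SimulEval from its official GitHub repository (adapt the procedure to your system):

```bash
# Sketch of one possible setup (assumed sources; adapt to your system)
git clone https://github.com/facebookresearch/SimulEval.git
cd SimulEval
git checkout v1.1.0   # for docker inference, check out f1f5b9a69a47496630aa43605f1bd46e5484a2f4 instead
pip install -e .
pip install transformers
```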
Please set `--source` and `--target` as described in the Fairseq Simultaneous Translation repository: `${LIST_OF_AUDIO}` is the list of audio paths and `${TGT_FILE}` contains the segment-wise references in the target language.
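For illustration, the two files could look as follows (hypothetical paths and references; entries are aligned line by line):

```
# ${LIST_OF_AUDIO}: one audio path per line
/data/iwslt24/segment_0001.wav
/data/iwslt24/segment_0002.wav

# ${TGT_FILE}: one target-language reference per line
Das ist ein Beispiel.
Dies ist ein weiteres Beispiel.
```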
Set `${TGT_LANG}` to the 3-character target language code. The list of supported language codes is available here. No language code has to be specified for the source language.
Depending on the target language, set `${LATENCY_UNIT}` to either `word` (e.g., for German) or `char` (e.g., for Japanese), and `${BLEU_TOKENIZER}` to either `13a` (i.e., the standard sacreBLEU tokenizer used, for example, to evaluate German) or `char` (e.g., to evaluate character-level languages such as Chinese or Japanese).
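For example, the two variables pair up as follows (illustrative values):

```bash
# Either: word-level target language (e.g., German)
LATENCY_UNIT=word; BLEU_TOKENIZER=13a
# Or: character-level target language (e.g., Japanese, Chinese)
LATENCY_UNIT=char; BLEU_TOKENIZER=char
```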
The simultaneous inference of SimulSeamless is based on AlignAtt, thus the f parameter (`${FRAME}`) and the layer from which to extract the attention scores (`${LAYER}`) have to be set accordingly.
To replicate the results achieving 2 seconds of latency (measured by AL) on the test sets of the IWSLT 2024 Simultaneous track, use the following values:
- en-de: `${TGT_LANG}=deu`, `${FRAME}=6`, `${LAYER}=3`, `${SEG_SIZE}=1000`
- en-ja: `${TGT_LANG}=jpn`, `${FRAME}=1`, `${LAYER}=0`, `${SEG_SIZE}=400`
- en-zh: `${TGT_LANG}=cmn`, `${FRAME}=1`, `${LAYER}=3`, `${SEG_SIZE}=800`
- cs-en: `${TGT_LANG}=eng`, `${FRAME}=9`, `${LAYER}=3`, `${SEG_SIZE}=1000`
❗️Please note that `${FRAME}` can be adjusted to achieve lower or higher latency.
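For example, a complete en-de setup could be exported as follows (all paths, including `${DATA_ROOT}`, are hypothetical placeholders to adapt to your system):

```bash
# en-de values from the list above; adjust FRAME for a different latency regime
export TGT_LANG=deu FRAME=6 LAYER=3 SEG_SIZE=1000
export LATENCY_UNIT=word BLEU_TOKENIZER=13a
# Hypothetical paths: adapt to your data
export LIST_OF_AUDIO=/data/iwslt24/audio_list.txt
export TGT_FILE=/data/iwslt24/references.de
export DATA_ROOT=/data/iwslt24
export OUT_DIR=outputs/en-de
```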
SimulSeamless can be run with:
```bash
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_alignatt_seamlessm4t.AlignAttSeamlessS2T \
    --source ${LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --model-size medium --target-language ${TGT_LANG} \
    --extract-attn-from-layer ${LAYER} --num-beams 5 \
    --frame-num ${FRAME} \
    --source-segment-size ${SEG_SIZE} \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
    --output ${OUT_DIR} \
    --device cuda:0
```
If not already stored in your system, the SeamlessM4T model will be downloaded automatically when running the script. The output will be saved in `${OUT_DIR}`.
We suggest running the inference on a GPU to speed up the process, but the system can be run on any device (e.g., CPU) supported by SimulEval and HuggingFace, by setting `--device` accordingly.
To run SimulSeamless using docker, as required by the IWSLT 2024 Simultaneous track, follow the steps below:
- Download the docker file `simulseamless.tar`
- Load the docker image:

```bash
docker load -i simulseamless.tar
```

- Start the SimulEval standalone with GPU enabled:

```bash
docker run -e TGTLANG=${TGT_LANG} -e FRAME=${FRAME} -e LAYER=${LAYER} \
    -e BLEU_TOKENIZER=${BLEU_TOKENIZER} -e LATENCY_UNIT=${LATENCY_UNIT} \
    -e DEV=cuda:0 --gpus all --shm-size 32G \
    -p 2024:2024 simulseamless:latest
```
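To check that the standalone agent inside the container has started and is listening on port 2024, standard Docker commands can be used, for example:

```bash
docker ps    # the simulseamless container should appear with port 2024 published
docker logs $(docker ps -q --filter ancestor=simulseamless:latest)
```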
- Start the remote evaluation with:

```bash
simuleval \
    --remote-eval --remote-port 2024 \
    --source ${LIST_OF_AUDIO} --target ${TGT_FILE} \
    --source-type speech --target-type text \
    --source-segment-size ${SEG_SIZE} \
    --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
    --output ${OUT_DIR}
```
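When the evaluation finishes, the results are written to `${OUT_DIR}`; with recent SimulEval versions this typically includes an `instances.log` with segment-level hypotheses, although exact file names may vary by version:

```bash
ls ${OUT_DIR}
tail ${OUT_DIR}/instances.log   # assumed log name; check your SimulEval version
```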
To set `${TGT_LANG}`, `${FRAME}`, `${LAYER}`, `${BLEU_TOKENIZER}`, `${LATENCY_UNIT}`, `${LIST_OF_AUDIO}`, `${TGT_FILE}`, `${SEG_SIZE}`, and `${OUT_DIR}`, refer to 🤖 Inference using your environment.
To recreate the docker image, follow the steps below.
- Download SimulEval and this repository.
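For example, assuming the official GitHub locations (verify the URLs against the paper and this repository):

```bash
# Clone side by side so that the Dockerfile's ADD paths (/SimulEval, /fbk-fairseq) resolve
git clone https://github.com/facebookresearch/SimulEval.git SimulEval
(cd SimulEval && git checkout f1f5b9a69a47496630aa43605f1bd46e5484a2f4)
git clone https://github.com/hlt-mt/FBK-fairseq.git fbk-fairseq
```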
- Create a `Dockerfile` with the following content:
```dockerfile
FROM python:3.9
RUN pip install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0 --extra-index-url https://download.pytorch.org/whl/cu113
ADD /SimulEval /SimulEval
WORKDIR /SimulEval
RUN pip install -e .
WORKDIR ../
ADD /fbk-fairseq /fbk-fairseq
WORKDIR /fbk-fairseq
RUN pip install -e .
RUN pip install -r speech_requirements.txt
WORKDIR ../
RUN pip install sentencepiece
RUN pip install transformers
ENTRYPOINT simuleval --standalone --remote-port 2024 \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_alignatt_seamlessm4t.AlignAttSeamlessS2T \
    --model-size medium --num-beams 5 --user-dir fbk-fairseq/examples \
    --target-language $TGTLANG --frame-num $FRAME --extract-attn-from-layer $LAYER --device $DEV \
    --sacrebleu-tokenizer ${BLEU_TOKENIZER} --eval-latency-unit ${LATENCY_UNIT}
```
- Build the docker image:

```bash
docker build -t simulseamless .
```

- Save the docker image:

```bash
docker save -o simulseamless.tar simulseamless:latest
```
```bibtex
@inproceedings{papi-et-al-2024-simulseamless,
  title = "SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation",
  author = "Papi, Sara and Gaido, Marco and Negri, Matteo and Bentivogli, Luisa",
  booktitle = "Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)",
  year = "2024",
  address = "Bangkok, Thailand",
}
```