---
layout: post
title: Open Whisper-style Speech Models (OWSM)
description: This is the project page for OWSM models.
date: 2024-01-01 00:00:00-0800
comments: false
---

## Overview

**O**pen **W**hisper-style **S**peech **M**odels (OWSM, pronounced as "awesome") are a series of speech foundation models developed by [WAVLab](https://www.wavlab.org/) at Carnegie Mellon University. We reproduce Whisper-style training using publicly available data and our open-source toolkit [ESPnet](https://github.com/espnet/espnet). By publicly releasing data preparation scripts, training and inference code, pre-trained model weights, and training logs, we aim to promote transparency and open science in large-scale speech pre-training.

## Demo

- Gradio demo: [![Static Badge](https://img.shields.io/badge/OWSM-Demo-orange)](https://pyf98-owsm-v3-demo.hf.space)
- Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zKI3ZY_OtZd6YmVeED6Cxy1QwT1mqv9O?usp=sharing)

## Pre-trained models

We publicly release a series of pre-trained models. The training logs are also available for major models. <strong>We recommend using OWSM v3.1 or later versions for better performance and efficiency.</strong>

<table class="table">
  <thead>
    <tr>
      <th>Name</th>
      <th>Data (hours)</th>
      <th>Encoder</th>
      <th>Parameters</th>
      <th>Model Link</th>
      <th>ESPnet Recipe</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>OWSM v1</td>
      <td>38k</td>
      <td>Transformer</td>
      <td>272M</td>
      <td><a href="https://huggingface.co/espnet/owsm_v1">espnet/owsm_v1</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v1/s2t1">egs2/owsm_v1/s2t1</a></td>
    </tr>
    <tr>
      <td>OWSM v2</td>
      <td>129k</td>
      <td>Transformer</td>
      <td>712M</td>
      <td><a href="https://huggingface.co/espnet/owsm_v2">espnet/owsm_v2</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v2/s2t1">egs2/owsm_v2/s2t1</a></td>
    </tr>
    <tr>
      <td>OWSM v2</td>
      <td>129k</td>
      <td>E-Branchformer</td>
      <td>739M</td>
      <td><a href="https://huggingface.co/espnet/owsm_v2_ebranchformer">espnet/owsm_v2_ebranchformer</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v2/s2t1">egs2/owsm_v2/s2t1</a></td>
    </tr>
    <tr>
      <td>OWSM v3</td>
      <td>180k</td>
      <td>Transformer</td>
      <td>889M</td>
      <td><a href="https://huggingface.co/espnet/owsm_v3">espnet/owsm_v3</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3/s2t1">egs2/owsm_v3/s2t1</a></td>
    </tr>
    <tr>
      <td><b>OWSM v3.1 base</b></td>
      <td>180k</td>
      <td>E-Branchformer</td>
      <td>101M</td>
      <td><a href="https://huggingface.co/espnet/owsm_v3.1_ebf_base">espnet/owsm_v3.1_ebf_base</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/s2t1">egs2/owsm_v3.1/s2t1</a></td>
    </tr>
    <tr>
      <td><b>OWSM v3.1 small</b></td>
      <td>180k</td>
      <td>E-Branchformer</td>
      <td>367M</td>
      <td><a href="">Coming soon</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/s2t1">egs2/owsm_v3.1/s2t1</a></td>
    </tr>
    <tr>
      <td><b>OWSM v3.1 medium</b></td>
      <td>180k</td>
      <td>E-Branchformer</td>
      <td>1.02B</td>
      <td><a href="https://huggingface.co/espnet/owsm_v3.1_ebf">espnet/owsm_v3.1_ebf</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/s2t1">egs2/owsm_v3.1/s2t1</a></td>
    </tr>
    <tr>
      <td><b>OWSM v3.1 medium license-free</b></td>
      <td>70k</td>
      <td>E-Branchformer</td>
      <td>1.02B</td>
      <td><a href="">Coming soon</a></td>
      <td><a href="">Coming soon</a></td>
    </tr>
  </tbody>
</table>

## Data details

The latest OWSM v3.1 models are trained on a diverse combination of public datasets as listed below.

<details style="margin-bottom:1em;"><summary>OWSM v3.1 training data mixtures</summary>
<ul>
  <li>AIDATATANG</li>
  <li>AISHELL-1</li>
  <li>AMI</li>
  <li>Babel</li>
  <li>Common Voice</li>
  <li>Googlei18n</li>
  <li>CoVoST2</li>
  <li>Fisher Callhome Spanish</li>
  <li>Fisher (Switchboard)</li>
  <li>FLEURS</li>
  <li>GigaSpeech</li>
  <li>GigaST</li>
  <li>KsponSpeech</li>
  <li>LibriSpeech</li>
  <li>MagicData</li>
  <li>Multilingual LibriSpeech</li>
  <li>MuST-C</li>
  <li>ReazonSpeech</li>
  <li>Russian Open STT</li>
  <li>SPGISpeech</li>
  <li>TEDLIUM3</li>
  <li>VCTK</li>
  <li>VoxForge</li>
  <li>VoxPopuli</li>
  <li>WenetSpeech</li>
</ul>
</details>

The license-free model is trained on a subset of the above data with "free licenses".

<details style="margin-bottom:1em;"><summary>OWSM v3.1 license-free data</summary>
<ul>
  <li>AMI: CC-BY-4.0</li>
  <li>Common Voice: CC0-1.0</li>
  <li>FLEURS: CC-BY-4.0</li>
  <li>KsponSpeech: MIT</li>
  <li>LibriSpeech: CC-BY-4.0</li>
  <li>Multilingual LibriSpeech: CC-BY-4.0</li>
  <li>VCTK: CC-BY-4.0</li>
</ul>
</details>

## Inference

Similar to other ESPnet models, the pre-trained OWSM models can be easily downloaded and used in a Python script. Below are some examples.

### Language Identification

We pass the Hugging Face model tag when initializing `Speech2Language`. The model will be automatically downloaded from Hugging Face to a local cache directory.

```python
from espnet2.bin.s2t_inference_language import Speech2Language
s2l = Speech2Language.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    nbest=3,  # return nbest prediction and probability
)

import soundfile as sf
speech, rate = sf.read("audio.wav")

result = s2l(speech)
print(result)
# list of tuples (language, probability)
# [('<eng>', 0.9994348883628845), ('<jpn>', 0.00010286537144565955), ('<rus>', 6.185896199895069e-05)]
```
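
The examples on this page read audio with `soundfile`. OWSM models are trained on 16 kHz audio, so recordings with a different sample rate or multiple channels should be converted first. The snippet below is only an illustrative sketch; the use of `librosa` for resampling is our own choice, not part of the official examples.

```python
import librosa
import soundfile as sf

# Illustrative pre-processing sketch: convert the input to 16 kHz mono before inference.
speech, rate = sf.read("audio.wav")
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # downmix multi-channel audio to mono
if rate != 16000:
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000
```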

### Speech Recognition or Translation

We use `Speech2Text` for speech recognition or translation. Again, we pass the model tag so that the model can be downloaded automatically. When initializing this object, we set default values for `lang_sym`, `task_sym`, and `predict_time`; these can be overwritten in each call, which provides more flexibility. Note that the language must be known to use this functionality. If it is unknown, one can first perform language identification and then recognition or translation, as in the sketch after the code block below.

```python
from espnet2.bin.s2t_inference import Speech2Text
s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
    predict_time=False,
)

import soundfile as sf
speech, rate = sf.read("audio.wav")

# speech recognition with the default lang_sym and task_sym
result = s2t(speech)[0][-2]

# an optional text prompt can be passed
result = s2t(
    speech,
    text_prev="this is an optional prompt",
)[0][-2]

# lang_sym, task_sym, predict_time can be overwritten
result = s2t(
    speech,
    lang_sym="<eng>",
    task_sym="<st_zho>",  # translation into Chinese
    predict_time=True,
)[0][-2]
```
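
If the language is not known in advance, the two interfaces shown above can be chained: run `Speech2Language` first and feed its top prediction into `Speech2Text`. This is only a small sketch combining the earlier examples, assuming `s2l`, `s2t`, and `speech` are already defined as above.

```python
# Sketch: detect the language first, then recognize with the predicted language symbol.
lang_sym = s2l(speech)[0][0]  # top prediction, e.g. '<eng>'
result = s2t(speech, lang_sym=lang_sym, task_sym="<asr>")[0][-2]
print(lang_sym, result)
```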

### Long-form Speech Recognition or Translation

OWSM processes an entire audio recording chunk by chunk. Each chunk has a fixed length of 30 s, and the decoding window is shifted according to the predicted timestamps. We still use `Speech2Text`, but call its `decode_long` method.

```python
from espnet2.bin.s2t_inference import Speech2Text
s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
)

import soundfile as sf
speech, rate = sf.read("covid.wav")

result = s2t.decode_long(speech)
# list of tuples (start_time, end_time, text)
```
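
Since `decode_long` returns a list of `(start_time, end_time, text)` tuples, the segments can be printed in readable form with a short loop. This is a sketch of one possible formatting, assuming the timestamps are given in seconds.

```python
# Sketch: pretty-print the long-form result from decode_long above.
for start, end, text in result:
    print(f"[{start:7.2f}s - {end:7.2f}s] {text}")
```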

## Fine-tuning on custom data

Coming soon!

## Papers

Please cite our papers if you use OWSM in your project.

- ASRU 2023: [Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data](https://arxiv.org/abs/2309.13876)

We also collect other papers related to OWSM. <strong>Please contact Yifan Peng ([email protected]) if you use OWSM in your work and want to include it here.</strong>

<details><summary>OWSM applications</summary>
<ul>
  <li>ASRU 2023 SPARKS Workshop: <a href="https://drive.google.com/file/d/18UCCNZssZGTh92lKt7vU7IlmbW_tuFGy/view?usp=sharing">SLUE-PERB: A Spoken Language Understanding Performance Benchmark and Toolkit</a></li>
</ul>
</details>

<details><summary>Foundational work used by OWSM</summary>
<ul>
  <li>INTERSPEECH 2023: <a href="https://arxiv.org/abs/2305.11073">A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks</a></li>
  <li>SLT 2022: <a href="https://arxiv.org/abs/2210.00077">E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition</a></li>
  <li>ICML 2022: <a href="https://proceedings.mlr.press/v162/peng22a.html">Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding</a></li>
</ul>
</details>