Merge pull request #186 from pyf98/source
OWSM Project Page
sw005320 authored Jan 26, 2024
2 parents 27a1164 + e5118b1 commit 35dea1b
Showing 2 changed files with 293 additions and 2 deletions.
28 changes: 26 additions & 2 deletions _pages/open-source.md
@@ -7,10 +7,12 @@ nav: true
order: 5
---

Our lab has led and participated in the development of several open-source toolkits, projects, and datasets. Some selected ones are listed below.

### Software

<hr />

<table cellspacing="0" cellpadding="0">
<tr>
<td class="col-sm w-25">
@@ -76,8 +78,29 @@ Our lab has been led and participated in the development of several open-source
</table>
<hr />


### Projects

<hr />

<table cellspacing="0" cellpadding="0">
<tr>
<td class="col-sm w-25">
<a href="{% post_url 2024-01-01-owsm %}">
OWSM
</a>
</td>
<td>
<strong>Open Whisper-style Speech Models</strong> (<strong>OWSM</strong>, pronounced as "awesome") are a series of speech foundation models developed by WAVLab at Carnegie Mellon University. We reproduce Whisper-style training using publicly available data and our open-source toolkit ESPnet. By publicly releasing data preparation scripts, training and inference code, pre-trained model weights, and training logs, we aim to promote transparency and open science in large-scale speech pre-training.
</td></tr>
</table>
<hr />


### Datasets

<hr />

<table cellspacing="0" cellpadding="0">
<tr>
<td class="col-sm w-25">
@@ -182,4 +205,5 @@ Our lab has been led and participated in the development of several open-source
The substantive material of <strong>Totonac</strong> from the northern sierras of Puebla and adjacent areas of Veracruz was compiled starting in 2016 by Jonathan D. Amith and continues to the present as part of a joint effort by Amith and Osbel López Francisco, a native speaker biologist from Zongozotla.
</td></tr>
</table>
<hr />

267 changes: 267 additions & 0 deletions _posts/2024-01-01-owsm.md
@@ -0,0 +1,267 @@
---
layout: post
title: Open Whisper-style Speech Models (OWSM)
description: This is the project page for OWSM models.
date: 2024-01-01 00:00:00-0800
comments: false
---

## Overview

**O**pen **W**hisper-style **S**peech **M**odels (OWSM, pronounced as "awesome") are a series of speech foundation models developed by [WAVLab](https://www.wavlab.org/) at Carnegie Mellon University. We reproduce Whisper-style training using publicly available data and our open-source toolkit [ESPnet](https://github.com/espnet/espnet). By publicly releasing data preparation scripts, training and inference code, pre-trained model weights, and training logs, we aim to promote transparency and open science in large-scale speech pre-training.

## Demo

- Gradio demo: [![Static Badge](https://img.shields.io/badge/OWSM-Demo-orange)](https://pyf98-owsm-v3-demo.hf.space)
- Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zKI3ZY_OtZd6YmVeED6Cxy1QwT1mqv9O?usp=sharing)

## Pre-trained models

We publicly release a series of pre-trained models. The training logs are also available for major models. <strong>We recommend using OWSM v3.1 or later versions for better performance and efficiency.</strong>

<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Data (hours)</th>
<th>Encoder</th>
<th>Parameters</th>
<th>Model Link</th>
<th>ESPnet Recipe</th>
</tr>
</thead>
<tbody>
<tr>
<td>OWSM v1</td>
<td>38k</td>
<td>Transformer</td>
<td>272M</td>
<td><a href="https://huggingface.co/espnet/owsm_v1">espnet/owsm_v1</a></td>
<td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v1/s2t1">egs2/owsm_v1/s2t1</a></td>
</tr>
<tr>
<td>OWSM v2</td>
<td>129k</td>
<td>Transformer</td>
<td>712M</td>
<td><a href="https://huggingface.co/espnet/owsm_v2">espnet/owsm_v2</a></td>
<td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v2/s2t1">egs2/owsm_v2/s2t1</a></td>
</tr>
<tr>
<td>OWSM v2</td>
<td>129k</td>
<td>E-Branchformer</td>
<td>739M</td>
<td><a href="https://huggingface.co/espnet/owsm_v2_ebranchformer">espnet/owsm_v2_ebranchformer</a></td>
<td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v2/s2t1">egs2/owsm_v2/s2t1</a></td>
</tr>
<tr>
<td>OWSM v3</td>
<td>180k</td>
<td>Transformer</td>
<td>889M</td>
<td><a href="https://huggingface.co/espnet/owsm_v3">espnet/owsm_v3</a></td>
<td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3/s2t1">egs2/owsm_v3/s2t1</a></td>
</tr>
<tr>
<td><b>OWSM v3.1 base</b></td>
<td>180k</td>
<td>E-Branchformer</td>
<td>101M</td>
<td><a href="https://huggingface.co/espnet/owsm_v3.1_ebf_base">espnet/owsm_v3.1_ebf_base</a></td>
<td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/s2t1">egs2/owsm_v3.1/s2t1</a></td>
</tr>
<tr>
<td><b>OWSM v3.1 small</b></td>
<td>180k</td>
<td>E-Branchformer</td>
<td>367M</td>
<td><a href="">Coming soon</a></td>
<td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/s2t1">egs2/owsm_v3.1/s2t1</a></td>
</tr>
<tr>
<td><b>OWSM v3.1 medium</b></td>
<td>180k</td>
<td>E-Branchformer</td>
<td>1.02B</td>
<td><a href="https://huggingface.co/espnet/owsm_v3.1_ebf">espnet/owsm_v3.1_ebf</a></td>
<td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3.1/s2t1">egs2/owsm_v3.1/s2t1</a></td>
</tr>
<tr>
<td><b>OWSM v3.1 medium license-free</b></td>
<td>70k</td>
<td>E-Branchformer</td>
<td>1.02B</td>
<td><a href="">Coming soon</a></td>
<td><a href="">Coming soon</a></td>
</tr>
</tbody>
</table>
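
Each model link in the table is a Hugging Face repository tag. The inference classes shown later download checkpoints automatically, but the files can also be fetched ahead of time. Below is a minimal sketch using `huggingface_hub` (an assumed extra dependency; this step is entirely optional):

```python
# Optional: pre-fetch a checkpoint into the local Hugging Face cache
# (a sketch; the ESPnet inference classes below also download on demand).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="espnet/owsm_v3.1_ebf")
print(local_dir)  # directory containing the model config and weights
```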


## Data details

The latest OWSM v3.1 models are trained on a diverse combination of public datasets as listed below.

<details style="margin-bottom:1em;"><summary>OWSM v3.1 training data mixtures</summary>
<ul>
<li>AIDATATANG</li>
<li>AISHELL-1</li>
<li>AMI</li>
<li>Babel</li>
<li>Common Voice</li>
<li>Googlei18n</li>
<li>CoVoST2</li>
<li>Fisher Callhome Spanish</li>
<li>Fisher (Switchboard)</li>
<li>FLEURS</li>
<li>GigaSpeech</li>
<li>GigaST</li>
<li>KsponSpeech</li>
<li>LibriSpeech</li>
<li>MagicData</li>
<li>Multilingual LibriSpeech</li>
<li>MuST-C</li>
<li>ReazonSpeech</li>
<li>Russian Open STT</li>
<li>SPGISpeech</li>
<li>TEDLIUM3</li>
<li>VCTK</li>
<li>VoxForge</li>
<li>VoxPopuli</li>
<li>WenetSpeech</li>
</ul>
</details>

The license-free model is trained on a subset of the above data with "free licenses".

<details style="margin-bottom:1em;"><summary>OWSM v3.1 license-free data</summary>
<ul>
<li>AMI: CC-BY-4.0</li>
<li>Common Voice: CC0-1.0</li>
<li>FLEURS: CC-BY-4.0</li>
<li>KsponSpeech: MIT</li>
<li>LibriSpeech: CC-BY-4.0</li>
<li>Multilingual LibriSpeech: CC-BY-4.0</li>
<li>VCTK: CC-BY-4.0</li>
</ul>
</details>


## Inference

Similar to other ESPnet models, the pre-trained OWSM models can be easily downloaded and used in a Python script. Below are some examples.

### Language Identification

We pass the Hugging Face model tag when initializing `Speech2Language`. The model will be automatically downloaded from Hugging Face to a local cache directory.

```python
from espnet2.bin.s2t_inference_language import Speech2Language
import soundfile as sf

s2l = Speech2Language.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    nbest=3,  # return nbest predictions and probabilities
)

speech, rate = sf.read("audio.wav")

result = s2l(speech)
print(result)
# list of tuples (language, probability)
# [('<eng>', 0.9994348883628845), ('<jpn>', 0.00010286537144565955), ('<rus>', 6.185896199895069e-05)]
```
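
The example above reads audio with `soundfile`, which keeps the file's original sample rate. Like other Whisper-style models, OWSM is trained on 16 kHz audio, so recordings at other rates should be resampled first. A minimal sketch using `librosa` (an assumed extra dependency) that reuses the `s2l` object:

```python
# Load and resample to 16 kHz in one step (a sketch; assumes librosa is
# installed and that the model expects 16 kHz input, as is standard for
# Whisper-style training).
import librosa

speech, rate = librosa.load("audio.wav", sr=16000)  # librosa resamples on load
result = s2l(speech)
```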

### Speech Recognition or Translation

We use `Speech2Text` for speech recognition or translation, again passing the model tag so that the model is downloaded automatically. When initializing this object, we set default values for `lang_sym`, `task_sym`, and `predict_time`; these can be overwritten in each call, which provides more flexibility. Note that the language must be known in advance. If it is unknown, one can first perform language identification and then run recognition or translation, as shown in the sketch after the code block below.

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf

s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
    predict_time=False,
)

speech, rate = sf.read("audio.wav")

result = s2t(speech)[0][-2]

# an optional text prompt can be passed
result = s2t(
    speech,
    text_prev="this is an optional prompt",
)[0][-2]

# lang_sym, task_sym, predict_time can be overwritten
result = s2t(
    speech,
    lang_sym="<eng>",
    task_sym="<st_zho>",  # translation into Chinese
    predict_time=True,
)[0][-2]
```
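
If the language is unknown, the two classes can be chained: run language identification first and pass the predicted tag as `lang_sym`. A minimal sketch reusing the `s2l` and `s2t` objects constructed above:

```python
# Chain language identification and recognition (a sketch; reuses the
# s2l and s2t objects from the previous examples).
import soundfile as sf

speech, rate = sf.read("audio.wav")

lang_sym, prob = s2l(speech)[0]  # top-1 prediction, e.g. ('<eng>', 0.99)
result = s2t(speech, lang_sym=lang_sym, task_sym="<asr>")[0][-2]
print(lang_sym, result)
```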


### Long-form Speech Recognition or Translation

OWSM processes an entire audio recording chunk by chunk. Each chunk has a fixed length of 30 s, and the window is shifted based on the predicted timestamps. We still use `Speech2Text`, but call its `decode_long` method.

```python
from espnet2.bin.s2t_inference import Speech2Text
import soundfile as sf

s2t = Speech2Text.from_pretrained(
    model_tag="espnet/owsm_v3.1_ebf",
    device="cuda",
    beam_size=5,
    ctc_weight=0.0,
    maxlenratio=0.0,
    # below are default values which can be overwritten in __call__
    lang_sym="<eng>",
    task_sym="<asr>",
)

speech, rate = sf.read("covid.wav")

result = s2t.decode_long(speech)
# list of tuples (start_time, end_time, text)
```
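
The returned segments can be printed as a simple timestamped transcript; a short sketch, assuming the start and end times are in seconds:

```python
# Format decode_long output as a rough transcript (a sketch; assumes
# start_time and end_time are in seconds).
for start, end, text in result:
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
```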


## Fine-tuning on custom data

Coming soon!


## Papers

Please cite our papers if you use OWSM in your project.

- ASRU 2023: [Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data](https://arxiv.org/abs/2309.13876)


We also collect other papers related to OWSM. <strong>Please contact Yifan Peng ([email protected]) if you use OWSM in your work and want to include it here.</strong>

<details><summary>OWSM applications</summary>
<ul>
<li>ASRU 2023 SPARKS Workshop: <a href="https://drive.google.com/file/d/18UCCNZssZGTh92lKt7vU7IlmbW_tuFGy/view?usp=sharing">SLUE-PERB: A Spoken Language Understanding Performance Benchmark and Toolkit</a></li>
</ul>
</details>

<details><summary>Foundational work used by OWSM</summary>
<ul>
<li>INTERSPEECH 2023: <a href="https://arxiv.org/abs/2305.11073">A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks</a></li>
<li>SLT 2022: <a href="https://arxiv.org/abs/2210.00077">E-Branchformer: Branchformer with Enhanced Merging for Speech Recognition</a></li>
<li>ICML 2022: <a href="https://proceedings.mlr.press/v162/peng22a.html">Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding</a></li>
</ul>
</details>
