---
layout: post
title: Open Whisper-style Speech Models (OWSM)
description: This is the project page for OWSM models.
date: 2024-01-01 00:00:00-0800
comments: false
---

## Overview

The **O**pen **W**hisper-style **S**peech **M**odels (OWSM, pronounced "awesome") are a series of speech foundation models developed by [WAVLab](https://www.wavlab.org/) at Carnegie Mellon University. They reproduce Whisper-style training using publicly available data and the open-source toolkit [ESPnet](https://github.com/espnet/espnet). By publicly releasing data preparation scripts, training and inference code, pre-trained model weights, and training logs, we aim to promote transparency and open science in large-scale speech pre-training.

## News

## Demo pages

- Gradio demo: [![Static Badge](https://img.shields.io/badge/OWSM-Demo-orange)](https://pyf98-owsm-v3-demo.hf.space)
- Colab notebook: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1zKI3ZY_OtZd6YmVeED6Cxy1QwT1mqv9O?usp=sharing)

## Pre-trained models

We have released various pre-trained models. Training logs are also available for the major models.
<table class="table">
  <thead>
    <tr>
      <th>Name</th>
      <th>Encoder</th>
      <th>Parameters</th>
      <th>Data (hours)</th>
      <th>Model Link</th>
      <th>ESPnet Recipe</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>OWSM v1</td>
      <td>Transformer</td>
      <td>272M</td>
      <td>38k</td>
      <td><a href="https://huggingface.co/espnet/owsm_v1">espnet/owsm_v1</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v1/s2t1">egs2/owsm_v1/s2t1</a></td>
    </tr>
    <tr>
      <td>OWSM v2</td>
      <td>Transformer</td>
      <td>712M</td>
      <td>129k</td>
      <td><a href="https://huggingface.co/espnet/owsm_v2">espnet/owsm_v2</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v2/s2t1">egs2/owsm_v2/s2t1</a></td>
    </tr>
    <tr>
      <td>OWSM v2</td>
      <td>E-Branchformer</td>
      <td>739M</td>
      <td>129k</td>
      <td><a href="https://huggingface.co/espnet/owsm_v2_ebranchformer">espnet/owsm_v2_ebranchformer</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v2/s2t1">egs2/owsm_v2/s2t1</a></td>
    </tr>
    <tr>
      <td>OWSM v3</td>
      <td>Transformer</td>
      <td>889M</td>
      <td>180k</td>
      <td><a href="https://huggingface.co/espnet/owsm_v3">espnet/owsm_v3</a></td>
      <td><a href="https://github.com/espnet/espnet/tree/master/egs2/owsm_v3/s2t1">egs2/owsm_v3/s2t1</a></td>
    </tr>
    <tr>
      <td>OWSM v3.1</td>
      <td>E-Branchformer</td>
      <td>1.02B</td>
      <td>180k</td>
      <td><a href="https://huggingface.co/espnet/owsm_v3.1_ebf">espnet/owsm_v3.1_ebf</a></td>
      <td>TBD</td>
    </tr>
  </tbody>
</table>
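
For reference, below is a minimal sketch of fetching a checkpoint from Hugging Face with `huggingface_hub`; the model tag is one example from the table above. The inference helpers in the next section download checkpoints automatically, so this step is only needed if you want the raw files.

```python
# A minimal sketch of downloading an OWSM checkpoint; the model tag is one
# example from the table above, and huggingface_hub must be installed.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="espnet/owsm_v3.1_ebf")
print(local_dir)  # local directory containing the config and model weights
```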

## Inference

### Language Identification
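
A minimal sketch of predicting the spoken language, assuming the `Speech2Language` helper used in the OWSM demos; the module path, arguments, and output format are assumptions to verify against your installed ESPnet version.

```python
# A minimal sketch, assuming the Speech2Language helper used in the OWSM demos;
# the module path, arguments, and output format are assumptions to verify
# against your ESPnet version.
import soundfile as sf
from espnet2.bin.s2t_inference_language import Speech2Language

s2l = Speech2Language.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    device="cpu",  # use "cuda" if a GPU is available
    nbest=3,       # return the three most likely language tokens
)

speech, rate = sf.read("sample.wav")
# Expected output: a list of (language token, probability) pairs, e.g. ("<eng>", 0.98).
print(s2l(speech))
```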

### Speech Recognition
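
Below is a minimal sketch of transcribing a short recording with a pre-trained OWSM model through ESPnet's `Speech2Text` S2T inference interface. The model tag, audio path, and decoding options are placeholders; check the model card of the checkpoint you use for the recommended settings.

```python
# A minimal sketch, assuming ESPnet's S2T inference interface; the model tag,
# audio path, and decoding options are placeholders to adapt to your setup.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",  # any model tag from the table above
    device="cpu",            # use "cuda" if a GPU is available
    beam_size=5,
    lang_sym="<eng>",        # language spoken in the input audio
    task_sym="<asr>",        # task token for speech recognition
)

speech, rate = sf.read("sample.wav")  # 16 kHz mono audio expected
results = s2t(speech)
text = results[0][0]  # best hypothesis; tuple layout may differ across ESPnet versions
print(text)
```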

### Speech Translation
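
Speech translation uses the same interface as speech recognition, with the language and task tokens switched. The specific symbols below (`<zho>`, `<st_eng>`) are example assumptions; the supported set depends on the checkpoint.

```python
# A minimal sketch reusing the Speech2Text interface with the task token
# switched to speech translation; "<zho>" and "<st_eng>" are example
# assumptions, so check the model card for the supported symbols.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    device="cpu",
    beam_size=5,
    lang_sym="<zho>",     # language spoken in the input audio
    task_sym="<st_eng>",  # translate into English
)

speech, rate = sf.read("chinese_sample.wav")
print(s2t(speech)[0][0])  # best translation hypothesis
```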

### Long-form Speech Recognition or Translation
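
Like Whisper, OWSM is trained on fixed 30-second segments, so longer recordings need to be decoded window by window. The sketch below is a plain fixed-window chunking loop for illustration, not ESPnet's built-in long-form decoding; the window length, lack of overlap, and naive merging are simplifications.

```python
# A minimal sketch of long-form decoding via fixed 30-second windows. This is a
# plain chunking loop for illustration, not ESPnet's built-in long-form decoding;
# the window length, lack of overlap, and naive merging are simplifications.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

s2t = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    device="cpu",
    beam_size=5,
    lang_sym="<eng>",
    task_sym="<asr>",  # or a translation task token, as shown above
)

speech, rate = sf.read("long_recording.wav")  # 16 kHz mono audio expected
window = 30 * rate  # OWSM consumes 30-second segments, like Whisper

pieces = []
for start in range(0, len(speech), window):
    chunk = speech[start : start + window]
    pieces.append(s2t(chunk)[0][0])  # best hypothesis per window

print(" ".join(pieces))
```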

## Fine-tuning

## Citations