Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation.
Whisper was proposed in the paper "Robust Speech Recognition via Large-Scale Weak Supervision" by Alec Radford et al. from OpenAI. The original code is available in OpenAI's Whisper repository on GitHub.
Whisper is released as ten checkpoints, all available on the Hugging Face Hub. The four smaller sizes come in both English-only and multilingual versions, while large and large-v2 are multilingual only. The checkpoints are summarised in the following table:
Size | Parameters | English-only | Multilingual |
---|---|---|---|
tiny | 39 M | ✓ | ✓ |
base | 74 M | ✓ | ✓ |
small | 244 M | ✓ | ✓ |
medium | 769 M | ✓ | ✓ |
large | 1550 M | x | ✓ |
large-v2 | 1550 M | x | ✓ |
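
As a rough illustration of how the table maps to checkpoints on the Hub, the sketch below builds a checkpoint ID from a size and a language flag, then loads it with the `transformers` library. The helper names `whisper_checkpoint_id` and `load_whisper` are hypothetical, and the naming pattern (`openai/whisper-<size>`, with a `.en` suffix for English-only checkpoints) is an assumption based on how these checkpoints are published on the Hub; check the Hub listing before relying on it.

```python
# Sizes taken from the table above. large and large-v2 are multilingual only.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}
MULTILINGUAL_SIZES = ENGLISH_ONLY_SIZES | {"large", "large-v2"}


def whisper_checkpoint_id(size: str, english_only: bool = False) -> str:
    """Return the assumed Hub ID for a Whisper checkpoint (hypothetical helper)."""
    allowed = ENGLISH_ONLY_SIZES if english_only else MULTILINGUAL_SIZES
    if size not in allowed:
        kind = "English-only" if english_only else "multilingual"
        raise ValueError(f"no {kind} checkpoint for size {size!r}")
    # Assumed naming convention: English-only checkpoints carry a ".en" suffix.
    suffix = ".en" if english_only else ""
    return f"openai/whisper-{size}{suffix}"


def load_whisper(size: str, english_only: bool = False):
    """Load the processor and model; needs `transformers` and network access."""
    # Imported lazily so the ID helper works without transformers installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    checkpoint = whisper_checkpoint_id(size, english_only)
    processor = WhisperProcessor.from_pretrained(checkpoint)
    model = WhisperForConditionalGeneration.from_pretrained(checkpoint)
    return processor, model
```

For example, `load_whisper("tiny", english_only=True)` would download and return the processor and model for the smallest English-only checkpoint, while `whisper_checkpoint_id("large", english_only=True)` raises a `ValueError` because the table lists no such checkpoint.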