Add support for UniSpeech and UniSpeechSat models #624

Merged: 3 commits merged on Mar 6, 2024
2 changes: 2 additions & 0 deletions README.md
@@ -341,6 +341,8 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
2 changes: 2 additions & 0 deletions docs/snippets/6_supported-models.snippet
@@ -76,6 +76,8 @@
1. **[T5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)** (from Google AI) released in the repository [google-research/text-to-text-transfer-transformer](https://github.com/google-research/text-to-text-transfer-transformer/blob/main/released_checkpoints.md#t511) by Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu.
1. **[Table Transformer](https://huggingface.co/docs/transformers/model_doc/table-transformer)** (from Microsoft Research) released with the paper [PubTables-1M: Towards Comprehensive Table Extraction From Unstructured Documents](https://arxiv.org/abs/2110.00061) by Brandon Smock, Rohith Pesala, Robin Abraham.
1. **[TrOCR](https://huggingface.co/docs/transformers/model_doc/trocr)** (from Microsoft), released together with the paper [TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models](https://arxiv.org/abs/2109.10282) by Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei.
1. **[UniSpeech](https://huggingface.co/docs/transformers/model_doc/unispeech)** (from Microsoft Research) released with the paper [UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data](https://arxiv.org/abs/2101.07597) by Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang.
1. **[UniSpeechSat](https://huggingface.co/docs/transformers/model_doc/unispeech-sat)** (from Microsoft Research) released with the paper [UNISPEECH-SAT: UNIVERSAL SPEECH REPRESENTATION LEARNING WITH SPEAKER AWARE PRE-TRAINING](https://arxiv.org/abs/2110.05752) by Sanyuan Chen, Yu Wu, Chengyi Wang, Zhengyang Chen, Zhuo Chen, Shujie Liu, Jian Wu, Yao Qian, Furu Wei, Jinyu Li, Xiangzhan Yu.
1. **[Vision Transformer (ViT)](https://huggingface.co/docs/transformers/model_doc/vit)** (from Google AI) released with the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby.
1. **[ViTMatte](https://huggingface.co/docs/transformers/model_doc/vitmatte)** (from HUST-VL) released with the paper [ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers](https://arxiv.org/abs/2305.15272) by Jingfeng Yao, Xinggang Wang, Shusheng Yang, Baoyuan Wang.
1. **[VITS](https://huggingface.co/docs/transformers/model_doc/vits)** (from Kakao Enterprise) released with the paper [Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech](https://arxiv.org/abs/2106.06103) by Jaehyeon Kim, Jungil Kong, Juhee Son.
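The two entries above register UniSpeech and UniSpeechSat in the supported-models list. As a rough sketch of what feature extraction with one of these checkpoints could look like from transformers.js (the `Xenova/unispeech-sat-base` repo ID and the sample audio URL are assumptions for illustration, not part of this PR):

```js
import { AutoProcessor, AutoModel, read_audio } from '@xenova/transformers';

// Assumed ONNX export of microsoft/unispeech-sat-base (hypothetical repo ID).
const model_id = 'Xenova/unispeech-sat-base';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModel.from_pretrained(model_id);

// UniSpeech(Sat) feature extractors expect 16 kHz mono audio.
const audio = await read_audio(
  'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav', 16000,
);

const inputs = await processor(audio);
const { last_hidden_state } = await model(inputs); // [1, num_frames, hidden_size]
console.log(last_hidden_state.dims);
```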
16 changes: 15 additions & 1 deletion scripts/convert.py
@@ -103,13 +103,27 @@
'per_channel': False,
'reduce_range': False,
},
'wav2vec2': {
'per_channel': False,
'reduce_range': False,
},
'unispeech': {
'per_channel': False,
'reduce_range': False,
},
'unispeech-sat': {
'per_channel': False,
'reduce_range': False,
},
}

MODELS_WITHOUT_TOKENIZERS = [
'wav2vec2',
'wav2vec2-bert',
'wavlm',
'hubert',
'unispeech',
'unispeech-sat',
]


@@ -386,7 +400,7 @@ def main():
**get_main_export_kwargs(config, "automatic-speech-recognition")
)

elif config.model_type in ('wav2vec2', 'wav2vec2-bert', 'hubert'):
elif config.model_type in ('wav2vec2', 'wav2vec2-bert', 'hubert', 'unispeech', 'unispeech-sat'):
if tokenizer is not None:
from .extra.wav2vec2 import generate_tokenizer_json
tokenizer_json = generate_tokenizer_json(tokenizer)
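As with the other wav2vec2-style encoders, per-channel quantization and reduced range are disabled for the new model types, and both are added to `MODELS_WITHOUT_TOKENIZERS` so conversion does not fail when a checkpoint ships without a tokenizer. Once a checkpoint has been converted, the standard transformers.js v2 `quantized` option picks between the full-precision and 8-bit ONNX weights at load time; a minimal sketch (the repo ID is an assumed export of microsoft/unispeech-large-1500h-cv):

```js
import { AutoModel } from '@xenova/transformers';

// Hypothetical ONNX export of microsoft/unispeech-large-1500h-cv.
const model_id = 'Xenova/unispeech-large-1500h-cv';

// Default: loads the 8-bit weights produced with the settings above (model_quantized.onnx).
const quantized = await AutoModel.from_pretrained(model_id);

// Opt out of quantization to load the full-precision weights (model.onnx).
const fp32 = await AutoModel.from_pretrained(model_id, { quantized: false });
```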
50 changes: 50 additions & 0 deletions scripts/supported_models.py
@@ -924,6 +924,45 @@
'microsoft/trocr-base-handwritten',
],
},
'unispeech': {
# Feature extraction
'feature-extraction': [
# Requires --task feature-extraction
'microsoft/unispeech-large-1500h-cv',
],
# TODO: add support for
# # Automatic speech recognition
# 'automatic-speech-recognition': [
# 'microsoft/unispeech-1350-en-353-fr-ft-1h',
# 'microsoft/unispeech-1350-en-17h-ky-ft-1h',
# 'microsoft/unispeech-1350-en-90-it-ft-1h',
# 'microsoft/unispeech-1350-en-168-es-ft-1h',
# ],
},
'unispeech-sat': {
# Feature extraction
'feature-extraction': [
# Requires --task feature-extraction
'microsoft/unispeech-sat-base',
],

# Audio XVector (e.g., for speaker verification)
'audio-xvector': [
'microsoft/unispeech-sat-base-plus-sv',
'microsoft/unispeech-sat-base-sv',
'microsoft/unispeech-sat-large-sv',
],

# Audio frame classification
'audio-frame-classification': [
'microsoft/unispeech-sat-base-plus-sd',
],

# Automatic speech recognition
'automatic-speech-recognition': [
'microsoft/unispeech-sat-base-100h-libri-ft',
],
},
'vision-encoder-decoder': {
# Image-to-text
'image-to-text': [
@@ -993,6 +1032,11 @@
'facebook/mms-lid-4017',
],

# Audio frame classification
'audio-frame-classification': [
'anton-l/wav2vec2-base-superb-sd',
],

# Automatic speech recognition
'automatic-speech-recognition': [
'jonatasgrosman/wav2vec2-large-xlsr-53-english',
@@ -1020,6 +1064,12 @@
'microsoft/wavlm-large',
],

# Audio frame classification
'audio-frame-classification': [
'anton-l/wav2vec2-base-superb-sd',
'microsoft/wavlm-base-plus-sd',
],

# Audio XVector (e.g., for speaker verification)
'audio-xvector': [
'microsoft/wavlm-base-plus-sv',
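The UniSpeechSat entry also lists speaker-diarization checkpoints under `audio-frame-classification`, which pairs with the new `UniSpeechSatForAudioFrameClassification` class added in `src/models.js` below. A minimal sketch of frame-level speaker activity, assuming an ONNX export of microsoft/unispeech-sat-base-plus-sd is published under the (hypothetical) `Xenova/unispeech-sat-base-plus-sd` ID:

```js
import { AutoProcessor, UniSpeechSatForAudioFrameClassification, read_audio } from '@xenova/transformers';

const model_id = 'Xenova/unispeech-sat-base-plus-sd'; // assumed repo ID
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await UniSpeechSatForAudioFrameClassification.from_pretrained(model_id);

const audio = await read_audio(
  'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav', 16000,
);
const inputs = await processor(audio);

// logits have shape [batch, frames, num_speakers]; a per-frame sigmoid gives speaker activity.
const { logits } = await model(inputs);
const probs = Array.from(logits.data, x => 1 / (1 + Math.exp(-x)));
console.log(logits.dims, probs.slice(0, 4));
```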
103 changes: 102 additions & 1 deletion src/models.js
@@ -4574,7 +4574,97 @@ export class Wav2Vec2ForSequenceClassification extends Wav2Vec2PreTrainedModel {
//////////////////////////////////////////////////

//////////////////////////////////////////////////
// Wav2Vec2 models
// UniSpeech models
export class UniSpeechPreTrainedModel extends PreTrainedModel { };

/**
* The bare UniSpeech Model transformer outputting raw hidden-states without any specific head on top.
*/
export class UniSpeechModel extends UniSpeechPreTrainedModel { }

/**
* UniSpeech Model with a `language modeling` head on top for Connectionist Temporal Classification (CTC).
*/
export class UniSpeechForCTC extends UniSpeechPreTrainedModel {
/**
* @param {Object} model_inputs
* @param {Tensor} model_inputs.input_values Float values of input raw speech waveform.
* @param {Tensor} model_inputs.attention_mask Mask to avoid performing convolution and attention on padding token indices. Mask values selected in [0, 1]
*/
async _call(model_inputs) {
return new CausalLMOutput(await super._call(model_inputs));
}
}

/**
* UniSpeech Model with a sequence classification head on top (a linear layer over the pooled output).
*/
export class UniSpeechForSequenceClassification extends UniSpeechPreTrainedModel {
/**
* Calls the model on new inputs.
* @param {Object} model_inputs The inputs to the model.
* @returns {Promise<SequenceClassifierOutput>} An object containing the model's output logits for sequence classification.
*/
async _call(model_inputs) {
return new SequenceClassifierOutput(await super._call(model_inputs));
}
}
//////////////////////////////////////////////////

//////////////////////////////////////////////////
// UniSpeechSat models
export class UniSpeechSatPreTrainedModel extends PreTrainedModel { };

/**
* The bare UniSpeechSat Model transformer outputting raw hidden-states without any specific head on top.
*/
export class UniSpeechSatModel extends UniSpeechSatPreTrainedModel { }

/**
* UniSpeechSat Model with a `language modeling` head on top for Connectionist Temporal Classification (CTC).
*/
export class UniSpeechSatForCTC extends UniSpeechSatPreTrainedModel {
/**
* @param {Object} model_inputs
* @param {Tensor} model_inputs.input_values Float values of input raw speech waveform.
* @param {Tensor} model_inputs.attention_mask Mask to avoid performing convolution and attention on padding token indices. Mask values selected in [0, 1]
*/
async _call(model_inputs) {
return new CausalLMOutput(await super._call(model_inputs));
}
}

/**
* UniSpeechSat Model with a sequence classification head on top (a linear layer over the pooled output).
*/
export class UniSpeechSatForSequenceClassification extends UniSpeechSatPreTrainedModel {
/**
* Calls the model on new inputs.
* @param {Object} model_inputs The inputs to the model.
* @returns {Promise<SequenceClassifierOutput>} An object containing the model's output logits for sequence classification.
*/
async _call(model_inputs) {
return new SequenceClassifierOutput(await super._call(model_inputs));
}
}

/**
* UniSpeechSat Model with a frame classification head on top for tasks like Speaker Diarization.
*/
export class UniSpeechSatForAudioFrameClassification extends UniSpeechSatPreTrainedModel {
/**
* Calls the model on new inputs.
* @param {Object} model_inputs The inputs to the model.
* @returns {Promise<TokenClassifierOutput>} An object containing the model's output logits for audio frame classification.
*/
async _call(model_inputs) {
return new TokenClassifierOutput(await super._call(model_inputs));
}
}
//////////////////////////////////////////////////

//////////////////////////////////////////////////
// Wav2Vec2Bert models
export class Wav2Vec2BertPreTrainedModel extends PreTrainedModel { };

/**
@@ -5289,6 +5379,8 @@ const MODEL_MAPPING_NAMES_ENCODER_ONLY = new Map([
['squeezebert', ['SqueezeBertModel', SqueezeBertModel]],
['wav2vec2', ['Wav2Vec2Model', Wav2Vec2Model]],
['wav2vec2-bert', ['Wav2Vec2BertModel', Wav2Vec2BertModel]],
['unispeech', ['UniSpeechModel', UniSpeechModel]],
['unispeech-sat', ['UniSpeechSatModel', UniSpeechSatModel]],
['hubert', ['HubertModel', HubertModel]],
['wavlm', ['WavLMModel', WavLMModel]],
['audio-spectrogram-transformer', ['ASTModel', ASTModel]],
@@ -5514,13 +5606,17 @@ const MODEL_FOR_MASK_GENERATION_MAPPING_NAMES = new Map([
const MODEL_FOR_CTC_MAPPING_NAMES = new Map([
['wav2vec2', ['Wav2Vec2ForCTC', Wav2Vec2ForCTC]],
['wav2vec2-bert', ['Wav2Vec2BertForCTC', Wav2Vec2BertForCTC]],
['unispeech', ['UniSpeechForCTC', UniSpeechForCTC]],
['unispeech-sat', ['UniSpeechSatForCTC', UniSpeechSatForCTC]],
['wavlm', ['WavLMForCTC', WavLMForCTC]],
['hubert', ['HubertForCTC', HubertForCTC]],
]);

const MODEL_FOR_AUDIO_CLASSIFICATION_MAPPING_NAMES = new Map([
['wav2vec2', ['Wav2Vec2ForSequenceClassification', Wav2Vec2ForSequenceClassification]],
['wav2vec2-bert', ['Wav2Vec2BertForSequenceClassification', Wav2Vec2BertForSequenceClassification]],
['unispeech', ['UniSpeechForSequenceClassification', UniSpeechForSequenceClassification]],
['unispeech-sat', ['UniSpeechSatForSequenceClassification', UniSpeechSatForSequenceClassification]],
['wavlm', ['WavLMForSequenceClassification', WavLMForSequenceClassification]],
['hubert', ['HubertForSequenceClassification', HubertForSequenceClassification]],
['audio-spectrogram-transformer', ['ASTForAudioClassification', ASTForAudioClassification]],
@@ -5530,6 +5626,10 @@ const MODEL_FOR_AUDIO_XVECTOR_MAPPING_NAMES = new Map([
['wavlm', ['WavLMForXVector', WavLMForXVector]],
]);

const MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING_NAMES = new Map([
['unispeech-sat', ['UniSpeechSatForAudioFrameClassification', UniSpeechSatForAudioFrameClassification]],
]);

const MODEL_FOR_IMAGE_MATTING_MAPPING_NAMES = new Map([
['vitmatte', ['VitMatteForImageMatting', VitMatteForImageMatting]],
]);
@@ -5571,6 +5671,7 @@ const MODEL_CLASS_TYPE_MAPPING = [
[MODEL_FOR_TEXT_TO_SPECTROGRAM_MAPPING_NAMES, MODEL_TYPES.Seq2Seq],
[MODEL_FOR_TEXT_TO_WAVEFORM_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_AUDIO_XVECTOR_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
[MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING_NAMES, MODEL_TYPES.EncoderOnly],
];

for (const [mappings, type] of MODEL_CLASS_TYPE_MAPPING) {
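Registering the model types in `MODEL_MAPPING_NAMES_ENCODER_ONLY` (and in the task-specific maps above) is what lets the `Auto*` classes resolve `"model_type": "unispeech-sat"` from a repo's `config.json` to the new classes. A quick sketch of that behaviour (the repo ID is again an assumed export of microsoft/unispeech-sat-base):

```js
import { AutoModel, UniSpeechSatModel } from '@xenova/transformers';

// AutoModel reads config.json, sees "model_type": "unispeech-sat", and
// instantiates the class registered in MODEL_MAPPING_NAMES_ENCODER_ONLY.
const model = await AutoModel.from_pretrained('Xenova/unispeech-sat-base'); // assumed repo ID
console.log(model instanceof UniSpeechSatModel); // true
```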
2 changes: 2 additions & 0 deletions src/pipelines.js
@@ -1526,6 +1526,8 @@ export class AutomaticSpeechRecognitionPipeline extends (/** @type {new (options
return this._call_whisper(audio, kwargs)
case 'wav2vec2':
case 'wav2vec2-bert':
case 'unispeech':
case 'unispeech-sat':
case 'hubert':
return this._call_wav2vec2(audio, kwargs)
default:
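With the two new cases in the switch, UniSpeech and UniSpeechSat CTC checkpoints go through the same decoding path as wav2vec2. A usage sketch, assuming microsoft/unispeech-sat-base-100h-libri-ft has been converted and published under a (hypothetical) `Xenova/unispeech-sat-base-100h-libri-ft` ID:

```js
import { pipeline } from '@xenova/transformers';

// Assumed ONNX export of microsoft/unispeech-sat-base-100h-libri-ft.
const transcriber = await pipeline(
  'automatic-speech-recognition',
  'Xenova/unispeech-sat-base-100h-libri-ft',
);

const output = await transcriber(
  'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav',
);
console.log(output.text);
```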