forked from espnet/espnet
Merge pull request espnet#5372 from Masao-Someki/feature/espnetez
Add espnetez
Showing 19 changed files with 1,509 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
# LoRA finetune related | ||
use_lora: true | ||
|
||
rir_scp: null | ||
rir_apply_prob: 1.0 | ||
noise_scp: null | ||
noise_apply_prob: 1.0 | ||
noise_db_range: '13_15' | ||
speech_volume_normalize: null | ||
non_linguistic_symbols: null | ||
|
||
preprocessor_conf: | ||
speech_name: speech | ||
text_name: text | ||
|
||
# training related | ||
seed: 2022 | ||
num_workers: 4 | ||
ngpu: 1 | ||
batch_type: numel | ||
batch_bins: 1600000 | ||
accum_grad: 4 | ||
max_epoch: 70 | ||
patience: null | ||
init: null | ||
best_model_criterion: | ||
- - valid | ||
- acc | ||
- max | ||
keep_nbest_models: 10 | ||
use_amp: true | ||
|
||
optim: adam | ||
optim_conf: | ||
lr: 0.002 | ||
weight_decay: 0.000001 | ||
scheduler: warmuplr | ||
scheduler_conf: | ||
warmup_steps: 15000 | ||
|
||
specaug: specaug | ||
specaug_conf: | ||
apply_time_warp: true | ||
time_warp_window: 5 | ||
time_warp_mode: bicubic | ||
apply_freq_mask: true | ||
freq_mask_width_range: | ||
- 0 | ||
- 27 | ||
num_freq_mask: 2 | ||
apply_time_mask: true | ||
time_mask_width_ratio_range: | ||
- 0. | ||
- 0.05 | ||
num_time_mask: 5 |
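
The `scheduler: warmuplr` setting with `warmup_steps: 15000` and `lr: 0.002` can be pictured with a short sketch. It assumes the scheduler follows the usual Noam-style rule (linear ramp to the base rate, then inverse-square-root decay); that formula and the helper name `warmup_lr` are assumptions for illustration, not something this config file states, so check the ESPnet scheduler documentation before relying on it.

```python
def warmup_lr(step, base_lr=0.002, warmup_steps=15000):
    """Assumed Noam-style warmup: linear ramp to base_lr, then 1/sqrt(step) decay.

    Defaults mirror `optim_conf.lr` and `scheduler_conf.warmup_steps` above.
    """
    if step < 1:
        raise ValueError("step must be >= 1")
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Print the learning rate at a few optimizer steps.
for step in (1, 7500, 15000, 60000):
    print(f"step {step:>6}: lr = {warmup_lr(step):.6f}")
```

Under this assumption the rate peaks at `lr` exactly when `step == warmup_steps`, so `lr` in the config acts as the peak value rather than the starting value.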
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,328 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Sample demo for ESPnet-Easy!\n", | ||
"In this notebook, we will demonstrate how to train an Automatic Speech Recognition (ASR) model using the Librispeech-100 dataset. The process in this notebook follows the same dataset preparation approach as the kaldi-style dataset. If you are interested in fine-tuning pretrained models, please refer to the libri100_finetune.ipynb file.\n", | ||
"\n", | ||
"Before proceeding, please ensure that you have already downloaded the Librispeech-100 dataset from [OpenSLR](https://www.openslr.org/12) and have placed the data in a directory of your choice. In this notebook, we assume that you have stored the dataset in the `/hdd/dataset/` directory. If your dataset is located in a different directory, please make sure to replace `/hdd/dataset/` with the actual path to your dataset." | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Data Preparation\n", | ||
"\n", | ||
"This notebook follows the data preparation steps outlined in `asr.sh`. Initially, we will create a dump file to store information about the data, including the data ID, audio path, and transcriptions.\n", | ||
"\n", | ||
"ESPnet-Easy supports various types of datasets, including:\n", | ||
"\n", | ||
"1. Dictionary-based dataset with the following structure:\n", | ||
" ```python\n", | ||
" {\n", | ||
" \"data_id\": {\n", | ||
" \"speech\": path_to_speech_file,\n", | ||
" \"text\": transcription\n", | ||
" }\n", | ||
" }\n", | ||
" ```\n", | ||
"\n", | ||
"2. List of datasets with the following structure:\n", | ||
" ```python\n", | ||
" [\n", | ||
" {\n", | ||
" \"speech\": path_to_speech_file,\n", | ||
" \"text\": transcription\n", | ||
" }\n", | ||
" ]\n", | ||
" ```\n", | ||
"\n", | ||
"If you choose to use a dictionary-based dataset, it's essential to ensure that each `data_id` is unique. ESPnet-Easy also accepts a dump file that may have already been created by `asr.sh`. However, in this notebook, we will create the dump file from scratch." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Need to install espnet if you don't have it\n", | ||
"%pip install -U ../../\n", | ||
"%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --no-cache" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now, let's create dump files! \n", | ||
"Please note that you will need to provide a dictionary to specify the file path and type for each data.\n", | ||
"This dictionary should have the following format:\n", | ||
"\n", | ||
"```python\n", | ||
"{\n", | ||
" \"data_name\": [\"dump_file_name\", \"dump_format\"]\n", | ||
"}\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"import glob\n", | ||
"\n", | ||
"import espnetez as ez\n", | ||
"\n", | ||
"\n", | ||
"DUMP_DIR = \"./dump/libri100\"\n", | ||
"LIBRI_100_DIRS = [\n", | ||
" [\"/hdd/database/librispeech-100/LibriSpeech/train-clean-100\", \"train\"],\n", | ||
" [\"/hdd/database/librispeech-100/LibriSpeech/dev-clean\", \"dev-clean\"],\n", | ||
" [\"/hdd/database/librispeech-100/LibriSpeech/dev-other\", \"dev-other\"],\n", | ||
"]\n", | ||
"data_info = {\n", | ||
" \"speech\": [\"wav.scp\", \"sound\"],\n", | ||
" \"text\": [\"text\", \"text\"],\n", | ||
"}\n", | ||
"\n", | ||
"\n", | ||
"def create_dataset(data_dir):\n", | ||
" dataset = {}\n", | ||
" for chapter in glob.glob(os.path.join(data_dir, \"*/*\")):\n", | ||
" text_file = glob.glob(os.path.join(chapter, \"*.txt\"))[0]\n", | ||
"\n", | ||
" with open(text_file, \"r\") as f:\n", | ||
" lines = f.readlines()\n", | ||
"\n", | ||
" ids_text = {\n", | ||
" line.split(\" \")[0]: line.split(\" \", maxsplit=1)[1].replace(\"\\n\", \"\")\n", | ||
" for line in lines\n", | ||
" }\n", | ||
" audio_files = glob.glob(os.path.join(chapter, \"*.wav\"))\n", | ||
" for audio_file in audio_files:\n", | ||
" audio_id = os.path.basename(audio_file)[: -len(\".wav\")]\n", | ||
" dataset[audio_id] = {\n", | ||
" \"speech\": audio_file,\n", | ||
" \"text\": ids_text[audio_id]\n", | ||
" }\n", | ||
" return dataset\n", | ||
"\n", | ||
"\n", | ||
"for d, n in LIBRI_100_DIRS:\n", | ||
" dump_dir = os.path.join(DUMP_DIR, n)\n", | ||
" if not os.path.exists(dump_dir):\n", | ||
" os.makedirs(dump_dir)\n", | ||
"\n", | ||
" dataset = create_dataset(d)\n", | ||
" ez.data.create_dump_file(dump_dir, dataset, data_info)" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For the validation files, you have two directories: `dev-clean` and `dev-other`.\n", | ||
"To create a unified dev dataset, you can use the `ez.data.join_dumps` function." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"ez.data.join_dumps(\n", | ||
" [\"./dump/libri100/dev-clean\", \"./dump/libri100/dev-other\"], \"./dump/libri100/dev\"\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now you have dataset files in the `dump` directory.\n", | ||
"It looks like this:\n", | ||
"\n", | ||
"wav.scp\n", | ||
"```\n", | ||
"1255-138279-0008 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0008.flac\n", | ||
"1255-138279-0022 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0022.flac\n", | ||
"```\n", | ||
"\n", | ||
"text\n", | ||
"```\n", | ||
"1255-138279-0008 TWO THREE\n", | ||
"1255-138279-0022 IF I SAID SO OF COURSE I WILL\n", | ||
"```\n" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Train sentencepiece model\n", | ||
"\n", | ||
"To train a SentencePiece model, we require a text file for training. Let's begin by creating the training file." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# generate training texts from the training data\n", | ||
"# you can select several datasets to train sentencepiece.\n", | ||
"ez.preprocess.prepare_sentences([\"dump/libri100/train/text\"], \"dump/spm\")\n", | ||
"\n", | ||
"ez.preprocess.train_sentencepiece(\n", | ||
" \"dump/spm/train.txt\",\n", | ||
" \"data/bpemodel\",\n", | ||
" vocab_size=5000,\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Configure Training Process\n", | ||
"\n", | ||
"For configuring the training process, you can utilize the configuration files already provided by ESPnet contributors. To use a configuration file, you'll need to create a YAML file on your local machine. For instance, you can use the [e-branchformer config](train_asr_e-branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml).\n", | ||
"\n", | ||
"In my case, I've made a modification to the `batch_bins` parameter, changing it from `16000000` to `1600000` to run training on my GPU (RTX2080ti)." | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Training\n", | ||
"\n", | ||
"To prepare the stats file before training, you can execute the `collect_stats` method. This step is required before the training process and ensuring accurate statistics for the model." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import espnetez as ez\n", | ||
"\n", | ||
"EXP_DIR = \"exp/train_asr_branchformer_e24_amp\"\n", | ||
"STATS_DIR = \"exp/stats\"\n", | ||
"\n", | ||
"# load config\n", | ||
"training_config = ez.config.from_yaml(\n", | ||
" \"asr\",\n", | ||
" \"train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml\",\n", | ||
")\n", | ||
"preprocessor_config = ez.utils.load_yaml(\"preprocess.yaml\")\n", | ||
"training_config.update(preprocessor_config)\n", | ||
"\n", | ||
"with open(preprocessor_config[\"token_list\"], \"r\") as f:\n", | ||
" training_config[\"token_list\"] = [t.replace(\"\\n\", \"\") for t in f.readlines()]\n", | ||
"\n", | ||
"# Define the Trainer class\n", | ||
"trainer = ez.Trainer(\n", | ||
" task='asr',\n", | ||
" train_config=training_config,\n", | ||
" train_dump_dir=\"dump/libri100/train\",\n", | ||
" valid_dump_dir=\"dump/libri100/dev\",\n", | ||
" data_info=data_info,\n", | ||
" output_dir=EXP_DIR,\n", | ||
" stats_dir=STATS_DIR,\n", | ||
" ngpu=1,\n", | ||
")\n", | ||
"trainer.collect_stats()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, we are ready to begin the training process!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"trainer.train()" | ||
] | ||
}, | ||
{ | ||
"attachments": {}, | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Inference\n", | ||
"You can just use the inference API of the ESPnet." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import librosa\n", | ||
"from espnet2.bin.asr_inference import Speech2Text\n", | ||
"\n", | ||
"m = Speech2Text(\n", | ||
" \"./exp/train_asr_branchformer_e24_amp/config.yaml\",\n", | ||
"\t\"./exp/train_asr_branchformer_e24_amp/valid.acc.best.pth\",\n", | ||
"\tbeam_size=10\n", | ||
")\n", | ||
"\n", | ||
"with open(\"./dump/libri100/dev/wav.scp\", \"r\") as f:\n", | ||
" sample_path = f.readlines()[0]\n", | ||
" \n", | ||
"y, sr = librosa.load(sample_path.split()[1], sr=16000, mono=True)\n", | ||
"output = m(y)\n", | ||
"print(output[0][0])\n" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.10" | ||
}, | ||
"orig_nbformat": 4 | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
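
The dump files shown in the notebook (`wav.scp` and `text`) are plain text files with one `<id> <value>` pair per line. The sketch below illustrates only that file format, using the standard library. It is not the implementation of `ez.data.create_dump_file`: the helpers `write_dump` and `read_dump` and the sample audio paths are hypothetical, and the real function may handle formats, ordering, and validation differently.

```python
import os
import tempfile


def write_dump(dump_dir, dataset, data_info):
    """Hypothetical helper: write one '<id> <value>' file per data_info entry."""
    os.makedirs(dump_dir, exist_ok=True)
    for data_name, (file_name, _dump_format) in data_info.items():
        path = os.path.join(dump_dir, file_name)
        with open(path, "w", encoding="utf-8") as f:
            # Sort IDs so every file lists utterances in the same order.
            for data_id in sorted(dataset):
                f.write(f"{data_id} {dataset[data_id][data_name]}\n")


def read_dump(path):
    """Hypothetical helper: read a '<id> <value>' file back into a dict."""
    entries = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            data_id, value = line.rstrip("\n").split(" ", maxsplit=1)
            entries[data_id] = value
    return entries


# Dictionary-based dataset; the audio paths are made up for illustration.
dataset = {
    "1255-138279-0022": {
        "speech": "/path/to/1255-138279-0022.flac",
        "text": "IF I SAID SO OF COURSE I WILL",
    },
    "1255-138279-0008": {
        "speech": "/path/to/1255-138279-0008.flac",
        "text": "TWO THREE",
    },
}
data_info = {
    "speech": ["wav.scp", "sound"],
    "text": ["text", "text"],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    write_dump(tmp_dir, dataset, data_info)
    wav_scp = read_dump(os.path.join(tmp_dir, "wav.scp"))
    text = read_dump(os.path.join(tmp_dir, "text"))

print(wav_scp)
print(text)
```

Splitting each line on the first space only keeps multi-word transcriptions intact, which is why the utterance ID must not contain spaces.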