Merge pull request espnet#5372 from Masao-Someki/feature/espnetez
Add espnetez
sw005320 authored Dec 9, 2023
2 parents 8ad0d3e + c788444 commit a4c5547
Showing 19 changed files with 1,509 additions and 0 deletions.
55 changes: 55 additions & 0 deletions egsez/asr/finetune_with_lora.yaml
@@ -0,0 +1,55 @@
# LoRA finetune related
use_lora: true

rir_scp: null
rir_apply_prob: 1.0
noise_scp: null
noise_apply_prob: 1.0
noise_db_range: '13_15'
speech_volume_normalize: null
non_linguistic_symbols: null

preprocessor_conf:
    speech_name: speech
    text_name: text

# training related
seed: 2022
num_workers: 4
ngpu: 1
batch_type: numel
batch_bins: 1600000
accum_grad: 4
max_epoch: 70
patience: null
init: null
best_model_criterion:
-   - valid
    - acc
    - max
keep_nbest_models: 10
use_amp: true

optim: adam
optim_conf:
    lr: 0.002
    weight_decay: 0.000001
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 15000

specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 27
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_ratio_range:
    - 0.
    - 0.05
    num_time_mask: 5
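
A hedged sketch of how this finetuning config might be consumed with ESPnet-EZ, mirroring the `ez.config.from_yaml` pattern used in the notebook below (the YAML path is a placeholder for wherever you keep the file):

```python
# Minimal sketch, assuming espnetez is installed and the YAML is saved locally.
import espnetez as ez

finetune_config = ez.config.from_yaml("asr", "finetune_with_lora.yaml")
print(finetune_config["use_lora"])  # -> True
```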
328 changes: 328 additions & 0 deletions egsez/asr/libri100.ipynb
@@ -0,0 +1,328 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sample demo for ESPnet-Easy!\n",
"In this notebook, we demonstrate how to train an Automatic Speech Recognition (ASR) model on the LibriSpeech-100 dataset. The data preparation follows the same approach as a Kaldi-style recipe. If you are interested in fine-tuning pretrained models, please refer to the libri100_finetune.ipynb file.\n",
"\n",
"Before proceeding, please ensure that you have already downloaded the LibriSpeech-100 dataset from [OpenSLR](https://www.openslr.org/12) and placed it in a directory of your choice. In this notebook, we assume the dataset is stored under `/hdd/database/`. If your dataset is located elsewhere, replace `/hdd/database/` with the actual path."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Preparation\n",
"\n",
"This notebook follows the data preparation steps outlined in `asr.sh`. First, we create dump files that store information about the data: the data ID, the audio path, and the transcription.\n",
"\n",
"ESPnet-Easy supports various types of datasets, including:\n",
"\n",
"1. Dictionary-based dataset with the following structure:\n",
"   ```python\n",
"   {\n",
"       \"data_id\": {\n",
"           \"speech\": path_to_speech_file,\n",
"           \"text\": transcription\n",
"       }\n",
"   }\n",
"   ```\n",
"\n",
"2. List of datasets with the following structure:\n",
"   ```python\n",
"   [\n",
"       {\n",
"           \"speech\": path_to_speech_file,\n",
"           \"text\": transcription\n",
"       }\n",
"   ]\n",
"   ```\n",
"\n",
"If you choose to use a dictionary-based dataset, it is essential to ensure that each `data_id` is unique. ESPnet-Easy also accepts a dump file that may have already been created by `asr.sh`. However, in this notebook, we will create the dump files from scratch."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Need to install espnet if you don't have it\n",
"%pip install -U ../../\n",
"%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 --no-cache"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's create the dump files!\n",
"Please note that you will need to provide a dictionary that specifies the dump file name and format for each data entry.\n",
"This dictionary should have the following format:\n",
"\n",
"```python\n",
"{\n",
"    \"data_name\": [\"dump_file_name\", \"dump_format\"]\n",
"}\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import glob\n",
"\n",
"import espnetez as ez\n",
"\n",
"\n",
"DUMP_DIR = \"./dump/libri100\"\n",
"LIBRI_100_DIRS = [\n",
"    [\"/hdd/database/librispeech-100/LibriSpeech/train-clean-100\", \"train\"],\n",
"    [\"/hdd/database/librispeech-100/LibriSpeech/dev-clean\", \"dev-clean\"],\n",
"    [\"/hdd/database/librispeech-100/LibriSpeech/dev-other\", \"dev-other\"],\n",
"]\n",
"data_info = {\n",
"    \"speech\": [\"wav.scp\", \"sound\"],\n",
"    \"text\": [\"text\", \"text\"],\n",
"}\n",
"\n",
"\n",
"def create_dataset(data_dir):\n",
"    # Build a {data_id: {\"speech\": ..., \"text\": ...}} dictionary for one split.\n",
"    dataset = {}\n",
"    for chapter in glob.glob(os.path.join(data_dir, \"*/*\")):\n",
"        text_file = glob.glob(os.path.join(chapter, \"*.txt\"))[0]\n",
"\n",
"        with open(text_file, \"r\") as f:\n",
"            lines = f.readlines()\n",
"\n",
"        ids_text = {\n",
"            line.split(\" \")[0]: line.split(\" \", maxsplit=1)[1].replace(\"\\n\", \"\")\n",
"            for line in lines\n",
"        }\n",
"        # LibriSpeech ships its audio as .flac files.\n",
"        audio_files = glob.glob(os.path.join(chapter, \"*.flac\"))\n",
"        for audio_file in audio_files:\n",
"            audio_id = os.path.basename(audio_file)[: -len(\".flac\")]\n",
"            dataset[audio_id] = {\n",
"                \"speech\": audio_file,\n",
"                \"text\": ids_text[audio_id],\n",
"            }\n",
"    return dataset\n",
"\n",
"\n",
"for d, n in LIBRI_100_DIRS:\n",
"    dump_dir = os.path.join(DUMP_DIR, n)\n",
"    if not os.path.exists(dump_dir):\n",
"        os.makedirs(dump_dir)\n",
"\n",
"    dataset = create_dataset(d)\n",
"    ez.data.create_dump_file(dump_dir, dataset, data_info)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For the validation files, you have two directories: `dev-clean` and `dev-other`.\n",
"To create a unified dev dataset, you can use the `ez.data.join_dumps` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"ez.data.join_dumps(\n",
"    [\"./dump/libri100/dev-clean\", \"./dump/libri100/dev-other\"], \"./dump/libri100/dev\"\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you have dataset files in the `dump` directory.\n",
"They look like this:\n",
"\n",
"wav.scp\n",
"```\n",
"1255-138279-0008 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0008.flac\n",
"1255-138279-0022 /hdd/database/librispeech-100/LibriSpeech/dev-other/1255/138279/1255-138279-0022.flac\n",
"```\n",
"\n",
"text\n",
"```\n",
"1255-138279-0008 TWO THREE\n",
"1255-138279-0022 IF I SAID SO OF COURSE I WILL\n",
"```\n"
]
},
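{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check, you can read the first entries of the dump files back to confirm that IDs, paths, and transcriptions line up:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: print the first two entries of each dev dump file.\n",
"for fname in [\"wav.scp\", \"text\"]:\n",
"    print(f\"--- {fname} ---\")\n",
"    with open(f\"./dump/libri100/dev/{fname}\", \"r\") as f:\n",
"        for line in f.readlines()[:2]:\n",
"            print(line.rstrip())"
]
},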
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Train SentencePiece model\n",
"\n",
"Training a SentencePiece model requires a plain-text training file, so let's begin by creating one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# generate training texts from the training data\n",
"# you can select several datasets to train sentencepiece.\n",
"ez.preprocess.prepare_sentences([\"dump/libri100/train/text\"], \"dump/spm\")\n",
"\n",
"ez.preprocess.train_sentencepiece(\n",
"    \"dump/spm/train.txt\",\n",
"    \"data/bpemodel\",\n",
"    vocab_size=5000,\n",
")"
]
},
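{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional check, you can load the trained model with the `sentencepiece` package and tokenize a sample sentence. Note that the model filename below is an assumption about what `train_sentencepiece` writes into `data/bpemodel`; adjust it to match the actual output on your machine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check of the trained BPE model.\n",
"# NOTE: the model path is an assumption; check data/bpemodel for the real filename.\n",
"import sentencepiece as spm\n",
"\n",
"sp = spm.SentencePieceProcessor(model_file=\"data/bpemodel/bpe.model\")\n",
"print(sp.encode(\"IF I SAID SO OF COURSE I WILL\", out_type=str))"
]
},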
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Configure Training Process\n",
"\n",
"To configure the training process, you can use the configuration files already provided by ESPnet contributors. To use one, save a copy of the YAML file on your local machine; for instance, the [e-branchformer config](train_asr_e-branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml).\n",
"\n",
"In my case, I changed the `batch_bins` parameter from `16000000` to `1600000` so that training fits on my GPU (an RTX 2080 Ti)."
]
},
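{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"If you would rather keep the downloaded YAML untouched, the same override can be applied programmatically. A minimal sketch, assuming the config file has been saved next to this notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: load the config and shrink batch_bins to fit a smaller GPU.\n",
"import espnetez as ez\n",
"\n",
"cfg = ez.config.from_yaml(\n",
"    \"asr\",\n",
"    \"train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml\",\n",
")\n",
"cfg[\"batch_bins\"] = 1600000  # original value: 16000000"
]
},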
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Training\n",
"\n",
"Before training, execute the `collect_stats` method to prepare the stats file. This step is required before training, as it gathers the feature statistics the model uses for normalization."
]
},
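{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below also loads a `preprocess.yaml` file, whose exact contents are not shown in this notebook. A plausible minimal shape, with hypothetical paths that you should replace with the actual SentencePiece outputs, is:\n",
"\n",
"```yaml\n",
"# hypothetical preprocess.yaml; field names follow standard ESPnet ASR configs\n",
"token_type: bpe\n",
"bpemodel: data/bpemodel/bpe.model\n",
"token_list: data/bpemodel/tokens.txt\n",
"```"
]
},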
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import espnetez as ez\n",
"\n",
"EXP_DIR = \"exp/train_asr_branchformer_e24_amp\"\n",
"STATS_DIR = \"exp/stats\"\n",
"\n",
"# load config\n",
"training_config = ez.config.from_yaml(\n",
"    \"asr\",\n",
"    \"train_asr_e_branchformer_size256_mlp1024_linear1024_e12_mactrue_edrop0.0_ddrop0.0.yaml\",\n",
")\n",
"preprocessor_config = ez.utils.load_yaml(\"preprocess.yaml\")\n",
"training_config.update(preprocessor_config)\n",
"\n",
"with open(preprocessor_config[\"token_list\"], \"r\") as f:\n",
"    training_config[\"token_list\"] = [t.replace(\"\\n\", \"\") for t in f.readlines()]\n",
"\n",
"# Define the Trainer class\n",
"trainer = ez.Trainer(\n",
"    task=\"asr\",\n",
"    train_config=training_config,\n",
"    train_dump_dir=\"dump/libri100/train\",\n",
"    valid_dump_dir=\"dump/libri100/dev\",\n",
"    data_info=data_info,\n",
"    output_dir=EXP_DIR,\n",
"    stats_dir=STATS_DIR,\n",
"    ngpu=1,\n",
")\n",
"trainer.collect_stats()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we are ready to begin the training process!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainer.train()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Inference\n",
"You can simply use ESPnet's standard inference API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import librosa\n",
"from espnet2.bin.asr_inference import Speech2Text\n",
"\n",
"m = Speech2Text(\n",
"    \"./exp/train_asr_branchformer_e24_amp/config.yaml\",\n",
"    \"./exp/train_asr_branchformer_e24_amp/valid.acc.best.pth\",\n",
"    beam_size=10,\n",
")\n",
"\n",
"# Read the first entry of the dev set and transcribe it.\n",
"with open(\"./dump/libri100/dev/wav.scp\", \"r\") as f:\n",
"    sample_path = f.readlines()[0]\n",
"\n",
"y, sr = librosa.load(sample_path.split()[1], sr=16000, mono=True)\n",
"output = m(y)\n",
"print(output[0][0])\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}