Commit
1 parent a5a73d9, commit 00e7c59
Showing 2 changed files with 165 additions and 271 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,106 +1,197 @@ | ||
# Roboy Sonosco | ||
Roboy Sonosco (from Lat. sonus - sound and nōscō - I know, recognize) - a library for Speech Recognition based on Deep Learning models | ||
![# Sonosco](./docs/imgs/sonosco_3.jpg) | ||
<br> | ||
<br> | ||
<br> | ||
<br> | ||
|
||
## Installation | ||
Sonosco (from Lat. sonus - sound and nōscō - I know, recognize) | ||
is a library for training and deploying deep speech recognition models. | ||
|
||
The supported OS is Ubuntu 18.04 LTS (however, it should work fine on other distributions). | ||
Supported Python version is 3.6+. | ||
Supported CUDA version is 10.0. | ||
Supported PyTorch version is 1.0. | ||
The goal of this project is to enable fast, repeatable and structured training of deep | ||
automatic speech recognition (ASR) models and to provide a transcription server (REST API & frontend) for | ||
trying out the trained models. <br> | ||
Additionally, we provide interfaces to ROS in order to use it with | ||
the anthropomimetic robot [Roboy](https://roboy.org/). | ||
<br> | ||
<br> | ||
<br> | ||
|
||
--- | ||
___ | ||
### Installation | ||
|
||
Install CUDA 10.0 from [NVIDIA website](https://developer.nvidia.com/cuda-10.0-download-archive). Make sure that your local gcc, g++, cmake versions are not older than the ones used to compile your OS kernel. | ||
|
||
You will need to download the latest [cuDNN](https://developer.nvidia.com/rdp/cudnn-archive) for CUDA 10.0. | ||
Unzip it: | ||
``` | ||
tar -xzvf cudnn-9.0-linux-x64-v7.tgz | ||
``` | ||
Run | ||
#### Via pip | ||
The easiest way to use Sonosco's functionality is via pip: | ||
``` | ||
sudo cp cuda/include/cudnn.h /usr/local/cuda/include | ||
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64 | ||
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn* | ||
pip install sonosco | ||
``` | ||
--- | ||
**Note**: Sonosco requires Python 3.7 or higher. | ||
|
||
**All of the following steps you may perform inside [Anaconda](https://www.anaconda.com/) or [virtualenv](https://virtualenv.pypa.io/en/latest/)** | ||
For reliability, we recommend using an environment virtualization tool, like virtualenv or conda. | ||
|
||
Install [PyTorch](https://pytorch.org/get-started/locally/). For your particular configuration, you may want to build it from the [sources](https://github.com/pytorch/pytorch). | ||
<br> | ||
<br> | ||
#### For developers or trying out the transcription server | ||
|
||
Install SeanNaren's fork for Warp-CTC bindings. **Deprecated**: will be updated to use [built-in](https://pytorch.org/docs/stable/nn.html#torch.nn.CTCLoss) functions. | ||
``` | ||
git clone https://github.com/SeanNaren/warp-ctc.git | ||
cd warp-ctc; mkdir build; cd build; cmake ..; make | ||
export CUDA_HOME="/usr/local/cuda" | ||
cd ../pytorch_binding && python setup.py install | ||
Clone the repository and install dependencies: | ||
``` | ||
# Create a virtual python environment to not pollute the global setup | ||
conda create -n 'sonosco' python=3.7 | ||
Install pytorch audio: | ||
``` | ||
sudo apt-get install sox libsox-dev libsox-fmt-all | ||
git clone https://github.com/pytorch/audio.git | ||
cd audio && python setup.py install | ||
# Activate the virtual environment | ||
conda activate sonosco | ||
# Clone the repo and change into it | ||
git clone https://github.com/Roboy/sonosco.git | ||
cd sonosco | ||
# Install normal requirements | ||
pip install -r requirements.txt | ||
# Install your local sonosco clone into your virtual environment | ||
pip install . | ||
``` | ||
Now you can check out some of the [Getting Started]() tutorials to train a model or use | ||
the transcription server. | ||
<br> | ||
<br> | ||
<br> | ||
____________ | ||
### High Level Design | ||
|
||
|
||
![# High-Level-Design](./docs/imgs/high-level-design.svg) | ||
|
||
The project is split into 4 parts that build on each other: | ||
|
||
For data processing, we provide scripts to download and preprocess | ||
some publicly available speech recognition datasets. Additionally, | ||
we provide scripts and functions to create manifest files | ||
(i.e. catalog files) for your own data and to merge existing manifest files | ||
into one. | ||
|
||
The data, or rather the manifest files, can then be used to train and | ||
evaluate an ASR model. We provide some ASR model architectures, such as LAS, | ||
TDS and DeepSpeech2, but you can also design and train your own PyTorch models. | ||
|
||
The trained model can then be used in a transcription server, which consists | ||
of a REST API and a simple Vue.js frontend, to transcribe voice recorded | ||
by a microphone and compare the transcription results to those of other models (which can | ||
be downloaded from our [GitHub](https://github.com/Roboy/sonosco) repository). | ||
|
||
Furthermore, we provide example code showing how to use different ASR models with ROS, | ||
especially with the Roboy ROS interfaces (i.e. topics & messages). | ||
|
||
<br> | ||
<br> | ||
|
||
|
||
______ | ||
### Data (-processing) | ||
|
||
##### Downloading publicly available datasets | ||
We provide scripts to download and process the following publicly available datasets: | ||
* [An4](http://www.speech.cs.cmu.edu/databases/an4/) - Alphanumeric database | ||
* [LibriSpeech](http://www.openslr.org/12) - read English audiobooks | ||
* [TED-LIUM 3](https://lium.univ-lemans.fr/en/ted-lium3/) (ted3) - TED talks | ||
* [Voxforge](http://www.voxforge.org/home/downloads) | ||
* Common Voice (old version) | ||
|
||
Run the respective script in `sonosco > datasets > download_datasets` with the flag that sets the | ||
output path (`--target-dir` in the An4 example below) to download and process the dataset. The script also creates | ||
a manifest file for the dataset. | ||
|
||
For example: | ||
|
||
If you want decoding to support beam search with an optional language model, install [ctcdecode](https://github.com/parlance/ctcdecode): | ||
``` | ||
git clone --recursive https://github.com/parlance/ctcdecode.git | ||
cd ctcdecode && pip install . | ||
python an4.py --target-dir temp/data/an4 | ||
``` | ||
<br> | ||
<br> | ||
|
||
Clone this repo and run this within the repo: | ||
##### Creating a manifest from your own data | ||
|
||
If you want to create a manifest from your own data, arrange your files as follows: | ||
``` | ||
pip install -r requirements.txt | ||
data_directory | ||
└───txt | ||
│ │ transcription01.txt | ||
│ │ transcription02.txt | ||
│ | ||
└───wav | ||
│ audio01.wav | ||
│ audio02.wav | ||
``` | ||
Then run the `create_manifest.py` script with the data directory and an output file | ||
to create a manifest file for your data. | ||
|
||
### Mixed Precision | ||
If you want to use mixed precision training, you have to install [NVIDIA Apex](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/): | ||
For example: | ||
``` | ||
git clone --recursive https://github.com/NVIDIA/apex.git | ||
cd apex && pip install . | ||
python create_manifest.py --data_path path/to/data_directory --output-file temp/data/manifest.csv | ||
``` | ||
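What such a manifest builder does can be sketched in a few lines of Python. This is not the code of `create_manifest.py`: the function name `build_manifest` is hypothetical, the two-column row layout (audio path, transcript path) is taken from the earlier README revision, and pairing files by sorted order is an assumption made because the example file names (`audio01.wav`, `transcription01.txt`) do not share a stem.

```python
import csv
from pathlib import Path


def build_manifest(data_dir, output_file):
    """Write one CSV row per (audio, transcript) pair and return the row count.

    Hypothetical helper: pairs wav/*.wav with txt/*.txt by sorted order.
    """
    data_dir = Path(data_dir)
    wavs = sorted((data_dir / "wav").glob("*.wav"))
    txts = sorted((data_dir / "txt").glob("*.txt"))
    if len(wavs) != len(txts):
        raise ValueError(
            f"found {len(wavs)} audio files but {len(txts)} transcripts"
        )

    output_file = Path(output_file)
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with output_file.open("w", newline="") as handle:
        writer = csv.writer(handle)
        for wav, txt in zip(wavs, txts):
            # Absolute paths keep the manifest valid from any working directory.
            writer.writerow([str(wav.resolve()), str(txt.resolve())])
    return len(wavs)
```

Calling `build_manifest("path/to/data_directory", "temp/data/manifest.csv")` would mirror the command above; the count check catches a missing transcript before a mismatched pair ends up in the manifest.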
|
||
## Usage | ||
<br> | ||
<br> | ||
|
||
### Dataset | ||
##### Merging manifest files | ||
|
||
To create a dataset you must create a CSV manifest file containing the locations of the training data. This has to be in the format of: | ||
To merge multiple manifests into one, specify a folder that contains all manifest | ||
files to be merged and run `merge_manifest.py`. | ||
The script looks for all .csv files in that folder and merges their content into the specified output file. | ||
|
||
For example: | ||
``` | ||
/path/to/audio.wav,/path/to/text.txt | ||
/path/to/audio2.wav,/path/to/text2.txt | ||
... | ||
python merge_manifest.py --merge-dir path/to/manifests_dir --output-path temp/manifests/merged_manifest.csv | ||
``` | ||
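The merge step amounts to concatenating CSV rows. The sketch below is an illustration rather than the code of `merge_manifest.py`: the function name `merge_manifests` is hypothetical, and both dropping duplicate rows and skipping the output file when it lives in the merge folder are assumptions.

```python
import csv
from pathlib import Path


def merge_manifests(merge_dir, output_path):
    """Concatenate every .csv file in merge_dir into output_path.

    Hypothetical helper: duplicate rows are dropped (an assumption).
    Returns the number of rows written.
    """
    output_path = Path(output_path)
    rows, seen = [], set()
    for manifest in sorted(Path(merge_dir).glob("*.csv")):
        # Do not read a previous merge result back in.
        if manifest.resolve() == output_path.resolve():
            continue
        with manifest.open(newline="") as handle:
            for row in csv.reader(handle):
                key = tuple(row)
                if row and key not in seen:
                    seen.add(key)
                    rows.append(row)

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with output_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return len(rows)
```

Sorting the input files makes the merged manifest deterministic, which helps when the manifest itself is part of a repeatable experiment.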
There is an example in examples directory. | ||
|
||
### Training, Testing and Inference | ||
<br> | ||
<br> | ||
|
||
Fundamentally, you can run the scripts the same way: | ||
``` | ||
python3 train.py --config /path/to/config/file.yaml | ||
python3 test.py --config /path/to/config/file.yaml | ||
python3 infer.py --config /path/to/config/file.yaml | ||
``` | ||
The scripts are initialised via configuration files. | ||
|
||
#### Configuration | ||
___ | ||
### Model Training | ||
|
||
Configuration file contains arguments for ModelWrapper initialisation as well as extra parameters. Like this: | ||
``` | ||
train: | ||
... | ||
log-dir: 'logs' # Location for log files | ||
def-dir: 'examples/checkpoints/' # Default location to save/load models | ||
model-name: 'asr_final.pth' # File name to save the best model | ||
sample-rate: 16000 # Sample rate | ||
window: 'hamming' # Window type for spectrogram generation | ||
batch-size: 32 # Batch size for training | ||
checkpoint: True # Enables checkpoint saving of model | ||
... | ||
``` | ||
You can find more configuration examples with descriptions in the config directory. | ||
One goal of this framework is to keep training as easy as possible and enable | ||
keeping track of already conducted experiments. | ||
<br> | ||
<br> | ||
|
||
#### Analysis Object Model | ||
|
||
For model training, there are multiple objects that interact with each other. | ||
|
||
![# Analysis Object Model](./docs/imgs/aom.svg) | ||
|
||
For model training, you can define different metrics that are evaluated at specified | ||
steps during an epoch and during | ||
validation.<br> | ||
Sonosco already provides several metrics, such as [Word Error Rate (WER)]() and | ||
[Character Error Rate (CER)](). Additional metrics can be created following the same scheme. | ||
See [Metrics](). | ||
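For orientation, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The following is a minimal pure-Python sketch of that definition, not Sonosco's metric implementation; the function names are hypothetical, and Sonosco's own metrics may use a different signature.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Because the denominator is the reference length, both rates can exceed 1.0 when the hypothesis contains many insertions.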
|
||
Additionally, callbacks can be defined. A callback is arbitrary code that is executed during | ||
training. Sonosco provides several callbacks, such as [Learning Rate Reduction](), | ||
[ModelSerializationCallback](), [TensorboardCallback](), ... <br> | ||
Custom callbacks can be defined following the examples. See [Callbacks](). | ||
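To make the callback idea concrete, here is a framework-free sketch of a learning-rate-reduction callback. It does not use Sonosco's callback interface: the class name, the `state` dictionary with its `lr` and `val_loss` keys, and the `run_callbacks` helper are all hypothetical and only illustrate the pattern of invoking arbitrary code at fixed points in training.

```python
class ReduceLearningRateOnPlateau:
    """Scale the learning rate down when validation loss stops improving."""

    def __init__(self, factor=0.5, patience=2, min_lr=1e-6):
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def __call__(self, state):
        loss = state["val_loss"]
        if loss < self.best:
            # Improvement: remember it and reset the patience counter.
            self.best = loss
            self.bad_epochs = 0
            return
        self.bad_epochs += 1
        if self.bad_epochs >= self.patience:
            state["lr"] = max(state["lr"] * self.factor, self.min_lr)
            self.bad_epochs = 0


def run_callbacks(callbacks, state):
    """Invoke every callback with the shared training state."""
    for callback in callbacks:
        callback(state)
```

A training loop would call `run_callbacks(callbacks, state)` after each validation pass; keeping the callback's bookkeeping inside the object keeps the loop itself free of scheduling logic.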
|
||
Most importantly, a model needs to be defined. The model can be any torch module. For | ||
(de-)serialization, the model needs to conform to the [Serialization Guide]().<br> | ||
Sonosco provides ready-made model architectures that can simply be imported, such as | ||
[Listen Attend Spell](), [Time-depth Separable Convolutions]() and [DeepSpeech2](). | ||
|
||
We created a specific AudioDataset class that is based on the PyTorch Dataset class. | ||
This AudioDataset requires an AudioDataProcessor in order to process the specified manifest file. | ||
Further, we created a special AudioDataLoader based on PyTorch's DataLoader class, which | ||
takes the AudioDataset and provides the data in batches to the model training. | ||
|
||
Metrics, Callbacks, the Model and the AudioDataLoader are then provided to the ModelTrainer. | ||
This ModelTrainer takes care of the training process. See [Getting Started](). | ||
|
||
## Acknowledgements | ||
The ModelTrainer can then be registered with the Experiment, which takes care of provenance. | ||
That is, when training starts, all your code is timestamped and saved in a separate directory, | ||
so you can always repeat the same experiment. Additionally, the serialized model and ModelTrainer, | ||
logs and TensorBoard logs are saved in this folder. | ||
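The provenance step can be pictured as a timestamped snapshot of the source tree. The sketch below illustrates that idea only and is not the Experiment class: the function name `snapshot_experiment`, the `code` and `logs` subfolder names, and the timestamp format are assumptions.

```python
import shutil
import time
from pathlib import Path


def snapshot_experiment(source_dir, experiments_root, name):
    """Copy the source tree into a new timestamped run directory.

    Hypothetical helper. experiments_root should live outside source_dir,
    otherwise the copy would recurse into its own output.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    run_dir = Path(experiments_root) / f"{stamp}_{name}"
    # copytree creates run_dir and run_dir/code; it fails if they already exist.
    shutil.copytree(
        source_dir,
        run_dir / "code",
        ignore=shutil.ignore_patterns("__pycache__", "*.pyc"),
    )
    (run_dir / "logs").mkdir()
    return run_dir
```

Since `copytree` refuses to overwrite an existing destination, two runs with the same name in the same second raise an error instead of silently mixing their files.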
|
||
This project is partially based on SeanNaren's [deepspeech.pytorch](https://github.com/SeanNaren/deepspeech.pytorch) repository. | ||
Further, a Serializer needs to be provided to the Experiment. This object can serialize an | ||
arbitrary class with its parameters, which can then be deserialized using the Deserializer.<br> | ||
When the `Experiment.stop()` method is called, the model and the ModelTrainer get serialized, | ||
so that you can deserialize the ModelTrainer later and continue training with all current | ||
parameters (such as epoch steps, ...). |