
Tacotron

Implementation of Tacotron, an end-to-end neural network for speech synthesis.

Samples

The following playlist contains samples produced on unseen inputs by Tacotron trained for 250K steps on the Nancy Corpus with r=5 (the output reduction factor; the decoder predicts 5 frames per step).

Samples

As you can hear, the output is still rough around the edges, but words are intelligible on new inputs, and it should improve with more training, tuning, and data.

Unsurprisingly, the biggest bottleneck to better performance seems to be the quality and scale of the dataset. Google uses an internal 25-hour dataset, which is about 4 times the size of the Nancy corpus used here, so their results are more convincing. I have trained a model on ~20 hours of audiobook data from the 2013 Blizzard Challenge, but inconsistent prosody and accents actually made the outputs sound much worse than those above. Any suggestions for a better dataset to try are welcome.

Requirements

TensorFlow 1.2

librosa

tqdm
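
These can be installed with pip; for example, assuming the CPU build of TensorFlow (substitute tensorflow-gpu if you have a CUDA-capable GPU):

pip3 install tensorflow==1.2.0 librosa tqdm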

Data

For best results, use the Nancy corpus from the 2011 Blizzard Challenge. The data is freely available for research use upon signing a license. After obtaining a username and password, add them to the 'download_data.sh' script to fetch the data automatically.

We also download the considerably smaller CMU ARCTIC dataset for testing, which can be obtained without a license, but don't expect good results with it.

Usage

First, run the data-fetching script (preferably after obtaining a username and password for the Nancy corpus):

bash download_data.sh

Then preprocess the data:

python3 preprocess.py
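
Under the hood, Tacotron-style preprocessing turns each wav file into linear and mel spectrogram targets. The snippet below is only a hypothetical illustration of that step using librosa; preprocess.py is the actual implementation and may use different parameters and output formats.

import librosa
import numpy as np

def extract_features(wav_path, sr=16000, n_fft=2048, hop_length=200, n_mels=80):
    # Hypothetical feature extraction; parameters here are illustrative only.
    wav, _ = librosa.load(wav_path, sr=sr)
    # Linear-frequency magnitude spectrogram from the short-time Fourier transform.
    linear = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    # Mel-scale spectrogram computed from the power spectrogram.
    mel = librosa.feature.melspectrogram(S=linear ** 2, sr=sr, n_mels=n_mels)
    # Log compression keeps the dynamic range manageable for training.
    return np.log(linear + 1e-5), np.log(mel + 1e-5)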

Now we're ready to start training:

python3 train.py --train-set nancy 

Finally, create a text file containing the prompts you want to synthesize, then pass it to test.py:

python3 test.py prompts.txt
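
For example, prompts.txt might contain one prompt per line (a hypothetical format; check test.py for the exact parsing it expects):

The birch canoe slid on the smooth planks.
These take the shape of a long round arch.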

To see the audio outputs created by Tacotron, open up TensorBoard.
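
Assuming train.py writes its summaries to a log directory (check the script for the actual path; the one below is a placeholder):

tensorboard --logdir path/to/log/dir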

On my GTX 1080, it takes about 5 hours to get to the point where synthesized speech on the training set is discernible and around 20 hours to obtain audible generalization at test time. Despite fairly aggressive gradient clipping, the loss is prone to exploding. If that happens, try restarting from the most recent checkpoint (using the restore flag) with a slightly lowered learning rate. I'm working on improving this problem.
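
For reference, the kind of gradient clipping mentioned above is usually implemented with global-norm clipping in TensorFlow 1.x. The snippet below is a hypothetical sketch, not the repository's actual code; train.py may use a different optimizer, threshold, or schedule.

import tensorflow as tf

def clipped_train_op(loss, learning_rate=1e-3, clip_norm=1.0):
    # Hypothetical sketch of global-norm gradient clipping (TF 1.x style);
    # see train.py for the settings actually used in this repository.
    optimizer = tf.train.AdamOptimizer(learning_rate)
    # Pair each gradient with its variable, rescale all gradients together so
    # their global norm is at most clip_norm, then apply the clipped gradients.
    grads, variables = zip(*optimizer.compute_gradients(loss))
    clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm)
    return optimizer.apply_gradients(zip(clipped_grads, variables))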
