An implementation of Tacotron speech synthesis in TensorFlow.
- Audio Samples after training for 877k steps (~11 days).
- Speech started to become intelligible around 20k steps.
- There hasn't been much improvement since around 200k steps: the loss has kept going down, but the difference is hard to hear in the audio.
Earlier this year, Google published a paper, Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model, where they present a neural text-to-speech model that learns to synthesize speech directly from (text, audio) pairs. However, they didn't release their source code or training data. This is an attempt to provide an open-source implementation of the model described in their paper.
The quality isn't as good as Google's demo yet, but hopefully it will get there someday :-). Pull requests are welcome!
Make sure you have Python 3. Then:
pip install -r requirements.txt
Download and unpack a model:
curl | tar x -C /tmp
Run the demo server:
python3 --checkpoint /tmp/tacotron-20170720/model.ckpt
Point your browser at localhost:9000
- Type what you want to synthesize
Note: you need at least 40GB of free disk space to train a model.
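The disk space requirement can be checked before you start downloading. This is a minimal sketch using only the Python standard library. The helper name `has_free_space` is hypothetical and not part of this repository, and it assumes 1GB means 10^9 bytes.

```python
import shutil

REQUIRED_GB = 40  # figure quoted in the note above


def has_free_space(path, required_gb=REQUIRED_GB):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    Assumption: 1 GB is treated as 10**9 bytes.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9
```

For example, `has_free_space(os.path.expanduser("~"))` checks the filesystem that holds your home directory.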
Download a speech dataset.
The following are supported out of the box:
- LJ Speech (Public Domain)
- Blizzard 2012 (Creative Commons Attribution Share-Alike)
You can use other datasets if you convert them to the right format. See for an example.
Unpack the dataset into
After unpacking, your tree should look like this for LJ Speech:
tacotron
  |- LJSpeech-1.0
      |- metadata.csv
      |- wavs
or like this for Blizzard 2012:
tacotron
  |- Blizzard2012
      |- ATrampAbroad
      |   |- sentence_index.txt
      |   |- lab
      |   |- wav
      |- TheManThatCorruptedHadleyburg
          |- sentence_index.txt
          |- lab
          |- wav
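Before preprocessing, it can help to confirm that the unpacked tree matches the layout above. The sketch below checks only the LJ Speech case. The function name `missing_ljspeech_entries` is hypothetical and not something this repository provides.

```python
from pathlib import Path


def missing_ljspeech_entries(base_dir):
    """Return the expected LJ Speech entries that are absent under base_dir.

    base_dir is the directory the dataset was unpacked into (the "tacotron"
    directory in the tree above).
    """
    base = Path(base_dir)
    root = base / "LJSpeech-1.0"
    expected = [root / "metadata.csv", root / "wavs"]
    # Report paths relative to base_dir, using forward slashes.
    return [p.relative_to(base).as_posix() for p in expected if not p.exists()]
```

An empty list means both `metadata.csv` and `wavs` were found. The same idea extends to the Blizzard 2012 tree by listing its `sentence_index.txt`, `lab`, and `wav` entries per book.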
Preprocess the data
python3 --dataset ljspeech
- Use --dataset blizzard for Blizzard data
- Use
Train a model
Monitor with Tensorboard (optional)
tensorboard --logdir ~/tacotron/logs-tacotron
The trainer dumps audio and alignments every 1000 steps. You can find these in
. -
Synthesize from a checkpoint
python3 --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
Replace "185000" with the checkpoint number that you want to use, then open a browser to
and type what you want to speak. Alternatively, you can run at the command line:
python3 --checkpoint ~/tacotron/logs-tacotron/model.ckpt-185000
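To see which checkpoint numbers are available, you can scan the log directory. This is a sketch under one assumption: that checkpoint files are named like `model.ckpt-185000.index` or `model.ckpt-185000.meta`, as TensorFlow savers usually write them. The helper `available_steps` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: checkpoint files look like "model.ckpt-<step>.<suffix>".
_STEP_RE = re.compile(r"^model\.ckpt-(\d+)\.")


def available_steps(log_dir):
    """Return the sorted, de-duplicated step numbers found in log_dir."""
    steps = set()
    for entry in Path(log_dir).iterdir():
        match = _STEP_RE.match(entry.name)
        if match:
            steps.add(int(match.group(1)))
    return sorted(steps)
```

Each step usually has several files (index, meta, data), so the set removes duplicates before sorting.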
TCMalloc seems to improve training speed and avoids occasional slowdowns seen with the default allocator. You can enable it by installing it and setting
. -
You can train with CMUDict by downloading the dictionary to ~/tacotron/training and then passing the flag
to This will allow you to pass ARPAbet phonemes enclosed in curly braces at eval time to force a particular pronunciation, e.g. Turn left on {HH AW1 S S T AH0 N} Street.
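The curly-brace convention can be illustrated with a small parser that separates ordinary text from ARPAbet spans. This is only a sketch of the input format. The function `split_arpabet` is hypothetical and is not how this repository necessarily processes its input.

```python
import re

# Matches one {...} span with no nested braces.
_CURLY_RE = re.compile(r"\{([^{}]+)\}")


def split_arpabet(text):
    """Split text into ("text", str) and ("arpabet", [phonemes]) segments."""
    segments = []
    pos = 0
    for match in _CURLY_RE.finditer(text):
        if match.start() > pos:
            segments.append(("text", text[pos:match.start()]))
        # Phonemes inside the braces are separated by whitespace.
        segments.append(("arpabet", match.group(1).split()))
        pos = match.end()
    if pos < len(text):
        segments.append(("text", text[pos:]))
    return segments
```

Applied to the example sentence, this yields the text "Turn left on ", then the phoneme list HH AW1 S S T AH0 N, then the text " Street.".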
If you pass a Slack incoming webhook URL as the
flag to, it will send you progress updates every 1000 steps.
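For reference, a Slack incoming webhook accepts an HTTP POST with a JSON body containing a `text` field. The sketch below builds such a request with the standard library. The names `build_slack_request` and `should_notify` are hypothetical, and the trainer's actual reporting code may differ.

```python
import json
import urllib.request


def build_slack_request(webhook_url, message):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def should_notify(step, every=1000):
    """Return True on every `every`-th training step (assumed interval: 1000)."""
    return step > 0 and step % every == 0
```

Passing the returned request to `urllib.request.urlopen` would send it. The split keeps the network call out of the part that is easy to test.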
Occasionally, you may see a spike in loss and the model will forget how to attend (the alignments will no longer make sense). Although it will recover eventually, it may save time to restart at a checkpoint prior to the spike by passing the
flag to (replacing 150000 with a step number prior to the spike). Update: a recent fix to gradient clipping by @candlewill may have fixed this.
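Choosing the restart point amounts to finding the last saved step strictly before the spike. A minimal sketch, assuming you already have the list of saved step numbers. The helper `checkpoint_before` is hypothetical.

```python
from bisect import bisect_left


def checkpoint_before(saved_steps, spike_step):
    """Return the largest saved step strictly less than spike_step, or None."""
    steps = sorted(saved_steps)
    # bisect_left gives the index of the first step >= spike_step.
    index = bisect_left(steps, spike_step)
    return steps[index - 1] if index > 0 else None
```

With checkpoints at 140000, 150000, and 160000 and a spike seen around step 155000, this picks 150000.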
- By Alex Barron:
- By Kyubyong Park: