[TOC]
- 2021/04/20 Merged the wavegan branch into main and deleted the wavegan branch!
- 2021/04/13 Created the encoder branch to develop the speech style transfer module!
- 2021/04/13 The softdtw branch supports SoftDTW loss!
- 2021/04/09 The wavegan branch supports the PWG / MelGAN / Multi-band MelGAN vocoders!
- 2021/04/05 Support ParallelText2Mel + MelGAN vocoder!
- [Key Info] Speed Benchmarks, Samples, Web Demo, [Few Issues](#few-issues), Communication ...
```
.
|--- config/              # config files
|    |--- default.yaml
|    |--- ...
|--- datasets/            # data processing
|--- encoder/             # voice encoder
|    |--- voice_encoder.py
|    |--- ...
|--- helpers/             # some helpers
|    |--- trainer.py
|    |--- synthesizer.py
|    |--- ...
|--- logdir/              # training log directory
|--- losses/              # loss functions
|--- models/              # synthesizer
|    |--- layers.py
|    |--- duration.py
|    |--- parallel.py
|--- pretrained/          # pretrained models (LJSpeech dataset)
|--- samples/             # synthesized samples
|--- utils/               # some common utils
|--- vocoder/             # vocoder
|    |--- melgan.py
|    |--- ...
|--- wandb/               # Wandb save directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py   # prepare dataset
|--- README.md
|--- requirements.txt     # dependencies
|--- synthesize.py        # synthesis script
|--- train-duration.py    # training script
|--- train-parallel.py
```
Here are some synthesized samples.
Here are some pretrained models.
Step (1): clone the repo

```
$ git clone https://github.com/atomicoo/ParallelTTS.git
```

Step (2): install dependencies

```
$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt
```

Step (3): synthesize audio

```
$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/
```
To synthesize audio in other languages, set the corresponding config file via `--config`.
Step (1): prepare the dataset

```
$ python prepare-dataset.py
```

Use `--config` to set the config file; the default (`default.yaml`) is for the LJSpeech dataset.
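The `--config` switch presumably just points the script at a different YAML file under `config/`. A minimal sketch of that kind of argument handling (the repo's actual parsing may differ, and the non-default file name below is purely hypothetical):

```python
import argparse

# Sketch of a --config switch like the one the scripts expose.
# The default mirrors the documented behavior (default.yaml = LJSpeech).
parser = argparse.ArgumentParser(description="prepare dataset (sketch)")
parser.add_argument("--config", default="config/default.yaml",
                    help="path to the YAML config file")

args = parser.parse_args([])            # no flag given: LJSpeech default
print(args.config)

# Hypothetical example of selecting another language's config:
args = parser.parse_args(["--config", "config/jsut.yaml"])
print(args.config)
```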
Step (2): train the alignment model

```
$ python train-duration.py
```
Step (3): extract durations

```
$ python extract-duration.py
```

Use `--ground_truth` to set whether to generate ground-truth spectrograms.
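To see why durations matter: a parallel text-to-mel model repeats each phoneme encoding by its duration so the phoneme sequence is stretched to spectrogram length. A minimal NumPy sketch of that expansion (illustrative only; not the repo's actual implementation):

```python
import numpy as np

def expand_by_durations(encodings, durations):
    """Repeat each phoneme encoding durations[i] times along the time axis."""
    return np.repeat(encodings, durations, axis=0)

# 3 phoneme encodings with 2 channels each (toy values)
enc = np.arange(6, dtype=np.float32).reshape(3, 2)
dur = np.array([2, 1, 3])              # frames assigned to each phoneme
out = expand_by_durations(enc, dur)
print(out.shape)                       # (6, 2): total frames == dur.sum()
```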
Step (4): train the synthesis model

```
$ python train-parallel.py
```

Use `--ground_truth` to set whether to train the model on ground-truth spectrograms.
If using TensorBoardX, run:

```
$ tensorboard --logdir logdir/[DIR]/
```

It is highly recommended to use Wandb (Weights & Biases): just pass `--enable_wandb` when training.
- LJSpeech: English, female, 22050 Hz, ~24 h
- LibriSpeech: English, multi-speaker (only the train-clean-100 audio is used), 16000 Hz, ~1000 h total
- JSUT: Japanese, female, 48000 Hz, ~10 h
- BiaoBei: Mandarin, female, 48000 Hz, ~12 h
- KSS: Korean, female, 44100 Hz, ~12 h
- RuLS: Russian, multi-speaker (only a single speaker's audio is used), 16000 Hz, ~98 h total
- TWLSpeech (non-public, poor quality): Tibetan, female (multi-speaker, similar voices), 16000 Hz, ~23 h
TODO: to be added.
Training speed: on the LJSpeech dataset with batch size 64, training on an 8 GB GTX 1080 GPU takes ~8 h (~300 epochs).
Synthesis speed: tested on CPU @ Intel Core i7-8550U / GPU @ NVIDIA GeForce MX150; ~8 s per synthesized audio clip (about 20 words).
| Batch Size | Spec (GPU) | Audio (GPU) | Spec (CPU) | Audio (CPU) |
| --- | --- | --- | --- | --- |
| 1 | 0.042 | 0.218 | 0.100 | 2.004 |
| 2 | 0.046 | 0.453 | 0.209 | 3.922 |
| 4 | 0.053 | 0.863 | 0.407 | 7.897 |
| 8 | 0.062 | 2.386 | 0.878 | 14.599 |
Note: these numbers come from a single run (no repeated tests) and are for reference only.
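Reading the table as total seconds per batch (my assumption about the units), the per-item cost shows that batching pays off far more for spectrogram generation on GPU than for vocoding on CPU. A quick check:

```python
# Numbers copied from the benchmark table above (assumed: seconds per batch).
batch_sizes = [1, 2, 4, 8]
spec_gpu = [0.042, 0.046, 0.053, 0.062]
audio_cpu = [2.004, 3.922, 7.897, 14.599]

per_item_spec = [t / b for t, b in zip(spec_gpu, batch_sizes)]
per_item_audio = [t / b for t, b in zip(audio_cpu, batch_sizes)]

print(per_item_spec)   # drops roughly 5x from batch size 1 to 8
print(per_item_audio)  # stays nearly flat
```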
- In the wavegan branch, the vocoder code comes from ParallelWaveGAN. Since the acoustic feature extraction methods are not compatible, the features need to be transformed; see here.
- The input of the Mandarin model is pinyin. Because BiaoBei's raw pinyin sequences lack punctuation and the alignment model was not fully trained, the rhythm of the synthesized samples is somewhat off.
- I haven't trained a dedicated Korean vocoder; the LJSpeech vocoder (22050 Hz) is used instead, which may slightly affect the quality of the synthesized audio.
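Because of that sample-rate mismatch (KSS is 44100 Hz while the LJSpeech vocoder expects 22050 Hz), audio has to be resampled before feature extraction. A toy illustration of halving the rate by decimation (a real pipeline would low-pass filter first, e.g. with librosa or scipy):

```python
import numpy as np

sr_in, sr_out = 44100, 22050
t = np.arange(sr_in) / sr_in          # 1 second of audio
x = np.sin(2 * np.pi * 440.0 * t)     # 440 Hz test tone
y = x[:: sr_in // sr_out]             # naive decimation by 2
print(len(x), len(y))                 # 44100 22050
```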
- Kyubyong/tacotron
- r9y9/deepvoice3_pytorch
- tugstugi/pytorch-dc-tts
- janvainer/speedyspeech
- Po-Hsun-Su/pytorch-ssim
- Maghoumi/pytorch-softdtw-cuda
- seungwonpark/melgan
- kan-bayashi/ParallelWaveGAN
- Synthetic speech quality assessment (MOS)
- More tests in different languages
- Speech style transfer (tone)
- E-mail: atomicoo95@gmail.com