This is the official website for the paper
ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
from Microsoft Applied Science Group and UC Berkeley
by Yatong Bai,
Trung Dang,
Dung Tran,
Kazuhito Koishida,
and Somayeh Sojoudi.
[🤗 Live Demo]
[Preprint Paper]
[Project Homepage]
[Code]
[Model Checkpoints]
[Generation Examples]
Our method reduces the computation of the core step of diffusion-based text-to-audio generation by a factor of 400, while incurring minimal performance degradation in terms of Fréchet Audio Distance (FAD), Fréchet Distance (FD), KL divergence, and CLAP scores.
Model | # queries (↓) | CLAP_T (↑) | CLAP_A (↑) | FAD (↓) | FD (↓) | KLD (↓)
---|---|---|---|---|---|---
Diffusion (Baseline) | 400 | 24.57 | 72.79 | 1.908 | 19.57 | 1.350
Consistency + CLAP FT (Ours) | 1 | 24.69 | 72.54 | 2.406 | 20.97 | 1.358
Consistency (Ours) | 1 | 22.50 | 72.30 | 2.575 | 22.08 | 1.354
This benchmark shows how our single-step models compare with previous methods, most of which require hundreds of generation steps.
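To illustrate where the 400x reduction comes from, here is a minimal conceptual sketch (not the actual ConsistencyTTA implementation; the model here is a counting placeholder): an iterative diffusion sampler queries the core network once per denoising step, whereas a distilled consistency model maps noise to a sample in a single query.

```python
def diffusion_generate(model, latent, num_steps=400):
    """Iterative denoising: the core network is queried once per step."""
    for t in reversed(range(num_steps)):
        latent = model(latent, t)
    return latent


def consistency_generate(model, latent):
    """A distilled consistency model produces the sample in one query."""
    return model(latent, 0)


class CountingModel:
    """Placeholder network that only counts how often it is queried."""

    def __init__(self):
        self.queries = 0

    def __call__(self, latent, t):
        self.queries += 1
        return latent  # identity stand-in for a denoising update


diffusion_model = CountingModel()
diffusion_generate(diffusion_model, 0.0)
print(diffusion_model.queries)  # 400

consistency_model = CountingModel()
consistency_generate(consistency_model, 0.0)
print(consistency_model.queries)  # 1
```

The metrics table above shows that collapsing these 400 queries into one costs little in FAD, FD, KLD, or CLAP scores.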
@article{bai2023consistencytta,
title={ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation},
author={Bai, Yatong and Dang, Trung and Tran, Dung and Koishida, Kazuhito and Sojoudi, Somayeh},
journal={arXiv preprint arXiv:2309.10740},
year={2023}
}