| date | tags |
| --- | --- |
| 2023-06-27 | paper, speech, generation |

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

Link to the paper

Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

Meta report

Year: 2023

Samples: https://voicebox.metademolab.com/

This paper describes a non-autoregressive speech generation model that handles tasks such as speech infilling, generation, and zero-shot voice cloning.

The generative model is a Continuous Normalizing Flow (CNF) trained with Flow Matching using optimal-transport paths (to revise: Lipman et al., 2023). At a high level, this is a normalizing flow (hence invertible) in which an ODE solver generates samples iteratively at inference time. The authors claim that this type of model trains faster than a diffusion model, generalizes faster, and performs better. They report needing around 10 function evaluations (NFE) at inference time. They also adapt other tricks from diffusion models, such as classifier-free guidance.
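To make the training target and the iterative sampling concrete, here is a minimal toy sketch in plain Python. It assumes the optimal-transport conditional path from Lipman et al. (2023), a fixed-step Euler solver, and a guidance rule of the form `(1 + alpha) * v_cond - alpha * v_uncond`. All function names, `sigma_min`, and the toy vector field are illustrative assumptions rather than the paper's code; a real system would use a neural network as the vector field and operate on mel-spectrogram frames.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def ot_path(x0: Vector, x1: Vector, t: float,
            sigma_min: float = 1e-5) -> Tuple[Vector, Vector]:
    """Return x_t on the OT conditional path and the regression target u_t.

    x0 is a noise sample, x1 a data sample, t is in [0, 1].
    sigma_min is an assumed small constant (illustrative value).
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def flow_matching_loss(v_pred: Vector, u_t: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_t)) / len(u_t)


def guided_field(v_cond: Vector, v_uncond: Vector, alpha: float) -> Vector:
    """Classifier-free guidance applied to a vector field (assumed form)."""
    return [(1.0 + alpha) * c - alpha * u for c, u in zip(v_cond, v_uncond)]


def euler_sample(field: Callable[[Vector, float], Vector],
                 x0: Vector, n_steps: int = 10) -> Vector:
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with fixed Euler steps.

    Each step costs one evaluation of the field, so n_steps equals the NFE.
    """
    x = list(x0)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        v = field(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy field: with a single target point, the flow points straight at it.
target = [1.0, -2.0]


def toy_field(x: Vector, t: float) -> Vector:
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


sample = euler_sample(toy_field, [0.3, 0.7], n_steps=10)
```

In this toy setting the ten Euler steps land on `target` up to floating-point error, since the straight-line path has no curvature for the solver to miss. That is only meant to build intuition for why few steps can suffice, not to reproduce the paper's results.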

The architecture has two components: an audio generator, which produces mel-spectrograms, and a duration generator, which predicts the phoneme durations.
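One way to picture how the two components connect: the predicted durations stretch the phoneme sequence to frame rate, so that the audio generator has one phoneme label per mel-spectrogram frame. The sketch below assumes durations are integer frame counts; the function name and the example phonemes are hypothetical.

```python
from typing import List


def expand_to_frames(phonemes: List[str], durations: List[int]) -> List[str]:
    """Repeat each phoneme for its number of frames (assumed integer counts)."""
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[str] = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames


# Hypothetical example: two phonemes lasting 2 and 3 frames.
frame_labels = expand_to_frames(["HH", "AY"], [2, 3])
```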

The model can perform the following tasks: content editing through inpainting, denoising through inpainting, text-to-speech (TTS) synthesis, and style transfer.
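As I read it, these tasks differ mainly in which frames are masked before generation. The sketch below is a simplified illustration that assumes one float per frame instead of a full mel-spectrogram frame; the helper names and values are hypothetical. It shows the two steps of inpainting: hide the masked frames from the model, then keep the original frames everywhere else.

```python
from typing import List


def make_context(frames: List[float], mask: List[bool]) -> List[float]:
    """Zero out masked frames so only the unmasked context is visible."""
    return [0.0 if m else f for f, m in zip(frames, mask)]


def merge(frames: List[float], generated: List[float],
          mask: List[bool]) -> List[float]:
    """Take generated values where masked, original values elsewhere."""
    return [g if m else f for f, g, m in zip(frames, generated, mask)]


# Hypothetical example: regenerate the two middle frames (e.g. a noisy
# or edited segment) while keeping the surrounding audio untouched.
frames = [0.1, 0.2, 0.3, 0.4]
mask = [False, True, True, False]
context = make_context(frames, mask)
generated = [9.0, 8.0, 7.0, 6.0]  # stand-in for the model output
result = merge(frames, generated, mask)
```

Under this view, masking a noisy or edited segment gives denoising or content editing, while masking the whole target and keeping only a reference prompt as context gives TTS in the prompt's voice.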