date | tags
---|---
2023-06-04 | paper, deep learning, hifi, speech, codec, vqvae
Dongchao Yang, Songxiang Liu, Rongjie Huang, Jinchuan Tian, Chao Weng, Yuexian Zou
arXiv Preprint
Year: 2023
The authors of this study introduce another VQ-based codec, HiFi-Codec, designed specifically for speech generation. To that end, they focus on high-quality reconstruction with a small number of codebooks.
To achieve this, they introduce a new technique that they call Group Residual Vector Quantization (GRVQ). Instead of stacking quantizers and codebooks in cascade (residual VQ), they propose to apply them in parallel: the continuous latent vector produced by the encoder is split into several groups, and each group is quantized with its own residual VQ.
The motivation for this approach is that Encodec/SoundStream-style RVQ models pack most of the information (content, pitch, prosody, etc.) into the first codebook and use the remaining codebooks to sparsely encode minor details. By splitting the representation from the beginning, the authors claim the information is distributed more evenly across codebooks.
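The grouping idea can be illustrated with a minimal NumPy sketch. This is a hypothetical toy encoder step, not the paper's implementation: `grvq_encode` and the codebook shapes are assumptions chosen for illustration, and real systems learn the codebooks and operate on batches of frames.

```python
import numpy as np

def grvq_encode(z, group_codebooks):
    # Group Residual VQ sketch (hypothetical helper, not the paper's code):
    # split the latent z into one chunk per group, then run a small
    # residual-VQ cascade independently on each chunk.
    # group_codebooks[g] is a list of (K, d_g) arrays, one per residual stage.
    groups = np.split(z, len(group_codebooks))
    indices, quantized_groups = [], []
    for g, codebooks in zip(groups, group_codebooks):
        residual = g
        quantized = np.zeros_like(g)
        ids = []
        for cb in codebooks:  # residual stages within this group
            # nearest-neighbour lookup against this stage's codebook
            dists = ((residual[None, :] - cb) ** 2).sum(axis=1)
            k = int(dists.argmin())
            ids.append(k)
            quantized += cb[k]
            residual = residual - cb[k]  # next stage quantizes the leftover
        indices.append(ids)
        quantized_groups.append(quantized)
    return indices, np.concatenate(quantized_groups)

# toy example: latent dim 8, 2 groups, 2 residual stages, 16 codes per stage
rng = np.random.default_rng(0)
z = rng.normal(size=8)
cbs = [[rng.normal(size=(16, 4)) for _ in range(2)] for _ in range(2)]
ids, z_hat = grvq_encode(z, cbs)
print(ids, float(np.linalg.norm(z - z_hat)))
```

Note how the two groups are quantized independently: each sees only its own 4-dimensional chunk of the latent, so no single codebook has to carry all of the content, pitch, and prosody information at once.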
Although not directly mentioned in the paper, the most interesting property of this approach is that the
The following chart shows the architecture of HiFi-Codec.