Replies: 12 comments 15 replies
-
Thanks, it's on my radar. Is it finished? I remember when I first looked at it, it was still not in a usable state. I might be able to work on it next month.
-
I came here to ask this.
-
It works, but it has high memory requirements for training. It also didn't split the input by token limit, so some length limiting was needed.
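On the splitting point above: a minimal sketch of the kind of pre-chunking that was needed, assuming a simple character budget per chunk. The `split_by_limit` helper and the 400-character limit are illustrative placeholders, not anything from the StyleTTS2 code.

```python
# Hypothetical pre-chunking helper: StyleTTS2 (as reported above) doesn't split
# long inputs by token limit itself, so something like this had to run first.
# The 400-character budget is an illustrative guess, not a StyleTTS2 value.
import re

def split_by_limit(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk would then be synthesized separately and the audio concatenated.
```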
-
Thanks for the insight! Quick question that I haven't fully understood about StyleTTS: how often would you need to be training it?
-
You have to tune it for each voice.
-
That's a solid amount of VRAM. I'm guessing even for inference you'd need a similar amount? In that case, it sounds like the way to go for most people would be CPU inference. And it does sound like a miss that even with optimizations it's the same size.
On Sat, Dec 2, 2023, 10:02 PM 78Alpha wrote:
> As it is, there's a concept model but no real base model. It would be training a base from scratch and then fine-tuning after that. I only did the fine-tuning, but it still used over 24 GB of VRAM at batch size 2 (it was strongly recommended to never use 1).
> It starts at 19 GB but then grows when the additional features kick in. They added accelerate, but even with mixed precision, it was still 19 GB. Maybe just a miss somewhere?
> Inference is fast, though; the longest part was loading the models.
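For context on the mixed-precision point: enabling fp16 via Hugging Face Accelerate usually looks roughly like the sketch below. This is a generic illustration, not the actual StyleTTS2 fine-tuning script; the model, optimizer, and loop here are placeholders.

```python
# Generic illustration of mixed precision with Hugging Face Accelerate; not the
# StyleTTS2 training code, just the usual pattern being discussed above.
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")  # or "bf16" on supported GPUs

model = torch.nn.Linear(512, 512)                  # placeholder for the TTS model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# prepare() wraps the model/optimizer for the chosen precision and device
model, optimizer = accelerator.prepare(model, optimizer)

for _ in range(10):                                # placeholder training loop
    x = torch.randn(2, 512, device=accelerator.device)  # batch size 2, as in the report
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    accelerator.backward(loss)                     # handles gradient scaling for fp16
    optimizer.step()
```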
-
Was just going through the repo. It has unusually high requirements; there are some VRAM and usage tips given in this ticket.
-
Interesting. It sounds like it could be workable. In this case, we'd need at least a notebook for training, and maybe some more integrations to make it easy to rent a GPU for the training. My personal workstation has only 8 GB, so I'm looking at renting myself. Meanwhile, maybe StyleTTS can optimize memory.
On Wed, Dec 6, 2023, 10:49 AM 78Alpha wrote:
> Inference only needs about 3 GB of VRAM. The training, and the need for training, is what kills the dream.
-
So here's what's putting a big freeze on it: software licenses. It wasn't obvious before, but now it is: StyleTTS2 "as demonstrated" relies on phonemizer, which is GPL. Although there are discussions and ways to sidestep that, it hasn't been resolved.
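For reference, the GPL dependency in question is the phonemizer package, which typically drives the eSpeak NG backend. A minimal sketch of the kind of call involved, which may not match exactly how StyleTTS2 invokes it:

```python
# Sketch of the phonemizer usage that creates the GPL dependency; the exact
# arguments StyleTTS2 uses may differ from this illustration.
from phonemizer import phonemize  # GPL-licensed package

text = "How much VRAM does training need?"
phonemes = phonemize(
    text,
    language="en-us",
    backend="espeak",          # wraps the eSpeak NG library
    strip=True,
    preserve_punctuation=True,
    with_stress=True,
)
print(phonemes)
```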
-
I also came across this new OpenVoice going around. Is the license feasible for integration into the WebUI?
-
@ehartford myshell-ai/OpenVoice#114 (comment) Edit: ah, GitHub did the bug where quoting something doesn't make it a reply.
-
StyleTTS2 has been upgraded to a Gradio demo.
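For anyone curious what that entails, a minimal sketch of a Gradio TTS demo; the `synthesize` function here is a placeholder that returns silence, not the real StyleTTS2 inference code.

```python
# Minimal Gradio TTS demo sketch; `synthesize` is a placeholder standing in for
# the actual StyleTTS2 inference call, which is not shown here.
import numpy as np
import gradio as gr

SAMPLE_RATE = 24000  # assumed output rate; the real model's may differ

def synthesize(text: str):
    # Placeholder: return one second of silence instead of real speech.
    return SAMPLE_RATE, np.zeros(SAMPLE_RATE, dtype=np.float32)

demo = gr.Interface(
    fn=synthesize,
    inputs=gr.Textbox(label="Text to speak"),
    outputs=gr.Audio(label="Synthesized speech"),
    title="StyleTTS2 demo (sketch)",
)

if __name__ == "__main__":
    demo.launch()
```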
-
I came across this today thanks to @aedocw of epub2tts:
https://github.com/yl4579/StyleTTS2
This looks promising. I wonder if we can integrate it into the WebUI in the future.