
Is it possible to use audio>10 seconds for training/inference? #1780

Open
yukiarimo opened this issue Nov 14, 2024 · 4 comments

Comments

@yukiarimo

yukiarimo commented Nov 14, 2024

Hello. Is it possible to train GPT-SoVITS-V2 (using the official Google Colab) with audio clips longer than 10 seconds, without splitting them?

Also, what about inference? Why is it limited to 3 < audio < 10 seconds?

@XXXXRT666
Contributor

  1. Yes, there is a limit in the config file
  2. AR is not stable for very long or very short semantic sequences

@yukiarimo
Author

  1. Can you please show me where that config file is, so I can modify it?
  2. What does AR stand for? And why is it not stable? (Just curious, because generation itself seems somewhat stable for long sequences.)

@XXXXRT666
Contributor

  1. GPT_SoVITS/configs/s1longer-v2.yaml max_sec
  2. Autoregressive. Error accumulation and sensitivity to input noise: it is a process with strong sequential dependency
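To make the "error accumulation" point concrete, here is a toy sketch (not GPT-SoVITS code, just an illustration): in an autoregressive rollout each step consumes its own previous output, so a small per-step error enters the state and compounds with sequence length instead of averaging out.

```python
# Toy illustration of autoregressive (AR) error accumulation.
# Each step feeds its slightly noisy output back in as the next input,
# so per-step errors add up over the length of the rollout.
import random

random.seed(0)

def ar_rollout(steps, noise=0.01):
    """Predict x[t+1] = x[t] at each step, with a small per-step error."""
    x = 1.0
    drift = 0.0
    for _ in range(steps):
        err = random.uniform(-noise, noise)
        x = x + err          # the prediction error enters the state...
        drift += abs(err)    # ...and accumulates over the rollout
    return drift

print(f"accumulated error after  100 steps: {ar_rollout(100):.3f}")
print(f"accumulated error after 5000 steps: {ar_rollout(5000):.3f}")
```

The longer rollout always accumulates more error, which is one intuition for why long semantic sequences are harder to keep stable.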

@yukiarimo
Author

yukiarimo commented Nov 15, 2024

Thank you, I'll try! In the Google Colab, do I need to configure anything else, or just change the parameters in the config file? And is there a maximum limit on the length of the training audio and on the text length for speech (with no splitting)?

Also, can you please explain some of these parameters in the config?

train:
  seed: 1234
  epochs: 20
  batch_size: 8
  save_every_n_epoch: 1
  precision: 16-mixed
  gradient_clip: 1.0
optimizer:
  lr: 0.01
  lr_init: 0.00001
  lr_end: 0.0001
  warmup_steps: 2000
  decay_steps: 40000
data:
  max_eval_sample: 8
  max_sec: 54
  num_workers: 4
  pad_val: 1024 # same with EOS in model
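Since the advice above is to change max_sec in s1longer-v2.yaml, a minimal sketch of doing that programmatically (assuming PyYAML is installed; the path is the one named earlier in the thread, and the value 30 is just a hypothetical example that is untested upstream):

```python
# Minimal sketch, assuming PyYAML: raise data.max_sec in the GPT-SoVITS
# s1 training config before launching training.
import yaml

def set_max_sec(config_path, max_sec):
    """Load the training YAML, update data.max_sec, and write it back."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    cfg["data"]["max_sec"] = max_sec
    with open(config_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
    return cfg

# Hypothetical usage (30-second clips are untested upstream):
# set_max_sec("GPT_SoVITS/configs/s1longer-v2.yaml", 30)
```

In Colab you could also just edit the file by hand; the point is only that max_sec lives under the data: section shown above.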

By the way, do I need to fine-tune the two other models (the last two), and what are they responsible for?

from gpt_sovits import TTS, TTS_Config

soviets_configs = {
    "default": {
        "device": "cuda",  #  ["cpu", "cuda"]
        "is_half": True,  #  Set 'False' if you will use cpu
        "t2s_weights_path": "pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt", # gpt model
        "vits_weights_path": "pretrained_models/s2G488k.pth",
        "cnhuhbert_base_path": "pretrained_models/chinese-hubert-base",
        "bert_base_path": "pretrained_models/chinese-roberta-wwm-ext-large"
    }
}

tts_config = TTS_Config(soviets_configs)
