
Is it possible to use audio>10 seconds for training/inference? #1780

Open
yukiarimo opened this issue Nov 14, 2024 · 4 comments

Comments

@yukiarimo

yukiarimo commented Nov 14, 2024

Hello. Is it possible to train GPT-SoVITS-V2 (using the official Google Colab) with audio clips longer than 10 seconds, without splitting them?

Also, what about inference? Why is it limited to 3 < audio < 10 seconds?

@XXXXRT666
Contributor

  1. Yes, there is a limit in the config file
  2. AR is not stable for very long or very short semantic sequences

@yukiarimo
Author

  1. Can you please show me where that config file is, so I can modify it?
  2. What does AR stand for? And why is it not stable? (Just curious, because generation itself seems somewhat stable for long sequences.)

@XXXXRT666
Contributor

  1. GPT_SoVITS/configs/s1longer-v2.yaml max_sec
  2. Autoregressive. Error accumulation and sensitivity to input noise: it is a process with strong sequential dependency
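To make the "error accumulation" point concrete, here is a toy sketch (not GPT-SoVITS code, just an illustration): in an autoregressive rollout each step consumes its own previous output, so a small per-step error enters the state and compounds with sequence length instead of averaging out.

```python
# Toy illustration of autoregressive (AR) error accumulation.
# Each step feeds its slightly noisy output back in as the next input,
# so per-step errors add up over the length of the rollout.
import random

random.seed(0)

def ar_rollout(steps, noise=0.01):
    """Predict x[t+1] = x[t] at each step, with a small per-step error."""
    x = 1.0
    drift = 0.0
    for _ in range(steps):
        err = random.uniform(-noise, noise)
        x = x + err          # the prediction error enters the state...
        drift += abs(err)    # ...and accumulates over the rollout
    return drift

print(f"accumulated error after  100 steps: {ar_rollout(100):.3f}")
print(f"accumulated error after 5000 steps: {ar_rollout(5000):.3f}")
```

The longer rollout always accumulates more error, which is one intuition for why long semantic sequences are harder to keep stable.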

@yukiarimo
Author

yukiarimo commented Nov 15, 2024

Thank you, I'll try! In the Google Colab, do I need to configure anything else, or just change the parameters in the config file? And is there a maximum limit on the length of the training audio and on the text length for speech (with no splitting)?

Also, can you please explain some of these parameters in the config?

train:
  seed: 1234
  epochs: 20
  batch_size: 8
  save_every_n_epoch: 1
  precision: 16-mixed
  gradient_clip: 1.0
optimizer:
  lr: 0.01
  lr_init: 0.00001
  lr_end: 0.0001
  warmup_steps: 2000
  decay_steps: 40000
data:
  max_eval_sample: 8
  max_sec: 54
  num_workers: 4
  pad_val: 1024 # same with EOS in model
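Since the advice above is to change max_sec in s1longer-v2.yaml, a minimal sketch of doing that programmatically (assuming PyYAML is installed; the path is the one named earlier in the thread, and the value 30 is just a hypothetical example that is untested upstream):

```python
# Minimal sketch, assuming PyYAML: raise data.max_sec in the GPT-SoVITS
# s1 training config before launching training.
import yaml

def set_max_sec(config_path, max_sec):
    """Load the training YAML, update data.max_sec, and write it back."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    cfg["data"]["max_sec"] = max_sec
    with open(config_path, "w") as f:
        yaml.safe_dump(cfg, f, sort_keys=False)
    return cfg

# Hypothetical usage (30-second clips are untested upstream):
# set_max_sec("GPT_SoVITS/configs/s1longer-v2.yaml", 30)
```

In Colab you could also just edit the file by hand; the point is only that max_sec lives under the data: section shown above.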

By the way, do I need to fine-tune the two other models (the last two), and what are they responsible for?

from gpt_sovits import TTS, TTS_Config

soviets_configs = {
    "default": {
        "device": "cuda",  #  ["cpu", "cuda"]
        "is_half": True,  #  Set 'False' if you will use cpu
        "t2s_weights_path": "pretrained_models/s1bert25hz-2kh-longer-epoch=68e-step=50232.ckpt", # gpt model
        "vits_weights_path": "pretrained_models/s2G488k.pth",
        "cnhuhbert_base_path": "pretrained_models/chinese-hubert-base",
        "bert_base_path": "pretrained_models/chinese-roberta-wwm-ext-large"
    }
}

tts_config = TTS_Config(soviets_configs)
