It seems you haven't published your code repository, and I have some questions about the model.
This repository only provides the model weights. What is the address of the code repository?
For example: scripts/onnx_inference_pure.py
Thanks
Hello, I see you've already updated your repository.
Please reply to this discussion once you've finished preparing, thank you.☺️
I have another question:
In the official CosyVoice3 examples, prompt_text supports specifying the style and emotion of the generated speech before the <|endofprompt|> token.
Does the model in this repository support this?
```python
# AutoModel is provided by the CosyVoice package
import torchaudio


def cosyvoice3_example():
    """CosyVoice3 usage, check https://funaudiollm.github.io/cosyvoice3/ for more details"""
    cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
    # zero_shot usage
    for i, j in enumerate(cosyvoice.inference_zero_shot(
            '八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。',
            'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>希望你以后能够做的比我还好呦。',
            './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
@Genalp520
Hello! I've finished the preparation.
As for the prompt_text, this repository does not support that feature. I tested it, but adding style or emotion tags before <|endofprompt|> interfered with the inference process, resulting in improper audio output.
@ayousanz
Thank you for your quick fix, the effect is fantastic! However, after testing, I found some issues that don't quite match your description:
- When using a Chinese prompt_wav, you must include <|endofprompt|> and a prompt text, otherwise the generated audio is abnormal.
- The style tag before <|endofprompt|> does produce some effect, but the effect is less pronounced than with the original model.
Here are some examples:

- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. Please say a sentence as quick as possible.<|endofprompt|>八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. Please say a sentence as slow as possible.<|endofprompt|>八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. 我想体验一下小猪佩奇风格,可以吗?<|endofprompt|>八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: 八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
  Result: error, the prompt text appeared in the generated audio.
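The prompt_text shape these tests rely on (style instruction, then <|endofprompt|>, then the prompt audio's transcript) can be sketched as a tiny helper. The function name and default prefix here are my own illustration, not part of this repository:

```python
ENDOFPROMPT = '<|endofprompt|>'


def build_prompt_text(transcript, style='You are a helpful assistant.'):
    """Compose a prompt_text as '<style><|endofprompt|><transcript>'.

    transcript: the text actually spoken in the prompt audio.
    style: instruction placed before <|endofprompt|>.
    """
    return style + ENDOFPROMPT + transcript


# A "fast speech" prompt for the Chinese tongue-twister prompt audio:
prompt_text = build_prompt_text(
    '八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。',
    style='You are a helpful assistant. Please say a sentence as quick as possible.')
```

The returned string can then be passed as the prompt_text argument of inference_zero_shot, as in the zero-shot example above.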
@Genalp520
Thank you so much for your detailed testing and feedback!
I'm surprised to learn that the style tags do have some effect. In my initial tests, the audio output was unstable when using them, so I assumed they were not properly supported. It is very helpful to know that they work to some extent and that <|endofprompt|> is actually required for Chinese prompts.
I really appreciate you pointing this out and providing these examples. This clarifies the model's behavior significantly.
@Genalp520
Current research on similar LLM architectures indicates that INT8 quantization leads to a loss in accuracy. Therefore, we generally plan to support up to FP16 for now.
However, if you are interested, I have actually experimented with the export in the branch below. Please feel free to try it out: https://github.com/ayutaz/CosyVoice/tree/feature/onnx-unity-export
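On the INT8 point, here is a self-contained NumPy sketch (not tied to this repository) comparing the round-trip error of an FP16 cast against symmetric per-tensor INT8 quantization on a random stand-in weight tensor; it illustrates why FP16 is the safer default:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10000).astype(np.float32)  # stand-in for a weight tensor

# FP16: cast down and back up.
fp16_err = np.abs(w - w.astype(np.float16).astype(np.float32)).mean()

# INT8: symmetric per-tensor quantization with a single scale.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
int8_err = np.abs(w - q.astype(np.float32) * scale).mean()

print(f'mean abs error  fp16: {fp16_err:.6f}  int8: {int8_err:.6f}')
```

Real INT8 pipelines use calibration and per-channel scales to narrow this gap, but the per-element resolution limit is the reason the accuracy loss is hard to avoid entirely.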