It seems you haven't published your code repository, and I have some questions about the model.
This repository only provides the model weights. What is the address of the code repository?
For example: scripts/onnx_inference_pure.py
Thanks
Hello, I see you've already updated your repository.
Please reply to this discussion once you've finished preparing, thank you.☺️
I have another question:
In the official CosyVoice3 examples, prompt_text supports specifying the style and emotion of the generated speech before the <|endofprompt|> token.
Does the model in this repository support this?
```python
# AutoModel is provided by the CosyVoice package
import torchaudio


def cosyvoice3_example():
    """CosyVoice3 usage, check https://funaudiollm.github.io/cosyvoice3/ for more details"""
    cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')
    # zero_shot usage
    for i, j in enumerate(cosyvoice.inference_zero_shot(
            '八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。',
            'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>希望你以后能够做的比我还好呦。',
            './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
```
@Genalp520
Hello! I've finished the preparation.
As for the prompt_text, this repository does not support that feature. I tested it, but adding style or emotion tags before <|endofprompt|> interfered with the inference process, resulting in improper audio output.
@ayousanz
Thank you for your quick fix, the effect is fantastic! However, after testing, I found some issues that don't quite match your description:
- When using a Chinese prompt_wav, you must include <|endofprompt|> and a prompt text, otherwise the generated audio is abnormal.
- The style tag before <|endofprompt|> does produce some effect, but the effect is less pronounced than with the original model.
Here are some examples:

- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. Please say a sentence as quick as possible.<|endofprompt|>八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. Please say a sentence as slow as possible.<|endofprompt|>八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: You are a helpful assistant. 我想体验一下小猪佩奇风格,可以吗?<|endofprompt|>八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
- Input text: Hi everyone, the weather is so nice today! Shall we go on a picnic together?
  Prompt audio: pretrained_models/Fun-CosyVoice3-0.5B/onnx/prompts/zero_shot_0.wav
  Prompt text: 八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。
  Result: error, the prompt text appeared in the generated audio.
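The prompt_text shape these tests rely on (style instruction, then <|endofprompt|>, then the prompt audio's transcript) can be sketched as a tiny helper. The function name and default prefix here are my own illustration, not part of this repository:

```python
ENDOFPROMPT = '<|endofprompt|>'


def build_prompt_text(transcript, style='You are a helpful assistant.'):
    """Compose a prompt_text as '<style><|endofprompt|><transcript>'.

    transcript: the text actually spoken in the prompt audio.
    style: instruction placed before <|endofprompt|>.
    """
    return style + ENDOFPROMPT + transcript


# A "fast speech" prompt for the Chinese tongue-twister prompt audio:
prompt_text = build_prompt_text(
    '八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。',
    style='You are a helpful assistant. Please say a sentence as quick as possible.')
```

The returned string can then be passed as the prompt_text argument of inference_zero_shot, as in the zero-shot example above.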
@Genalp520
Thank you so much for your detailed testing and feedback!
I'm surprised to learn that the style tags do have some effect. In my initial tests, the audio output was unstable when using them, so I assumed they were not properly supported. It is very helpful to know that they work to some extent and that <|endofprompt|> is actually required for Chinese prompts.
I really appreciate you pointing this out and providing these examples. This clarifies the model's behavior significantly.
@Genalp520
Current research on similar LLM architectures indicates that INT8 quantization leads to a loss in accuracy. Therefore, we generally plan to support up to FP16 for now.
However, if you are interested, I have actually experimented with the export in the branch below. Please feel free to try it out: https://github.com/ayutaz/CosyVoice/tree/feature/onnx-unity-export
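On the INT8 point, here is a self-contained NumPy sketch (not tied to this repository) comparing the round-trip error of an FP16 cast against symmetric per-tensor INT8 quantization on a random stand-in weight tensor; it illustrates why FP16 is the safer default:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(10000).astype(np.float32)  # stand-in for a weight tensor

# FP16: cast down and back up.
fp16_err = np.abs(w - w.astype(np.float16).astype(np.float32)).mean()

# INT8: symmetric per-tensor quantization with a single scale.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
int8_err = np.abs(w - q.astype(np.float32) * scale).mean()

print(f'mean abs error  fp16: {fp16_err:.6f}  int8: {int8_err:.6f}')
```

Real INT8 pipelines use calibration and per-channel scales to narrow this gap, but the per-element resolution limit is the reason the accuracy loss is hard to avoid entirely.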