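Highlights🔥
Fun-CosyVoice 3.0 is an advanced text-to-speech (TTS) system based on large language models (LLMs). It surpasses its predecessor, CosyVoice 2.0, in content consistency, speaker similarity, and prosody naturalness, and it is designed for zero-shot multilingual speech synthesis in the wild.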
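Key Features
- Language coverage: Supports 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian) and 18+ Chinese dialects/accents (Cantonese, Minnan, Sichuan, Dongbei, Shaanxi, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.), along with multilingual and cross-lingual zero-shot voice cloning.
- Content consistency & naturalness: Reaches industry-leading levels of content fidelity, speaker similarity, and prosody fluency.
- Pronunciation inpainting: Supports pronunciation correction with Chinese pinyin and English CMU phonemes, giving finer control for production use.
- Text normalization: Reads numbers, special symbols, and various text formats correctly without a traditional frontend module.
- Bi-streaming: Supports both streaming text input and streaming audio output, with latency as low as 150 ms while maintaining high-quality audio.
- Instruction control: Supports instructions for language, dialect, emotion, speed, volume, and more.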
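The instruction-control and pronunciation-inpainting features above are driven by plain text markup in the usage examples of this README: an instruction is placed before the `<|endofprompt|>` token, and a character is replaced by bracketed pinyin parts such as `[j][ǐ]`. The sketch below only assembles such strings. The helper names (`zero_shot_prompt`, `instruct_prompt`, `inpaint_pinyin`) are hypothetical and not part of the CosyVoice API, and the initial/final split is inferred from that single example.

```python
# Hypothetical string helpers; they mirror the literal prompts used in the
# usage examples of this README and do not call into CosyVoice itself.
SYSTEM_PREFIX = "You are a helpful assistant."
END_OF_PROMPT = "<|endofprompt|>"


def zero_shot_prompt(prompt_transcript: str) -> str:
    """Prompt text for zero-shot cloning: system prefix, token, then the
    transcript of the reference audio."""
    return f"{SYSTEM_PREFIX}{END_OF_PROMPT}{prompt_transcript}"


def instruct_prompt(instruction: str) -> str:
    """Prompt text for instruction control: the instruction sits between the
    system prefix and the end-of-prompt token."""
    return f"{SYSTEM_PREFIX} {instruction}{END_OF_PROMPT}"


def inpaint_pinyin(text: str, target: str, initial: str, final: str,
                   occurrence: int = 0) -> str:
    """Replace one occurrence of `target` with bracketed pinyin parts,
    e.g. '[j][ǐ]' (assumed format: [initial][final with tone mark])."""
    start = -1
    for _ in range(occurrence + 1):
        start = text.find(target, start + 1)
        if start == -1:
            raise ValueError(f"{target!r} not found often enough in text")
    return text[:start] + f"[{initial}][{final}]" + text[start + len(target):]


if __name__ == "__main__":
    print(instruct_prompt("请用广东话表达。"))
    print(inpaint_pinyin("对报道给予好评。", "给", "j", "ǐ"))
```

The returned strings can be passed in the same argument positions as the literal strings in the `inference_zero_shot` and `inference_instruct2` calls shown under Basic Usage.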
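Roadmap
December 2025
- Release the Fun-CosyVoice3-0.5B-2512 base model, the RL model, and their training/inference scripts
- Release the Fun-CosyVoice3-0.5B modelscope gradio space
August 2025
- Thanks to the contribution of Yuekai Zhang from NVIDIA, add triton trtllm runtime support and cosyvoice2 grpo training support
July 2025
- Release the Fun-CosyVoice 3.0 evaluation set
May 2025
- Add CosyVoice2-0.5B vllm support
December 2024
- Release 25hz CosyVoice2-0.5B
September 2024
- 25hz CosyVoice-300M base model
- 25hz CosyVoice-300M voice conversion function
August 2024
- Repetition-aware sampling (RAS) inference for better LLM stability
- Streaming inference mode support, including kv cache and sdpa for real-time factor (RTF) optimization
July 2024
- Flow matching training support
- WeTextProcessing support when ttsfrd is not available
- Fastapi server and client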
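The August 2024 entry mentions repetition-aware sampling (RAS). As a rough illustration of the idea, and not the repository's implementation, the sketch below draws a token with top-p/top-k (nucleus) sampling and, if that token already appears too often in a recent window of decoded tokens, redraws it from the full distribution. The function names, the plain-list representation of probabilities, and the default values of `top_p`, `top_k`, `win_size`, and `tau_r` are assumptions chosen for illustration.

```python
import random
from typing import Optional, Sequence


def nucleus_sample(probs: Sequence[float], top_p: float = 0.8,
                   top_k: int = 25,
                   rng: Optional[random.Random] = None) -> int:
    """Sample a token id from the smallest high-probability set whose mass
    reaches top_p, capped at top_k candidates."""
    rng = rng or random
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for idx in ranked:
        if mass >= top_p or len(kept) >= top_k:
            break
        kept.append(idx)
        mass += probs[idx]
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def ras_sample(probs: Sequence[float], history: Sequence[int],
               top_p: float = 0.8, top_k: int = 25,
               win_size: int = 10, tau_r: float = 0.1,
               rng: Optional[random.Random] = None) -> int:
    """Repetition-aware sampling sketch (all defaults are illustrative)."""
    rng = rng or random
    token = nucleus_sample(probs, top_p, top_k, rng)
    # Count how often the candidate appears among the last win_size tokens.
    repeats = sum(1 for t in history[-win_size:] if t == token)
    if repeats >= win_size * tau_r:
        # Too repetitive: fall back to sampling from the full distribution.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token


if __name__ == "__main__":
    probs = [0.97, 0.01, 0.01, 0.01]
    print(ras_sample(probs, history=[0] * 10, rng=random.Random(0)))
```

With a sharply peaked distribution, nucleus sampling alone keeps returning the same token; the fallback gives low-probability tokens a chance to break the loop.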
Evaluation
| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh Speaker Similarity (%) ↑ | test-en WER (%) ↓ | test-en Speaker Similarity (%) ↑ | test-hard CER (%) ↓ | test-hard Speaker Similarity (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
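To make the last two rows easier to compare, the following sketch computes the relative error-rate reduction and the change in speaker similarity between Fun-CosyVoice3-0.5B-2512 and its RL variant. The numbers are copied from the table; the variable and function names are illustrative only.

```python
# Values copied from the last two rows of the table above.
base = {"test-zh": (1.21, 78.0), "test-en": (2.24, 71.8), "test-hard": (6.71, 75.8)}
rl = {"test-zh": (0.81, 77.4), "test-en": (1.68, 69.5), "test-hard": (5.44, 75.0)}


def relative_reduction(before: float, after: float) -> float:
    """Relative error-rate reduction in percent (positive means fewer errors)."""
    return (before - after) / before * 100.0


summary = {}
for test_set, (base_err, base_sim) in base.items():
    rl_err, rl_sim = rl[test_set]
    summary[test_set] = {
        "error_reduction_pct": round(relative_reduction(base_err, rl_err), 1),
        "similarity_delta": round(rl_sim - base_sim, 1),
    }

for test_set, row in summary.items():
    print(test_set, row)
```

On these figures the RL model lowers CER/WER on all three test sets, while speaker similarity is slightly lower than the base model in each case.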
Install
Clone and Install
Clone the repo:

    git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
    # If cloning the submodule fails due to network issues, rerun the following commands until they succeed
    cd CosyVoice
    git submodule update --init --recursive

Install Conda: please see https://docs.conda.io/en/latest/miniconda.html
Create a Conda environment:

    conda create -n cosyvoice -y python=3.10
    conda activate cosyvoice
    pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com

    # If you encounter sox compatibility issues
    # ubuntu
    sudo apt-get install sox libsox-dev
    # centos
    sudo yum install sox sox-devel
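Before moving on, it can help to confirm that the active interpreter matches the Python 3.10 environment created above and that a `sox` binary is on the `PATH`. The check below is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper, not something shipped with the repository.

```python
import shutil
import sys


def check_environment(required=(3, 10)):
    """Report whether the interpreter matches the expected Python version
    and whether a sox executable can be found on PATH."""
    major_minor = tuple(sys.version_info[:2])
    return {
        "python_version": "{}.{}".format(*major_minor),
        "python_ok": major_minor == tuple(required),
        "sox_found": shutil.which("sox") is not None,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `sox_found` is false, the `apt-get` or `yum` commands above are the place to start.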
Model Download

    from huggingface_hub import snapshot_download
    snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
    snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')

Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.
Note that this step is not required. If the ttsfrd package is not installed, wetext is used by default.

    cd pretrained_models/CosyVoice-ttsfrd/
    unzip resource.zip -d .
    pip install ttsfrd_dependency-0.1-py3-none-any.whl
    pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl

Basic Usage

    import sys
    sys.path.append('third_party/Matcha-TTS')
    from cosyvoice.cli.cosyvoice import AutoModel
    import torchaudio

    """ CosyVoice3 Usage, check https://funaudiollm.github.io/cosyvoice3/ for more details
    """
    cosyvoice = AutoModel(model_dir='pretrained_models/Fun-CosyVoice3-0.5B')

    # en zero_shot usage
    for i, j in enumerate(cosyvoice.inference_zero_shot('CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities.',
                                                        'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                        './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

    # zh zero_shot usage (reuses the same output filenames as the previous example)
    for i, j in enumerate(cosyvoice.inference_zero_shot('八百标兵奔北坡,北坡炮兵并排跑,炮兵怕把标兵碰,标兵怕碰炮兵炮。',
                                                        'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                        './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

    # fine grained control, for supported control, check cosyvoice/tokenizer/tokenizer.py#L280
    for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>[breath]因为他们那一辈人[breath]在乡里面住的要习惯一点,[breath]邻居都很活络,[breath]嗯,都很熟悉。[breath]',
                                                            './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('fine_grained_control_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

    # instruct usage, for supported control, check cosyvoice/utils/common.py#L28
    for i, j in enumerate(cosyvoice.inference_instruct2('好少咯,一般系放嗰啲国庆啊,中秋嗰啲可能会咯。',
                                                        'You are a helpful assistant. 请用广东话表达。<|endofprompt|>',
                                                        './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
    for i, j in enumerate(cosyvoice.inference_instruct2('收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。',
                                                        'You are a helpful assistant. 请用尽可能快地语速说一句话。<|endofprompt|>',
                                                        './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('instruct_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

    # hotfix usage
    for i, j in enumerate(cosyvoice.inference_zero_shot('高管也通过电话、短信、微信等方式对报道[j][ǐ]予好评。',
                                                        'You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。',
                                                        './asset/zero_shot_prompt.wav', stream=False)):
        torchaudio.save('hotfix_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)

Acknowledgements
- We borrowed a lot of code from FunASR.
- We borrowed a lot of code from FunCodec.
- We borrowed a lot of code from Matcha-TTS.
- We borrowed a lot of code from AcademiCodec.
- We borrowed a lot of code from WeNet.
Citations

    @article{du2024cosyvoice,
      title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
      author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
      journal={arXiv preprint arXiv:2407.05407},
      year={2024}
    }

    @article{du2024cosyvoice2,
      title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
      author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
      journal={arXiv preprint arXiv:2412.10117},
      year={2024}
    }

    @article{du2025cosyvoice,
      title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
      author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},
      journal={arXiv preprint arXiv:2505.17589},
      year={2025}
    }

    @inproceedings{lyu2025build,
      title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},
      author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},
      booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
      pages={1--2},
      year={2025},
      organization={IEEE}
    }