CosyVoice 2025 Complete Guide: The Ultimate Multi-lingual Text-to-Speech Solution
🎯 Core Highlights (TL;DR)
- State-of-the-art Performance: Fun-CosyVoice 3.0 achieves industry-leading content consistency (0.81% CER) and speaker similarity (77.4%) with only 0.5B parameters
- Extensive Language Support: Covers 9 major languages and 18+ Chinese dialects with zero-shot voice cloning capability
- Production-Ready Features: Bi-streaming support with ultra-low latency (150ms), pronunciation inpainting, and instruction-based control
- Open-Source & Scalable: Fully open-source with complete training/inference/deployment pipeline and multiple runtime options (vLLM, TensorRT-LLM)
Table of Contents
- What is CosyVoice?
- Key Features and Capabilities
- Model Versions Comparison
- Performance Benchmarks
- Installation and Setup
- Usage Guide
- Deployment Options
- Best Practices
- FAQ
What is CosyVoice?
CosyVoice is an advanced Large Language Model (LLM)-based Text-to-Speech (TTS) system developed by FunAudioLLM. It represents a significant leap in zero-shot multilingual speech synthesis technology, enabling natural voice generation across multiple languages without requiring extensive training data for each speaker.
Evolution Timeline
The CosyVoice family has evolved through three major versions:
- CosyVoice 1.0 (July 2024): Initial release with 300M parameters, establishing the foundation for scalable multilingual TTS
- CosyVoice 2.0 (December 2024): Introduced streaming capabilities with 0.5B parameters and enhanced LLM architecture
- Fun-CosyVoice 3.0 (December 2025): Current state-of-the-art with reinforcement learning optimization and in-the-wild speech generation
💡 Expert Insight
CosyVoice 3.0's use of supervised semantic tokens and flow matching training enables it to achieve human-like speech quality while maintaining computational efficiency, a critical balance for production deployments.
Key Features and Capabilities
🌍 Language Coverage
Supported Languages:
- 9 Major Languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
- 18+ Chinese Dialects: Guangdong (Cantonese), Minnan, Sichuan, Dongbei, Shaanxi (Shan3xi), Shanxi (Shan1xi), Shanghai, Tianjin, Shandong, Ningxia, Gansu, and more
Cross-lingual Capabilities:
- Zero-shot voice cloning across different languages (see the sketch below)
- Multi-lingual speech synthesis from single prompt
- Accent-preserving voice conversion
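A minimal sketch of cross-lingual cloning, using the `inference_cross_lingual` API that the repo documents for the CosyVoice 2 class (the Fun-CosyVoice 3 entry point may differ; check the current README). The prompt audio supplies the voice, while the text can be in another language; the file path here is a placeholder:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')

# Reference clip in the source language (placeholder path), loaded as a 16kHz waveform
prompt_speech_16k = load_wav('path/to/chinese_reference.wav', 16000)

# English text rendered in the Chinese speaker's voice
for i, output in enumerate(cosyvoice.inference_cross_lingual(
    'And then, against all odds, the prototype worked on the very first try.',
    prompt_speech_16k
)):
    torchaudio.save(f'cross_lingual_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
```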
🎯 Advanced Technical Features
| Feature | Description | Use Case |
|---|---|---|
| Pronunciation Inpainting | Support for Chinese Pinyin and English CMU phonemes | Precise control over pronunciation for brand names, technical terms |
| Bi-Streaming | Text-in and audio-out streaming | Real-time applications with 150ms latency |
| Instruct Support | Control language, dialect, emotion, speed, volume | Dynamic voice customization (see the sketch below) |
| Text Normalization | Automatic handling of numbers, symbols, formats | No frontend module required |
| RAS Inference | Repetition Aware Sampling for LLM stability | Prevents audio artifacts and repetitions |
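As a concrete illustration of the Instruct Support row, CosyVoice 2 documents an `inference_instruct2` call that takes a natural-language instruction alongside prompt audio; whether Fun-CosyVoice 3 keeps this exact signature should be verified against the repo. A hedged sketch, assuming a loaded CosyVoice 2 model and a 16kHz prompt waveform:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech_16k = load_wav('path/to/reference_audio.wav', 16000)  # placeholder path

# The instruction ("say this in Sichuan dialect") controls dialect; emotion and speed work similarly
for i, output in enumerate(cosyvoice.inference_instruct2(
    '收到好友从远方寄来的生日礼物，那份意外的惊喜让我心中充满了甜蜜的快乐。',
    '用四川话说这句话',
    prompt_speech_16k
)):
    torchaudio.save(f'instruct_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
```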
📊 Performance Characteristics
- Latency: as low as 150ms (streaming mode)
- Model Size: 0.5B parameters (Fun-CosyVoice3)
- Token Rate: 25Hz supervised semantic tokens (the tokenizer's frame rate, not the audio sample rate)
- Streaming: KV cache + SDPA optimization
- Acceleration: up to 4x speedup with TensorRT-LLM
⚠️ Important Note
While CosyVoice 3.0 offers impressive capabilities, optimal performance requires GPU acceleration. CPU-only inference may result in significantly slower generation times.
Model Versions Comparison
Available Models
| Model | Parameters | Best For | Key Advantage |
|---|---|---|---|
| Fun-CosyVoice3-0.5B-2512 | 0.5B | Production use, best overall quality | SOTA base model; highest speaker similarity in the family |
| Fun-CosyVoice3-0.5B-2512_RL | 0.5B | Maximum accuracy | Lowest CER (0.81%) and WER (1.68%) |
| CosyVoice2-0.5B | 0.5B | Streaming applications | Optimized for real-time synthesis |
| CosyVoice-300M | 300M | Resource-constrained environments | Smaller footprint, good quality |
| CosyVoice-300M-SFT | 300M | Supervised fine-tuning tasks | Pre-trained for specific voice styles |
| CosyVoice-300M-Instruct | 300M | Instruction-based synthesis | Enhanced control capabilities |
Version Selection Guide
For a quick decision rule, see the Model Selection Strategy framework under Best Practices below: the 2512_RL checkpoint maximizes accuracy, the 2512 base scores highest on speaker similarity, and CosyVoice2-0.5B remains the choice for streaming.
Performance Benchmarks
Comprehensive Evaluation Results
The following table compares Fun-CosyVoice 3.0 against leading open-source and closed-source TTS systems:
| Model | Open-Source | Size | test-zh CER (%) ↓ | test-zh Speaker Sim (%) ↑ | test-en WER (%) ↓ | test-en Speaker Sim (%) ↑ | test-hard CER (%) ↓ | test-hard Speaker Sim (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
Key Performance Insights
✅ Best Practices
- Content Accuracy: Fun-CosyVoice3 RL achieves 0.81% CER on Chinese test set, outperforming models 3x larger
- Speaker Similarity: the 78.0% score on the Chinese test set exceeds even the human reference (75.5%)
- Challenging Scenarios: 5.44% CER on hard test set demonstrates robust handling of complex speech patterns
- Efficiency: Achieves SOTA results with only 0.5B parameters vs. 1.5B+ competitors
Installation and Setup
Prerequisites
- Operating System: Linux (Ubuntu/CentOS recommended)
- Python Version: 3.10
- GPU: NVIDIA GPU with CUDA support (recommended for optimal performance)
- Conda: Miniconda or Anaconda
Step-by-Step Installation
1. Clone the Repository
```bash
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# If submodule cloning fails due to network issues
git submodule update --init --recursive
```
2. Create Conda Environment
```bash
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
3. Install System Dependencies
```bash
# Ubuntu
sudo apt-get install sox libsox-dev

# CentOS
sudo yum install sox sox-devel
```
4. Download Pre-trained Models
For Hugging Face Users (Recommended for International Users):
```python
from huggingface_hub import snapshot_download

# Download Fun-CosyVoice 3.0 (recommended)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Download CosyVoice 2.0
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')

# Download text normalization resources
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
For ModelScope Users (China Region):
```python
from modelscope import snapshot_download

snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
5. Optional: Install Enhanced Text Normalization
For improved text normalization (especially for Chinese):
```bash
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
💡 Pro Tip
If you skip the ttsfrd installation, CosyVoice will automatically fall back to WeTextProcessing. While functional, ttsfrd provides better accuracy for number and symbol normalization.
Usage Guide
Quick Start with Web Demo
The fastest way to experience CosyVoice:
```bash
# Launch web interface
python3 webui.py --port 50000 --model_dir pretrained_models/Fun-CosyVoice3-0.5B

# For instruct mode
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct
```
Access the interface at http://localhost:50000
Python API Usage
Basic Inference Example
```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Initialize model (use the CosyVoice2 class for CosyVoice2-0.5B checkpoints)
cosyvoice = CosyVoice('pretrained_models/Fun-CosyVoice3-0.5B')

# Zero-shot voice cloning: the reference audio is loaded as a 16kHz waveform
prompt_speech_16k = load_wav('path/to/reference_audio.wav', 16000)
text = "Hello, this is a test of CosyVoice zero-shot synthesis."

for i, output in enumerate(cosyvoice.inference_zero_shot(
    text,
    "Reference text spoken in the audio",  # exact transcript of the prompt audio
    prompt_speech_16k
)):
    torchaudio.save(f'zero_shot_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
```
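Streaming synthesis is what delivers the ~150ms first-packet latency quoted earlier: with `stream=True` (documented for CosyVoice 2), the generator yields short audio chunks as they are produced. A sketch reusing the objects from the example above:

```python
import torch
import torchaudio

chunks = []
for output in cosyvoice.inference_zero_shot(
    text,
    "Reference text spoken in the audio",
    prompt_speech_16k,
    stream=True  # yield audio incrementally instead of one whole utterance
):
    chunk = output['tts_speech']  # shape (1, samples); play or forward immediately in a real app
    chunks.append(chunk)

# Optionally stitch the chunks back into one file
torchaudio.save('streamed.wav', torch.concat(chunks, dim=1), cosyvoice.sample_rate)
```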
Advanced Usage: Instruction-Based Synthesis
```python
# Control emotion, speed, and other delivery via a natural-language instruction.
# Note: inference_instruct targets the CosyVoice-300M-Instruct model and takes a
# pretrained speaker id; CosyVoice 2 uses inference_instruct2 with prompt audio.
for output in cosyvoice.inference_instruct(
    "Your text here",
    "中文女",  # spk_id: one of the model's pretrained speakers
    "Speak with excitement at a moderate pace"
):
    # Process output['tts_speech']
    pass
```
vLLM Acceleration (CosyVoice 2.0)
For maximum inference speed with CosyVoice 2.0:
Setup vLLM Environment
```bash
# Create a separate environment for vLLM
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
pip install vllm==v0.9.0 transformers==4.51.3
```
Run vLLM Inference
```bash
python vllm_example.py
```
⚠️ Compatibility Note
vLLM v0.9.0 requires specific versions of PyTorch (2.7.0) and Transformers (4.51.3). Ensure your hardware supports these requirements before installation.
Deployment Options
Docker Deployment (Recommended for Production)
gRPC Server Deployment
```bash
cd runtime/python
docker build -t cosyvoice:v1.0 .

# Launch gRPC server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && \
  python3 server.py --port 50000 --max_conc 4 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with client
cd grpc
python3 client.py --port 50000 --mode zero_shot
```
FastAPI Server Deployment
```bash
# Launch FastAPI server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && \
  python3 server.py --port 50000 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with client
cd fastapi
python3 client.py --port 50000 --mode sft
```
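If you prefer a plain HTTP call over the bundled client, something like the following can work against the FastAPI runtime. Treat it strictly as a sketch: the endpoint name, form fields, streamed 16-bit PCM response, and 24kHz sample rate are assumptions based on runtime/python/fastapi/server.py, so verify them against your checkout.

```python
import numpy as np
import requests
import torch
import torchaudio

# Assumed endpoint and form fields; confirm in runtime/python/fastapi/server.py
resp = requests.post(
    'http://localhost:50000/inference_sft',
    data={'tts_text': 'Hello from the FastAPI runtime.', 'spk_id': '中文女'},
    stream=True,
)
resp.raise_for_status()

# Assumes the server streams raw 16-bit PCM audio
pcm = b''.join(resp.iter_content(chunk_size=16000))
audio = torch.from_numpy(
    np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / (2 ** 15)
).unsqueeze(0)
torchaudio.save('fastapi_output.wav', audio, 24000)  # assumed output sample rate
```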
TensorRT-LLM Deployment (4x Acceleration)
For maximum performance with CosyVoice 2.0:
```bash
cd runtime/triton_trtllm
docker compose up -d
```
Performance Comparison:
| Runtime | Relative Speed | Use Case |
|---|---|---|
| HuggingFace Transformers | 1x (baseline) | Development, testing |
| vLLM | 2-3x | Production with moderate load |
| TensorRT-LLM | 4x | High-throughput production |
✅ Deployment Best Practice
- Development: Use web demo or Python API
- Small-scale Production: FastAPI with Docker
- Large-scale Production: TensorRT-LLM with load balancing
- Real-time Applications: vLLM or TensorRT-LLM with streaming
Best Practices
Model Selection Strategy
📋 Decision Framework:
1. Quality Priority → Fun-CosyVoice3-0.5B-2512_RL
2. Balanced Performance → Fun-CosyVoice3-0.5B-2512
3. Real-time Streaming → CosyVoice2-0.5B + vLLM
4. Resource Constraints → CosyVoice-300M
5. Custom Control → CosyVoice-300M-Instruct
Optimization Tips
For Low Latency
- Enable Streaming Mode: Use bi-streaming for text-in and audio-out
- KV Cache: Ensure KV cache is enabled in inference config
- SDPA Optimization: Utilize Scaled Dot-Product Attention
- Batch Processing: Group similar-length inputs (see the sketch below)
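Bucketing inputs by length keeps padding waste low when you batch requests. A minimal, framework-agnostic sketch; the 50-character bucket width is an arbitrary illustration, not a value from the CosyVoice repo:

```python
from itertools import groupby

def bucket_by_length(texts, bucket_chars=50):
    """Group texts whose lengths fall in the same bucket so batches pad minimally."""
    ordered = sorted(texts, key=len)
    return [list(g) for _, g in groupby(ordered, key=lambda t: len(t) // bucket_chars)]

for batch in bucket_by_length([
    "Hi.",
    "Short line.",
    "A somewhat longer sentence that belongs in a different bucket entirely.",
]):
    print(batch)  # submit each bucket to the TTS backend as one batch
```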
For High Quality
- Use RL Model: Fun-CosyVoice3-0.5B-2512_RL for maximum accuracy
- Provide Clear Prompts: High-quality reference audio (3-10 seconds)
- Text Normalization: Install ttsfrd for better preprocessing
- Pronunciation Control: Use pinyin/phoneme inpainting for critical terms
For Multilingual Applications
- Language-Specific Prompts: Provide reference audio in target language
- Cross-lingual Cloning: Use instruct mode to specify target language
- Dialect Support: Leverage 18+ Chinese dialect capabilities
- Mixed Language: Segment text by language for optimal results (see the sketch below)
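A simple way to segment mixed Chinese/English input is a regex split on script boundaries before sending each run to synthesis. This is an illustrative helper, not part of the CosyVoice API, and its punctuation handling is deliberately naive:

```python
import re

# Runs of CJK characters (plus common fullwidth punctuation) vs. everything else
_SEGMENT = re.compile(
    r'[\u4e00-\u9fff\uff0c\u3002\uff01\uff1f\u3001]+'
    r'|[^\u4e00-\u9fff\uff0c\u3002\uff01\uff1f\u3001]+'
)

def split_by_script(text: str) -> list[str]:
    """Split text into alternating CJK / non-CJK runs for per-language synthesis."""
    return [m.group(0).strip() for m in _SEGMENT.finditer(text) if m.group(0).strip()]

print(split_by_script("今天的会议 covers the Q3 roadmap，请大家准时参加。"))
# ['今天的会议', 'covers the Q3 roadmap', '，请大家准时参加。']
```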
Common Pitfalls to Avoid
⚠️ Warning: Common Issues
- Insufficient GPU Memory: 0.5B models require ~8GB VRAM minimum
- Poor Reference Audio: Background noise or multiple speakers degrade cloning
- Text Format Issues: Ensure proper encoding (UTF-8) for non-English text
- Version Mismatch: vLLM compatibility requires specific package versions
- Network Timeouts: Use ModelScope mirrors in China region
🤔 Frequently Asked Questions
Q: What's the difference between CosyVoice 2.0 and 3.0?
A: Fun-CosyVoice 3.0 introduces several key improvements:
- Reinforcement Learning Optimization: RL-trained model achieves 0.81% CER vs. 1.45% in v2.0
- Enhanced Naturalness: Improved prosody and speaker similarity through post-training
- In-the-wild Performance: Better handling of challenging real-world scenarios (5.44% vs. 6.83% CER on hard test set)
- Pronunciation Control: Advanced pinyin/phoneme inpainting capabilities
Q: Can I use CosyVoice for commercial applications?
A: Yes, CosyVoice is open-source and available for commercial use. However:
- Review the license terms in the GitHub repository
- Ensure compliance with voice cloning regulations in your jurisdiction
- The disclaimer states content is for academic purposes; verify production use rights
- Consider ethical implications of voice cloning technology
Q: How much GPU memory do I need?
A: Memory requirements vary by model:
- CosyVoice-300M: ~4-6GB VRAM
- CosyVoice2-0.5B: ~6-8GB VRAM
- Fun-CosyVoice3-0.5B: ~8-10GB VRAM
- Batch Inference: Add 2-4GB per additional concurrent request
For CPU-only inference, expect 16GB+ RAM and significantly slower speeds (10-50x slower).
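To check what your GPU offers before picking a model (plain PyTorch, nothing CosyVoice-specific):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; expect CPU-only inference to be far slower.")
```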
Q: Which languages are best supported?
A: Based on evaluation data:
- Excellent: Chinese (Mandarin), English
- Very Good: Japanese, Korean
- Good: German, Spanish, French, Italian, Russian
- Dialects: 18+ Chinese dialects with varying quality
English and Chinese have the most extensive training data and achieve the best results.
Q: How do I improve voice cloning quality?
A: Follow these guidelines (a preprocessing sketch follows the list):
1. Reference Audio Quality:
   - Duration: 3-10 seconds optimal
   - Single speaker only
   - Clear speech, minimal background noise
   - Natural speaking pace
2. Prompt Text Accuracy:
   - Provide the exact transcription of the reference audio
   - Match language and dialect
3. Model Selection:
   - Use Fun-CosyVoice3-0.5B-2512_RL for best quality
   - Consider fine-tuning for specific voices
4. Post-processing:
   - Apply noise reduction if needed
   - Normalize audio levels
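To act on point 1, a short torchaudio pass can downmix, resample, and level-normalize a reference clip before cloning; torchaudio is already a CosyVoice dependency. The target values reflect the guidelines above rather than repo constants, and the input path is a placeholder:

```python
import torchaudio

def prepare_prompt(path: str, target_sr: int = 16000, peak: float = 0.9):
    """Load a reference clip, downmix to mono, resample to 16kHz, peak-normalize."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0, keepdim=True)  # downmix stereo to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    wav = wav * (peak / wav.abs().max().clamp(min=1e-8))  # peak-normalize
    return wav

prompt_speech_16k = prepare_prompt('path/to/reference_audio.wav')
print(prompt_speech_16k.shape)  # (1, num_samples), ready for inference_zero_shot
```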
Q: Can I fine-tune CosyVoice on my own data?
A: Yes, the repository includes training recipes such as `examples/libritts/cosyvoice/run.sh`. Requirements:
- High-quality paired audio-text data
- GPU cluster (multi-GPU recommended)
- Familiarity with flow matching training
- See the paper for detailed training methodology
Q: What's the best deployment option for my use case?
A: Choose based on your requirements:
| Scenario | Recommended Setup | Rationale |
|---|---|---|
| Research/Testing | Web demo or Python API | Easy setup, full features |
| Small API (<100 req/day) | FastAPI + Docker | Simple deployment, good performance |
| Medium API (100-10K req/day) | vLLM + Load Balancer | 2-3x speedup, scalable |
| High-throughput (>10K req/day) | TensorRT-LLM + Kubernetes | 4x speedup, enterprise-grade |
| Real-time Streaming | CosyVoice2 + vLLM | Low latency, streaming support |
Q: How does CosyVoice compare to commercial TTS services?
A: Advantages over commercial services:
- ✅ Full control and customization
- ✅ No API costs or rate limits
- ✅ Data privacy (on-premise deployment)
- ✅ Access to model weights for research
Commercial services may offer:
- ⚡ Simpler integration
- 🔧 Managed infrastructure
- 📞 Enterprise support
For most technical teams, CosyVoice's performance and flexibility outweigh the setup complexity.
Additional Resources
Official Links
- GitHub Repository: https://github.com/FunAudioLLM/CosyVoice
- Paper (v3.0): https://arxiv.org/abs/2505.17589
- Demo Website: https://funaudiollm.github.io/cosyvoice3/
- Hugging Face: https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512
- ModelScope: https://www.modelscope.cn/studios/FunAudioLLM/Fun-CosyVoice3-0.5B
Community and Support
- GitHub Issues: Report bugs and request features
- DingTalk Group: Join the official Chinese community (QR code in repository)
- Research Papers: Read the academic papers for deep technical understanding
Related Projects
CosyVoice builds upon:
- FunASR - Automatic Speech Recognition
- FunCodec - Audio Codec
- Matcha-TTS - Flow Matching TTS
- WeNet - Speech Recognition Toolkit
Conclusion and Next Steps
Fun-CosyVoice 3.0 represents a significant advancement in open-source text-to-speech technology. Its combination of high accuracy (0.81% CER), broad language support (9 languages and 18+ Chinese dialects), and production-ready features (streaming, 150ms latency) makes it an excellent choice for both research and commercial applications.
Recommended Action Plan
1. Get Started (Week 1):
   - Install CosyVoice following the setup guide
   - Test the web demo to understand capabilities
   - Experiment with different models and modes
2. Evaluate (Weeks 2-3):
   - Test with your specific use cases
   - Benchmark performance on your hardware
   - Compare quality against your requirements
3. Deploy (Week 4+):
   - Choose the appropriate deployment method
   - Implement monitoring and logging
   - Optimize for your production workload
4. Optimize (Ongoing):
   - Fine-tune on domain-specific data if needed
   - Implement caching strategies
   - Scale infrastructure based on usage
Stay Updated
The CosyVoice project is actively maintained with regular updates. Check the roadmap in the GitHub repository for upcoming features and improvements.
Disclaimer: This guide is based on information available as of December 2025. Always refer to the official documentation for the most current information and best practices.