CosyVoice 2025 Complete Guide: The Ultimate Multi-lingual Text-to-Speech Solution
🎯 Core Highlights (TL;DR)
- State-of-the-art Performance: Fun-CosyVoice 3.0 achieves industry-leading content consistency (0.81% CER) and speaker similarity (77.4%) with only 0.5B parameters
- Extensive Language Support: Covers 9 major languages and 18+ Chinese dialects with zero-shot voice cloning capability
- Production-Ready Features: Bi-streaming support with ultra-low latency (150ms), pronunciation inpainting, and instruction-based control
- Open-Source & Scalable: Fully open-source with complete training/inference/deployment pipeline and multiple runtime options (vLLM, TensorRT-LLM)
Table of Contents
- What is CosyVoice?
- Key Features and Capabilities
- Model Versions Comparison
- Performance Benchmarks
- Installation and Setup
- Usage Guide
- Deployment Options
- Best Practices
- FAQ
What is CosyVoice?
CosyVoice is an advanced Large Language Model (LLM)-based Text-to-Speech (TTS) system developed by FunAudioLLM. It represents a significant leap in zero-shot multilingual speech synthesis technology, enabling natural voice generation across multiple languages without requiring extensive training data for each speaker.
Evolution Timeline
The CosyVoice family has evolved through three major versions:
- CosyVoice 1.0 (July 2024): Initial release with 300M parameters, establishing the foundation for scalable multilingual TTS
- CosyVoice 2.0 (December 2024): Introduced streaming capabilities with 0.5B parameters and enhanced LLM architecture
- Fun-CosyVoice 3.0 (December 2025): Current state-of-the-art with reinforcement learning optimization and in-the-wild speech generation
💡 Expert Insight
CosyVoice 3.0's use of supervised semantic tokens and flow matching training enables it to achieve human-like speech quality while maintaining computational efficiency, a critical balance for production deployments.
Key Features and Capabilities
🌍 Language Coverage
Supported Languages:
- 9 Major Languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
- 18+ Chinese Dialects: Guangdong (Cantonese), Minnan, Sichuan, Dongbei, Shaanxi (Shan3xi), Shanxi (Shan1xi), Shanghai, Tianjin, Shandong, Ningxia, Gansu, and more
Cross-lingual Capabilities:
- Zero-shot voice cloning across different languages (see the sketch below)
- Multi-lingual speech synthesis from single prompt
- Accent-preserving voice conversion
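A minimal sketch of cross-lingual cloning, using the `inference_cross_lingual` API that the repo documents for the CosyVoice 2 class (the Fun-CosyVoice 3 entry point may differ; check the current README). The prompt audio supplies the voice, while the text can be in another language; the file path here is a placeholder:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')

# Reference clip in the source language (placeholder path), loaded as a 16kHz waveform
prompt_speech_16k = load_wav('path/to/chinese_reference.wav', 16000)

# English text rendered in the Chinese speaker's voice
for i, output in enumerate(cosyvoice.inference_cross_lingual(
    'And then, against all odds, the prototype worked on the very first try.',
    prompt_speech_16k
)):
    torchaudio.save(f'cross_lingual_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
```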
🎯 Advanced Technical Features
| Feature | Description | Use Case |
|---|---|---|
| Pronunciation Inpainting | Support for Chinese Pinyin and English CMU phonemes | Precise control over pronunciation for brand names, technical terms |
| Bi-Streaming | Text-in and audio-out streaming | Real-time applications with 150ms latency |
| Instruct Support | Control language, dialect, emotion, speed, volume | Dynamic voice customization (see the sketch below) |
| Text Normalization | Automatic handling of numbers, symbols, formats | No frontend module required |
| RAS Inference | Repetition Aware Sampling for LLM stability | Prevents audio artifacts and repetitions |
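As a concrete illustration of the Instruct Support row, CosyVoice 2 documents an `inference_instruct2` call that takes a natural-language instruction alongside prompt audio; whether Fun-CosyVoice 3 keeps this exact signature should be verified against the repo. A hedged sketch, assuming a loaded CosyVoice 2 model and a 16kHz prompt waveform:

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech_16k = load_wav('path/to/reference_audio.wav', 16000)  # placeholder path

# The instruction ("say this in Sichuan dialect") controls dialect; emotion and speed work similarly
for i, output in enumerate(cosyvoice.inference_instruct2(
    '收到好友从远方寄来的生日礼物，那份意外的惊喜让我心中充满了甜蜜的快乐。',
    '用四川话说这句话',
    prompt_speech_16k
)):
    torchaudio.save(f'instruct_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
```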
📊 Performance Characteristics
- Latency: as low as 150ms (streaming mode)
- Model Size: 0.5B parameters (Fun-CosyVoice3)
- Token Rate: 25Hz supervised semantic tokens (the tokenizer's frame rate, not the audio sample rate)
- Streaming: KV cache + SDPA optimization
- Acceleration: up to 4x speedup with TensorRT-LLM
⚠️ Important Note
While CosyVoice 3.0 offers impressive capabilities, optimal performance requires GPU acceleration. CPU-only inference may result in significantly slower generation times.
Model Versions Comparison
Available Models
| Model | Parameters | Best For | Key Advantage |
|---|---|---|---|
| Fun-CosyVoice3-0.5B-2512 | 0.5B | Production use, best overall quality | SOTA base model; highest speaker similarity in the family |
| Fun-CosyVoice3-0.5B-2512_RL | 0.5B | Maximum accuracy | Lowest CER (0.81%) and WER (1.68%) |
| CosyVoice2-0.5B | 0.5B | Streaming applications | Optimized for real-time synthesis |
| CosyVoice-300M | 300M | Resource-constrained environments | Smaller footprint, good quality |
| CosyVoice-300M-SFT | 300M | Supervised fine-tuning tasks | Pre-trained for specific voice styles |
| CosyVoice-300M-Instruct | 300M | Instruction-based synthesis | Enhanced control capabilities |
Version Selection Guide
For a quick decision rule, see the Model Selection Strategy framework under Best Practices below: the 2512_RL checkpoint maximizes accuracy, the 2512 base scores highest on speaker similarity, and CosyVoice2-0.5B remains the choice for streaming.
Performance Benchmarks
Comprehensive Evaluation Results
The following table compares Fun-CosyVoice 3.0 against leading open-source and closed-source TTS systems:
| Model | Open-Source | Size | test-zh CER (%) ↓ | test-zh Speaker Sim (%) ↑ | test-en WER (%) ↓ | test-en Speaker Sim (%) ↑ | test-hard CER (%) ↓ | test-hard Speaker Sim (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
Key Performance Insights
✅ Best Practices
- Content Accuracy: Fun-CosyVoice3 RL achieves 0.81% CER on Chinese test set, outperforming models 3x larger
- Speaker Similarity: the 78.0% score on the Chinese test set exceeds even the human reference (75.5%)
- Challenging Scenarios: 5.44% CER on hard test set demonstrates robust handling of complex speech patterns
- Efficiency: Achieves SOTA results with only 0.5B parameters vs. 1.5B+ competitors
Installation and Setup
Prerequisites
- Operating System: Linux (Ubuntu/CentOS recommended)
- Python Version: 3.10
- GPU: NVIDIA GPU with CUDA support (recommended for optimal performance)
- Conda: Miniconda or Anaconda
Step-by-Step Installation
1. Clone the Repository
```bash
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# If submodule cloning fails due to network issues
git submodule update --init --recursive
```
2. Create Conda Environment
```bash
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```
3. Install System Dependencies
```bash
# Ubuntu
sudo apt-get install sox libsox-dev

# CentOS
sudo yum install sox sox-devel
```
4. Download Pre-trained Models
For Hugging Face Users (Recommended for International Users):
```python
from huggingface_hub import snapshot_download

# Download Fun-CosyVoice 3.0 (recommended)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')

# Download CosyVoice 2.0
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')

# Download text normalization resources
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
For ModelScope Users (China Region):
```python
from modelscope import snapshot_download

snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
5. Optional: Install Enhanced Text Normalization
For improved text normalization (especially for Chinese):
```bash
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
💡 Pro Tip
If you skip the ttsfrd installation, CosyVoice will automatically fall back to WeTextProcessing. While functional, ttsfrd provides better accuracy for number and symbol normalization.
Usage Guide
Quick Start with Web Demo
The fastest way to experience CosyVoice:
```bash
# Launch web interface
python3 webui.py --port 50000 --model_dir pretrained_models/Fun-CosyVoice3-0.5B

# For instruct mode
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct
```
Access the interface at http://localhost:50000
Python API Usage
Basic Inference Example
```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Initialize model (use the CosyVoice2 class for CosyVoice2-0.5B checkpoints)
cosyvoice = CosyVoice('pretrained_models/Fun-CosyVoice3-0.5B')

# Zero-shot voice cloning: the reference audio is loaded as a 16kHz waveform
prompt_speech_16k = load_wav('path/to/reference_audio.wav', 16000)
text = "Hello, this is a test of CosyVoice zero-shot synthesis."

for i, output in enumerate(cosyvoice.inference_zero_shot(
    text,
    "Reference text spoken in the audio",  # exact transcript of the prompt audio
    prompt_speech_16k
)):
    torchaudio.save(f'zero_shot_{i}.wav', output['tts_speech'], cosyvoice.sample_rate)
```
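Streaming synthesis is what delivers the ~150ms first-packet latency quoted earlier: with `stream=True` (documented for CosyVoice 2), the generator yields short audio chunks as they are produced. A sketch reusing the objects from the example above:

```python
import torch
import torchaudio

chunks = []
for output in cosyvoice.inference_zero_shot(
    text,
    "Reference text spoken in the audio",
    prompt_speech_16k,
    stream=True  # yield audio incrementally instead of one whole utterance
):
    chunk = output['tts_speech']  # shape (1, samples); play or forward immediately in a real app
    chunks.append(chunk)

# Optionally stitch the chunks back into one file
torchaudio.save('streamed.wav', torch.concat(chunks, dim=1), cosyvoice.sample_rate)
```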
Advanced Usage: Instruction-Based Synthesis
```python
# Control emotion, speed, and other delivery via a natural-language instruction.
# Note: inference_instruct targets the CosyVoice-300M-Instruct model and takes a
# pretrained speaker id; CosyVoice 2 uses inference_instruct2 with prompt audio.
for output in cosyvoice.inference_instruct(
    "Your text here",
    "中文女",  # spk_id: one of the model's pretrained speakers
    "Speak with excitement at a moderate pace"
):
    # Process output['tts_speech']
    pass
```
vLLM Acceleration (CosyVoice 2.0)
For maximum inference speed with CosyVoice 2.0:
Setup vLLM Environment
```bash
# Create a separate environment for vLLM
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
pip install vllm==v0.9.0 transformers==4.51.3
```
Run vLLM Inference
```bash
python vllm_example.py
```
⚠️ Compatibility Note
vLLM v0.9.0 requires specific versions of PyTorch (2.7.0) and Transformers (4.51.3). Ensure your hardware supports these requirements before installation.
Deployment Options
Docker Deployment (Recommended for Production)
gRPC Server Deployment
```bash
cd runtime/python
docker build -t cosyvoice:v1.0 .

# Launch gRPC server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && \
  python3 server.py --port 50000 --max_conc 4 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with client
cd grpc
python3 client.py --port 50000 --mode zero_shot
```
FastAPI Server Deployment
```bash
# Launch FastAPI server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && \
  python3 server.py --port 50000 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with client
cd fastapi
python3 client.py --port 50000 --mode sft
```
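If you prefer a plain HTTP call over the bundled client, something like the following can work against the FastAPI runtime. Treat it strictly as a sketch: the endpoint name, form fields, streamed 16-bit PCM response, and 24kHz sample rate are assumptions based on runtime/python/fastapi/server.py, so verify them against your checkout.

```python
import numpy as np
import requests
import torch
import torchaudio

# Assumed endpoint and form fields; confirm in runtime/python/fastapi/server.py
resp = requests.post(
    'http://localhost:50000/inference_sft',
    data={'tts_text': 'Hello from the FastAPI runtime.', 'spk_id': '中文女'},
    stream=True,
)
resp.raise_for_status()

# Assumes the server streams raw 16-bit PCM audio
pcm = b''.join(resp.iter_content(chunk_size=16000))
audio = torch.from_numpy(
    np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / (2 ** 15)
).unsqueeze(0)
torchaudio.save('fastapi_output.wav', audio, 24000)  # assumed output sample rate
```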
TensorRT-LLM Deployment (4x Acceleration)
For maximum performance with CosyVoice 2.0:
```bash
cd runtime/triton_trtllm
docker compose up -d
```
Performance Comparison:
| Runtime | Relative Speed | Use Case |
|---|---|---|
| HuggingFace Transformers | 1x (baseline) | Development, testing |
| vLLM | 2-3x | Production with moderate load |
| TensorRT-LLM | 4x | High-throughput production |
✅ Deployment Best Practice
- Development: Use web demo or Python API
- Small-scale Production: FastAPI with Docker
- Large-scale Production: TensorRT-LLM with load balancing
- Real-time Applications: vLLM or TensorRT-LLM with streaming
Best Practices
Model Selection Strategy
📋 Decision Framework:
1. Quality Priority → Fun-CosyVoice3-0.5B-2512_RL
2. Balanced Performance → Fun-CosyVoice3-0.5B-2512
3. Real-time Streaming → CosyVoice2-0.5B + vLLM
4. Resource Constraints → CosyVoice-300M
5. Custom Control → CosyVoice-300M-Instruct
Optimization Tips
For Low Latency
- Enable Streaming Mode: Use bi-streaming for text-in and audio-out
- KV Cache: Ensure KV cache is enabled in inference config
- SDPA Optimization: Utilize Scaled Dot-Product Attention
- Batch Processing: Group similar-length inputs (see the sketch below)
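Bucketing inputs by length keeps padding waste low when you batch requests. A minimal, framework-agnostic sketch; the 50-character bucket width is an arbitrary illustration, not a value from the CosyVoice repo:

```python
from itertools import groupby

def bucket_by_length(texts, bucket_chars=50):
    """Group texts whose lengths fall in the same bucket so batches pad minimally."""
    ordered = sorted(texts, key=len)
    return [list(g) for _, g in groupby(ordered, key=lambda t: len(t) // bucket_chars)]

for batch in bucket_by_length([
    "Hi.",
    "Short line.",
    "A somewhat longer sentence that belongs in a different bucket entirely.",
]):
    print(batch)  # submit each bucket to the TTS backend as one batch
```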
For High Quality
- Use RL Model: Fun-CosyVoice3-0.5B-2512_RL for maximum accuracy
- Provide Clear Prompts: High-quality reference audio (3-10 seconds)
- Text Normalization: Install ttsfrd for better preprocessing
- Pronunciation Control: Use pinyin/phoneme inpainting for critical terms
For Multilingual Applications
- Language-Specific Prompts: Provide reference audio in target language
- Cross-lingual Cloning: Use instruct mode to specify target language
- Dialect Support: Leverage 18+ Chinese dialect capabilities
- Mixed Language: Segment text by language for optimal results (see the sketch below)
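A simple way to segment mixed Chinese/English input is a regex split on script boundaries before sending each run to synthesis. This is an illustrative helper, not part of the CosyVoice API, and its punctuation handling is deliberately naive:

```python
import re

# Runs of CJK characters (plus common fullwidth punctuation) vs. everything else
_SEGMENT = re.compile(
    r'[\u4e00-\u9fff\uff0c\u3002\uff01\uff1f\u3001]+'
    r'|[^\u4e00-\u9fff\uff0c\u3002\uff01\uff1f\u3001]+'
)

def split_by_script(text: str) -> list[str]:
    """Split text into alternating CJK / non-CJK runs for per-language synthesis."""
    return [m.group(0).strip() for m in _SEGMENT.finditer(text) if m.group(0).strip()]

print(split_by_script("今天的会议 covers the Q3 roadmap，请大家准时参加。"))
# ['今天的会议', 'covers the Q3 roadmap', '，请大家准时参加。']
```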
Common Pitfalls to Avoid
⚠️ Warning: Common Issues
- Insufficient GPU Memory: 0.5B models require ~8GB VRAM minimum
- Poor Reference Audio: Background noise or multiple speakers degrade cloning
- Text Format Issues: Ensure proper encoding (UTF-8) for non-English text
- Version Mismatch: vLLM compatibility requires specific package versions
- Network Timeouts: Use ModelScope mirrors in China region
🤔 Frequently Asked Questions
Q: What's the difference between CosyVoice 2.0 and 3.0?
A: Fun-CosyVoice 3.0 introduces several key improvements:
- Reinforcement Learning Optimization: RL-trained model achieves 0.81% CER vs. 1.45% in v2.0
- Enhanced Naturalness: Improved prosody and speaker similarity through post-training
- In-the-wild Performance: Better handling of challenging real-world scenarios (5.44% vs. 6.83% CER on hard test set)
- Pronunciation Control: Advanced pinyin/phoneme inpainting capabilities
Q: Can I use CosyVoice for commercial applications?
A: Yes, CosyVoice is open-source and available for commercial use. However:
- Review the license terms in the GitHub repository
- Ensure compliance with voice cloning regulations in your jurisdiction
- The disclaimer states content is for academic purposes; verify production use rights
- Consider ethical implications of voice cloning technology
Q: How much GPU memory do I need?
A: Memory requirements vary by model:
- CosyVoice-300M: ~4-6GB VRAM
- CosyVoice2-0.5B: ~6-8GB VRAM
- Fun-CosyVoice3-0.5B: ~8-10GB VRAM
- Batch Inference: Add 2-4GB per additional concurrent request
For CPU-only inference, expect 16GB+ RAM and significantly slower speeds (10-50x slower).
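To check what your GPU offers before picking a model (plain PyTorch, nothing CosyVoice-specific):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; expect CPU-only inference to be far slower.")
```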
Q: Which languages are best supported?
A: Based on evaluation data:
- Excellent: Chinese (Mandarin), English
- Very Good: Japanese, Korean
- Good: German, Spanish, French, Italian, Russian
- Dialects: 18+ Chinese dialects with varying quality
English and Chinese have the most extensive training data and achieve the best results.
Q: How do I improve voice cloning quality?
A: Follow these guidelines (a preprocessing sketch follows the list):
1. Reference Audio Quality:
   - Duration: 3-10 seconds optimal
   - Single speaker only
   - Clear speech, minimal background noise
   - Natural speaking pace
2. Prompt Text Accuracy:
   - Provide the exact transcription of the reference audio
   - Match language and dialect
3. Model Selection:
   - Use Fun-CosyVoice3-0.5B-2512_RL for best quality
   - Consider fine-tuning for specific voices
4. Post-processing:
   - Apply noise reduction if needed
   - Normalize audio levels
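To act on point 1, a short torchaudio pass can downmix, resample, and level-normalize a reference clip before cloning; torchaudio is already a CosyVoice dependency. The target values reflect the guidelines above rather than repo constants, and the input path is a placeholder:

```python
import torchaudio

def prepare_prompt(path: str, target_sr: int = 16000, peak: float = 0.9):
    """Load a reference clip, downmix to mono, resample to 16kHz, peak-normalize."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0, keepdim=True)  # downmix stereo to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    wav = wav * (peak / wav.abs().max().clamp(min=1e-8))  # peak-normalize
    return wav

prompt_speech_16k = prepare_prompt('path/to/reference_audio.wav')
print(prompt_speech_16k.shape)  # (1, num_samples), ready for inference_zero_shot
```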
Q: Can I fine-tune CosyVoice on my own data?
A: Yes, the repository includes training recipes such as `examples/libritts/cosyvoice/run.sh`. Requirements:
- High-quality paired audio-text data
- GPU cluster (multi-GPU recommended)
- Familiarity with flow matching training
- See the paper for detailed training methodology
Q: What's the best deployment option for my use case?
A: Choose based on your requirements:
| Scenario | Recommended Setup | Rationale |
|---|---|---|
| Research/Testing | Web demo or Python API | Easy setup, full features |
| Small API (<100 req/day) | FastAPI + Docker | Simple deployment, good performance |
| Medium API (100-10K req/day) | vLLM + Load Balancer | 2-3x speedup, scalable |
| High-throughput (>10K req/day) | TensorRT-LLM + Kubernetes | 4x speedup, enterprise-grade |
| Real-time Streaming | CosyVoice2 + vLLM | Low latency, streaming support |
Q: How does CosyVoice compare to commercial TTS services?
A: Advantages over commercial services:
- ✅ Full control and customization
- ✅ No API costs or rate limits
- ✅ Data privacy (on-premise deployment)
- ✅ Access to model weights for research
Commercial services may offer:
- ⚡ Simpler integration
- 🔧 Managed infrastructure
- 📞 Enterprise support
For most technical teams, CosyVoice's performance and flexibility outweigh the setup complexity.
Additional Resources
Official Links
- GitHub Repository: https://github.com/FunAudioLLM/CosyVoice
- Paper (v3.0): https://arxiv.org/abs/2505.17589
- Demo Website: https://funaudiollm.github.io/cosyvoice3/
- Hugging Face: https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512
- ModelScope: https://www.modelscope.cn/studios/FunAudioLLM/Fun-CosyVoice3-0.5B
Community and Support
- GitHub Issues: Report bugs and request features
- DingTalk Group: Join the official Chinese community (QR code in repository)
- Research Papers: Read the academic papers for deep technical understanding
Related Projects
CosyVoice builds upon:
- FunASR - Automatic Speech Recognition
- FunCodec - Audio Codec
- Matcha-TTS - Flow Matching TTS
- WeNet - Speech Recognition Toolkit
Conclusion and Next Steps
Fun-CosyVoice 3.0 represents a significant advancement in open-source text-to-speech technology. Its combination of high accuracy (0.81% CER), broad language support (9 languages and 18+ Chinese dialects), and production-ready features (streaming, 150ms latency) makes it an excellent choice for both research and commercial applications.
Recommended Action Plan
1. Get Started (Week 1):
   - Install CosyVoice following the setup guide
   - Test the web demo to understand capabilities
   - Experiment with different models and modes
2. Evaluate (Weeks 2-3):
   - Test with your specific use cases
   - Benchmark performance on your hardware
   - Compare quality against your requirements
3. Deploy (Week 4+):
   - Choose the appropriate deployment method
   - Implement monitoring and logging
   - Optimize for your production workload
4. Optimize (Ongoing):
   - Fine-tune on domain-specific data if needed
   - Implement caching strategies
   - Scale infrastructure based on usage
Stay Updated
The CosyVoice project is actively maintained with regular updates. Check the roadmap in the GitHub repository for upcoming features and improvements.
Disclaimer: This guide is based on information available as of December 2025. Always refer to the official documentation for the most current information and best practices.