
CosyVoice 2025 Complete Guide: The Ultimate Multi-lingual Text-to-Speech Solution

🎯 Core Highlights (TL;DR)

  • State-of-the-art Performance: Fun-CosyVoice 3.0 achieves industry-leading content consistency (0.81% CER) and speaker similarity (77.4%) with only 0.5B parameters
  • Extensive Language Support: Covers 9 major languages and 18+ Chinese dialects with zero-shot voice cloning capability
  • Production-Ready Features: Bi-streaming support with ultra-low latency (150ms), pronunciation inpainting, and instruction-based control
  • Open-Source & Scalable: Fully open-source with complete training/inference/deployment pipeline and multiple runtime options (vLLM, TensorRT-LLM)

Table of Contents

  1. What is CosyVoice?
  2. Key Features and Capabilities
  3. Model Versions Comparison
  4. Performance Benchmarks
  5. Installation and Setup
  6. Usage Guide
  7. Deployment Options
  8. Best Practices
  9. FAQ

What is CosyVoice?

CosyVoice is an advanced Large Language Model (LLM)-based Text-to-Speech (TTS) system developed by FunAudioLLM. It represents a significant leap in zero-shot multilingual speech synthesis technology, enabling natural voice generation across multiple languages without requiring extensive training data for each speaker.

Evolution Timeline

The CosyVoice family has evolved through three major versions:

  • CosyVoice 1.0 (July 2024): Initial release with 300M parameters, establishing the foundation for scalable multilingual TTS
  • CosyVoice 2.0 (December 2024): Introduced streaming capabilities with 0.5B parameters and enhanced LLM architecture
  • Fun-CosyVoice 3.0 (December 2025): Current state-of-the-art with reinforcement learning optimization and in-the-wild speech generation

💡 Expert Insight
CosyVoice 3.0's use of supervised semantic tokens and flow matching training enables it to achieve human-like speech quality while maintaining computational efficiency, a critical balance for production deployments.

Key Features and Capabilities

🌍 Language Coverage

Supported Languages:

  • 9 Major Languages: Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian
  • 18+ Chinese Dialects: Cantonese (Guangdong), Minnan, Sichuan, Dongbei, Shaanxi, Shanxi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, and more

Cross-lingual Capabilities:

  • Zero-shot voice cloning across different languages
  • Multi-lingual speech synthesis from a single prompt
  • Accent-preserving voice conversion

🎯 Advanced Technical Features

| Feature | Description | Use Case |
|---|---|---|
| Pronunciation Inpainting | Support for Chinese pinyin and English CMU phonemes | Precise pronunciation control for brand names and technical terms |
| Bi-Streaming | Text-in and audio-out streaming | Real-time applications with 150 ms latency |
| Instruct Support | Control language, dialect, emotion, speed, volume | Dynamic voice customization |
| Text Normalization | Automatic handling of numbers, symbols, formats | No separate frontend module required |
| RAS Inference | Repetition Aware Sampling for LLM stability | Prevents audio artifacts and repetitions |
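To make the RAS row concrete, here is a minimal sketch of the repetition-aware idea: sample normally, but fall back to a uniform draw over the top-k candidates when the sampled token already dominates the recent history. The `window`, `tau`, and `k` values below are illustrative choices, not the repository's exact implementation.

```python
import torch

def ras_sample(logits, history, window=10, tau=0.1, k=25):
    """Repetition Aware Sampling, sketched: if the sampled token already
    makes up more than `tau` of the last `window` tokens, resample
    uniformly from the top-k set to break the repetition loop."""
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(k)
    cand = topk_ids[torch.multinomial(topk_probs, 1)]   # probability-weighted draw
    recent = history[-window:]
    rep_ratio = sum(t == cand.item() for t in recent) / max(len(recent), 1)
    if rep_ratio > tau:
        cand = topk_ids[torch.randint(k, (1,))]         # uniform fallback draw
    return cand.item()
```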

🚀 Performance Characteristics

  • Latency: As low as 150 ms (streaming mode)
  • Model Size: 0.5B parameters (Fun-CosyVoice3)
  • Token Rate: 25 Hz supervised semantic tokens (this figure is the token rate, not the audio sample rate)
  • Streaming: KV cache + SDPA optimization (see the snippet below)
  • Acceleration: 4x speedup with TensorRT-LLM
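As a concrete illustration of the KV cache + SDPA point, the snippet below runs one incremental decoding step with PyTorch's built-in scaled dot-product attention. The shapes are invented for the example and are not CosyVoice's actual dimensions.

```python
import torch
import torch.nn.functional as F

# One incremental decoding step: a single new query attends over all
# keys/values accumulated so far in the KV cache (shapes are illustrative).
q = torch.randn(1, 8, 1, 64)          # (batch, heads, new tokens, head_dim)
k_cache = torch.randn(1, 8, 128, 64)  # keys for the 128 tokens decoded so far
v_cache = torch.randn(1, 8, 128, 64)  # matching cached values

# SDPA dispatches to fused attention kernels when available, which keeps
# per-step cost low during streaming synthesis.
out = F.scaled_dot_product_attention(q, k_cache, v_cache)
print(out.shape)  # torch.Size([1, 8, 1, 64])
```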

⚠️ Important Note
While CosyVoice 3.0 offers impressive capabilities, optimal performance requires GPU acceleration. CPU-only inference may result in significantly slower generation times.

Model Versions Comparison

Available Models

| Model | Parameters | Best For | Key Advantage |
|---|---|---|---|
| Fun-CosyVoice3-0.5B-2512 | 0.5B | Production use, best overall quality | SOTA performance with RL optimization |
| Fun-CosyVoice3-0.5B-2512_RL | 0.5B | Maximum accuracy | Lowest CER (0.81%) and WER (1.68%) |
| CosyVoice2-0.5B | 0.5B | Streaming applications | Optimized for real-time synthesis |
| CosyVoice-300M | 300M | Resource-constrained environments | Smaller footprint, good quality |
| CosyVoice-300M-SFT | 300M | Supervised fine-tuning tasks | Pre-trained for specific voice styles |
| CosyVoice-300M-Instruct | 300M | Instruction-based synthesis | Enhanced control capabilities |

Version Selection Guide

For a quick decision path between these models, see the Model Selection Strategy under Best Practices later in this guide.

Performance Benchmarks

Comprehensive Evaluation Results

The following table compares Fun-CosyVoice 3.0 against leading open-source and closed-source TTS systems:

| Model | Open-Source | Size | test-zh CER (%) ↓ | test-zh Speaker Sim (%) ↑ | test-en WER (%) ↓ | test-en Speaker Sim (%) ↑ | test-hard CER (%) ↓ | test-hard Speaker Sim (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
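For readers unfamiliar with the metrics: CER (and WER, at the word level) is the Levenshtein edit distance between the transcript of the synthesized audio and the reference text, divided by the reference length. A self-contained illustration:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via single-row dynamic programming."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def cer(ref, hyp):
    """Character Error Rate: edits needed / reference length."""
    return edit_distance(ref, hyp) / len(ref)

print(cer("hello world", "hello word"))  # 1 edit / 11 reference chars ≈ 0.091
```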

Key Performance Insights

✅ Best Practices

  • Content Accuracy: Fun-CosyVoice3 RL achieves 0.81% CER on Chinese test set, outperforming models 3x larger
  • Speaker Similarity: the base 3.0 model's 78.0% score on test-zh surpasses the human reference (75.5%)
  • Challenging Scenarios: 5.44% CER on hard test set demonstrates robust handling of complex speech patterns
  • Efficiency: Achieves SOTA results with only 0.5B parameters vs. 1.5B+ competitors

Installation and Setup

Prerequisites

  • Operating System: Linux (Ubuntu/CentOS recommended)
  • Python Version: 3.10
  • GPU: NVIDIA GPU with CUDA support (recommended for optimal performance)
  • Conda: Miniconda or Anaconda

Step-by-Step Installation

1. Clone the Repository

```bash
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# If submodule cloning fails due to network issues, retry with:
git submodule update --init --recursive
```

2. Create Conda Environment

```bash
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
```

3. Install System Dependencies

```bash
# Ubuntu
sudo apt-get install sox libsox-dev

# CentOS
sudo yum install sox sox-devel
```

4. Download Pre-trained Models

For Hugging Face Users (Recommended for International Users):

```python
from huggingface_hub import snapshot_download

# Download Fun-CosyVoice 3.0 (recommended)
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
# Download CosyVoice 2.0
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
# Download text normalization resources
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

For ModelScope Users (China Region):

```python
from modelscope import snapshot_download

snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```

5. Optional: Install Enhanced Text Normalization

For improved text normalization (especially for Chinese):

```bash
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```

💡 Pro Tip
If you skip the ttsfrd installation, CosyVoice will automatically fall back to WeTextProcessing. While functional, ttsfrd provides better accuracy for number and symbol normalization.
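The fallback the tip describes looks roughly like this in user code; the try/except pattern mirrors what the repository's frontend does, though the exact wiring inside CosyVoice differs.

```python
# Prefer ttsfrd when installed; otherwise use WeTextProcessing's
# Chinese normalizer (imported from the `tn` package). Illustrative wiring.
try:
    import ttsfrd
    USE_TTSFRD = True
except ImportError:
    from tn.chinese.normalizer import Normalizer as ZhNormalizer
    USE_TTSFRD = False

print("text frontend:", "ttsfrd" if USE_TTSFRD else "WeTextProcessing")
```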

Usage Guide

Quick Start with Web Demo

The fastest way to experience CosyVoice:

```bash
# Launch web interface
python3 webui.py --port 50000 --model_dir pretrained_models/Fun-CosyVoice3-0.5B

# For instruct mode
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M-Instruct
```

Access the interface at http://localhost:50000

Python API Usage

Basic Inference Example

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav

# Initialize the model
cosyvoice = CosyVoice('pretrained_models/Fun-CosyVoice3-0.5B')

# Zero-shot voice cloning: the reference prompt is loaded as a 16 kHz waveform
prompt_speech_16k = load_wav('path/to/reference_audio.wav', 16000)
text = "Hello, this is a test of CosyVoice zero-shot synthesis."

for i, result in enumerate(cosyvoice.inference_zero_shot(
        text, 'Reference text spoken in the audio', prompt_speech_16k)):
    # Each result carries a 'tts_speech' tensor; save it to disk
    torchaudio.save(f'zero_shot_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```

Advanced Usage: Instruction-Based Synthesis

```python
# Control emotion, speed, and other delivery parameters
# (requires an instruct-capable model, e.g. CosyVoice-300M-Instruct)
for i, result in enumerate(cosyvoice.inference_instruct(
        'Your text here',
        '中文男',  # a built-in speaker ID for the SFT/Instruct models
        'Speak with excitement at a moderate pace.')):
    torchaudio.save(f'instruct_{i}.wav', result['tts_speech'], cosyvoice.sample_rate)
```
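The bi-streaming feature mentioned earlier is exposed through a stream flag on the inference calls. The example below follows the CosyVoice 2 zero-shot API; the chunk handling is kept deliberately simple.

```python
import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B')
prompt_speech_16k = load_wav('path/to/reference_audio.wav', 16000)

# stream=True yields audio chunks as they are decoded instead of one
# final waveform, enabling the ~150 ms first-packet latency.
for i, chunk in enumerate(cosyvoice.inference_zero_shot(
        'Streaming synthesis test.',
        'Reference text spoken in the audio',
        prompt_speech_16k,
        stream=True)):
    torchaudio.save(f'stream_chunk_{i}.wav', chunk['tts_speech'], cosyvoice.sample_rate)
```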

vLLM Acceleration (CosyVoice 2.0)

For maximum inference speed with CosyVoice 2.0:

Setup vLLM Environment

```bash
# Create a separate environment for vLLM
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
pip install vllm==v0.9.0 transformers==4.51.3
```

Run vLLM Inference

```bash
python vllm_example.py
```

⚠️ Compatibility Note
vLLM v0.9.0 requires specific versions of PyTorch (2.7.0) and Transformers (4.51.3). Ensure your hardware supports these requirements before installation.

Deployment Options

Docker Deployment (Recommended for Production)

gRPC Server Deployment

```bash
cd runtime/python
docker build -t cosyvoice:v1.0 .

# Launch gRPC server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && \
  python3 server.py --port 50000 --max_conc 4 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with the client
cd grpc
python3 client.py --port 50000 --mode zero_shot
```

FastAPI Server Deployment

```bash
# Launch FastAPI server
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \
  /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && \
  python3 server.py --port 50000 \
  --model_dir pretrained_models/Fun-CosyVoice3-0.5B && sleep infinity"

# Test with the client
cd fastapi
python3 client.py --port 50000 --mode sft
```
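If you would rather call the FastAPI server from your own code than use the bundled client, a minimal request looks something like the sketch below. The route and form-field names are assumptions for illustration; check runtime/python/fastapi/server.py and client.py for the actual contract.

```python
import requests

url = "http://localhost:50000/inference_sft"                       # assumed route
payload = {"tts_text": "Hello from the API.", "spk_id": "中文女"}  # assumed fields

# Stream the audio bytes to disk as the server produces them.
with requests.post(url, data=payload, stream=True) as resp:
    resp.raise_for_status()
    with open("output.pcm", "wb") as f:
        for chunk in resp.iter_content(chunk_size=16000):
            f.write(chunk)
```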

TensorRT-LLM Deployment (4x Acceleration)

For maximum performance with CosyVoice 2.0:

```bash
cd runtime/triton_trtllm
docker compose up -d
```

Performance Comparison:

| Runtime | Relative Speed | Use Case |
|---|---|---|
| HuggingFace Transformers | 1x (baseline) | Development, testing |
| vLLM | 2-3x | Production with moderate load |
| TensorRT-LLM | 4x | High-throughput production |

✅ Deployment Best Practice

  • Development: Use web demo or Python API
  • Small-scale Production: FastAPI with Docker
  • Large-scale Production: TensorRT-LLM with load balancing
  • Real-time Applications: vLLM or TensorRT-LLM with streaming

Best Practices

Model Selection Strategy

📊 Decision Framework:

1. Quality Priority → Fun-CosyVoice3-0.5B-2512_RL
2. Balanced Performance → Fun-CosyVoice3-0.5B-2512
3. Real-time Streaming → CosyVoice2-0.5B + vLLM
4. Resource Constraints → CosyVoice-300M
5. Custom Control → CosyVoice-300M-Instruct

Optimization Tips

For Low Latency

  1. Enable Streaming Mode: Use bi-streaming for text-in and audio-out
  2. KV Cache: Ensure KV cache is enabled in inference config
  3. SDPA Optimization: Utilize Scaled Dot-Product Attention
  4. Batch Processing: Group similar-length inputs (see the sketch after this list)
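Tip 4 can be as simple as sorting by length before batching. A generic sketch, independent of any CosyVoice API:

```python
def bucket_by_length(texts, batch_size=8):
    """Group inputs of similar length so each batch finishes together,
    avoiding short requests waiting on one long straggler."""
    ordered = sorted(texts, key=len)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

print(bucket_by_length(["hi", "a much longer sentence to synthesize", "medium one"], 2))
# [['hi', 'medium one'], ['a much longer sentence to synthesize']]
```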

For High Quality

  1. Use RL Model: Fun-CosyVoice3-0.5B-2512_RL for maximum accuracy
  2. Provide Clear Prompts: High-quality reference audio (3-10 seconds)
  3. Text Normalization: Install ttsfrd for better preprocessing
  4. Pronunciation Control: Use pinyin/phoneme inpainting for critical terms

For Multilingual Applications

  1. Language-Specific Prompts: Provide reference audio in target language
  2. Cross-lingual Cloning: Use instruct mode to specify target language
  3. Dialect Support: Leverage 18+ Chinese dialect capabilities
  4. Mixed Language: Segment text by language for optimal results (see the sketch below)
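For tip 4, a crude but workable way to segment mixed Chinese/English text is to split on script boundaries. This is an illustrative helper, not part of CosyVoice:

```python
import re

# Runs of CJK characters (plus CJK punctuation and fullwidth forms) are
# separated from runs of everything else, so each segment can be sent to
# synthesis with a language-appropriate prompt.
CJK = r'\u4e00-\u9fff\u3000-\u303f\uff00-\uffef'

def split_by_script(text):
    runs = re.findall(rf'[{CJK}]+|[^{CJK}]+', text)
    return [r.strip() for r in runs if r.strip()]

print(split_by_script("今天我们来聊聊 text-to-speech 的最新进展。"))
# ['今天我们来聊聊', 'text-to-speech', '的最新进展。']
```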

Common Pitfalls to Avoid

⚠️ Warning: Common Issues

  1. Insufficient GPU Memory: 0.5B models require ~8GB VRAM minimum
  2. Poor Reference Audio: Background noise or multiple speakers degrade cloning
  3. Text Format Issues: Ensure proper encoding (UTF-8) for non-English text
  4. Version Mismatch: vLLM compatibility requires specific package versions
  5. Network Timeouts: Use ModelScope mirrors in China region

🤔 Frequently Asked Questions

Q: What's the difference between CosyVoice 2.0 and 3.0?

A: Fun-CosyVoice 3.0 introduces several key improvements:

  • Reinforcement Learning Optimization: RL-trained model achieves 0.81% CER vs. 1.45% in v2.0
  • Enhanced Naturalness: Improved prosody and speaker similarity through post-training
  • In-the-wild Performance: Better handling of challenging real-world scenarios (5.44% vs. 6.83% CER on hard test set)
  • Pronunciation Control: Advanced pinyin/phoneme inpainting capabilities

Q: Can I use CosyVoice for commercial applications?

A: Yes, CosyVoice is open-source and available for commercial use. However:

  • Review the license terms in the GitHub repository
  • Ensure compliance with voice cloning regulations in your jurisdiction
  • The disclaimer states content is for academic purposes; verify production use rights
  • Consider ethical implications of voice cloning technology

Q: How much GPU memory do I need?

A: Memory requirements vary by model:

  • CosyVoice-300M: ~4-6GB VRAM
  • CosyVoice2-0.5B: ~6-8GB VRAM
  • Fun-CosyVoice3-0.5B: ~8-10GB VRAM
  • Batch Inference: Add 2-4GB per additional concurrent request

For CPU-only inference, expect 16GB+ RAM and significantly slower speeds (10-50x slower).

Q: Which languages are best supported?

A: Based on evaluation data:

  • Excellent: Chinese (Mandarin), English
  • Very Good: Japanese, Korean
  • Good: German, Spanish, French, Italian, Russian
  • Dialects: 18+ Chinese dialects with varying quality

English and Chinese have the most extensive training data and achieve the best results.

Q: How do I improve voice cloning quality?

A: Follow these guidelines (a preprocessing sketch follows the list):

  1. Reference Audio Quality:

    • Duration: 3-10 seconds optimal
    • Single speaker only
    • Clear speech, minimal background noise
    • Natural speaking pace
  2. Prompt Text Accuracy:

    • Provide exact transcription of reference audio
    • Match language and dialect
  3. Model Selection:

    • Use Fun-CosyVoice3-0.5B-2512_RL for best quality
    • Consider fine-tuning for specific voices
  4. Post-processing:

    • Apply noise reduction if needed
    • Normalize audio levels
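As promised above, here is a sketch of cleaning a reference clip before cloning: mono, 16 kHz (the rate at which the repository's examples load prompt audio), and peak-normalized. The specific steps and thresholds are illustrative.

```python
import torchaudio
import torchaudio.functional as AF

# Load the raw reference clip and collapse any stereo channels to mono.
wav, sr = torchaudio.load("path/to/reference_audio.wav")
wav = wav.mean(dim=0, keepdim=True)

# Resample to 16 kHz to match how the repo's examples load prompt audio.
wav = AF.resample(wav, orig_freq=sr, new_freq=16000)

# Peak-normalize so quiet recordings don't degrade speaker similarity.
wav = wav / wav.abs().max().clamp(min=1e-8)

torchaudio.save("reference_16k.wav", wav, 16000)
```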

Q: Can I fine-tune CosyVoice on my own data?

A: Yes, the repository includes training scripts in examples/libritts/cosyvoice/run.sh. Requirements:

  • High-quality paired audio-text data
  • GPU cluster (multi-GPU recommended)
  • Familiarity with flow matching training
  • See the paper for detailed training methodology

Q: What's the best deployment option for my use case?

A: Choose based on your requirements:

| Scenario | Recommended Setup | Rationale |
|---|---|---|
| Research/Testing | Web demo or Python API | Easy setup, full features |
| Small API (<100 req/day) | FastAPI + Docker | Simple deployment, good performance |
| Medium API (100-10K req/day) | vLLM + Load Balancer | 2-3x speedup, scalable |
| High-throughput (>10K req/day) | TensorRT-LLM + Kubernetes | 4x speedup, enterprise-grade |
| Real-time Streaming | CosyVoice2 + vLLM | Low latency, streaming support |

Q: How does CosyVoice compare to commercial TTS services?

A: Advantages over commercial services:

  • ✅ Full control and customization
  • ✅ No API costs or rate limits
  • ✅ Data privacy (on-premise deployment)
  • ✅ Access to model weights for research

Commercial services may offer:

  • ⚡ Simpler integration
  • 🔧 Managed infrastructure
  • 📞 Enterprise support

For most technical teams, CosyVoice's performance and flexibility outweigh the setup complexity.

Additional Resources

Official Links

  • GitHub Repository: https://github.com/FunAudioLLM/CosyVoice

Community and Support

  • GitHub Issues: Report bugs and request features
  • DingTalk Group: Join the official Chinese community (QR code in repository)
  • Research Papers: Read the academic papers for deep technical understanding

Related Projects

CosyVoice builds upon open-source work including FunASR, FunCodec, Matcha-TTS, AcademiCodec, and WeNet (see the acknowledgements in the repository).

Conclusion and Next Steps

Fun-CosyVoice 3.0 represents a significant advancement in open-source text-to-speech technology, combining state-of-the-art performance with practical deployment capabilities. High accuracy (0.81% CER), broad language support (9 languages plus 18+ dialects), and production-ready features (streaming, low latency) make it an excellent choice for both research and commercial applications.

Recommended Action Plan

  1. Get Started (Week 1):

    • Install CosyVoice following the setup guide
    • Test with web demo to understand capabilities
    • Experiment with different models and modes
  2. Evaluate (Week 2-3):

    • Test with your specific use cases
    • Benchmark performance on your hardware
    • Compare quality against your requirements
  3. Deploy (Week 4+):

    • Choose appropriate deployment method
    • Implement monitoring and logging
    • Optimize for your production workload
  4. Optimize (Ongoing):

    • Fine-tune on domain-specific data if needed
    • Implement caching strategies
    • Scale infrastructure based on usage

Stay Updated

The CosyVoice project is actively maintained with regular updates. Check the roadmap in the GitHub repository for upcoming features and improvements.


Disclaimer: This guide is based on information available as of December 2025. Always refer to the official documentation for the most current information and best practices.


Tags:
CosyVoice
Fun-CosyVoice3
Text-to-Speech
TTS Model
Multilingual TTS
Zero-Shot Voice Cloning
Speech Synthesis
AI Audio
Last updated: December 15, 2025