GLM-TTS Complete Guide 2025: Revolutionary Zero-Shot Voice Cloning with Reinforcement Learning
🎯 Core Highlights (TL;DR)
- Open-Source Excellence: GLM-TTS achieves the lowest Character Error Rate (0.89) among open-source TTS models while maintaining high speaker similarity
- Zero-Shot Capability: Clone any voice with just 3-10 seconds of audio prompt without fine-tuning
- RL-Enhanced Emotions: Multi-reward reinforcement learning framework delivers more natural and expressive speech compared to traditional TTS systems
- Production-Ready: Supports streaming inference, bilingual processing (Chinese/English), and phoneme-level pronunciation control
- Active Development: Released December 11, 2025, with ongoing updates including 2D Vocos vocoder and RL-optimized weights
Table of Contents
- What is GLM-TTS?
- Key Features and Capabilities
- System Architecture Explained
- How Does Reinforcement Learning Improve TTS?
- Performance Benchmarks
- Installation and Quick Start
- Use Cases and Applications
- Comparison with Other TTS Models
- Common Issues and Solutions
- FAQ
What is GLM-TTS?
GLM-TTS (General Language Model Text-to-Speech) is a cutting-edge, open-source text-to-speech synthesis system developed by Zhipu AI's CogAudio Group. Released in December 2025, it represents a significant advancement in voice cloning technology by combining large language models with reinforcement learning optimization.
Core Innovation
Unlike traditional TTS systems that struggle with emotional expressiveness, GLM-TTS introduces a Multi-Reward Reinforcement Learning framework that evaluates generated speech across multiple dimensions:
- Sound quality and naturalness
- Speaker similarity
- Emotional expression
- Pronunciation accuracy (Character Error Rate)
- Prosody and rhythm
💡 Key Advantage
GLM-TTS achieves a Character Error Rate of 0.89 with RL optimization - the best among open-source models and competitive with commercial systems like MiniMax (0.83 CER).
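For context, CER is the character-level edit distance between the transcript of the synthesized audio and the reference text, divided by the reference length, reported as a percentage (so 0.89 means roughly 0.89 errors per 100 characters). A minimal illustration, not part of the GLM-TTS codebase:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: Levenshtein edit distance over reference length."""
    ref, hyp = list(reference), list(hypothesis)
    dp = list(range(len(hyp) + 1))          # dynamic-programming edit distance
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution
            prev = cur
    return dp[-1] / max(len(ref), 1)

# One substitution out of six characters -> 1/6 ≈ 0.167 (multiply by 100 for
# the percentage figures quoted in the benchmarks).
print(cer("今天天气很好", "今天天汽很好"))
```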
Key Features and Capabilities
1. Zero-Shot Voice Cloning
What it means: Clone any speaker's voice without training or fine-tuning
Requirements:
- 3-10 seconds of prompt audio
- No speaker-specific model training needed
- Works with any voice sample
Technical approach:
- Extracts speaker embeddings with the CAM++ (campplus.onnx) speaker verification model
- Conditions the generation process on these embeddings
- Maintains voice characteristics across different text inputs (see the sketch below)
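A minimal sketch of the embedding-extraction step, assuming onnxruntime and torchaudio are available; the exact feature pipeline used by the repository (see cosyvoice/cli/frontend.py and frontend/campplus.onnx) may differ:

```python
import onnxruntime as ort
import torch
import torchaudio

def extract_speaker_embedding(wav_path: str, onnx_path: str = "frontend/campplus.onnx"):
    """Rough sketch: 80-dim fbank features -> CAM++ ONNX model -> speaker embedding."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0, keepdim=True)
    # CAM++-style models typically consume mean-normalized 80-dim filterbanks
    # (an assumption here — check the repository's frontend for the real preprocessing).
    feats = torchaudio.compliance.kaldi.fbank(wav, num_mel_bins=80, sample_frequency=16000)
    feats = feats - feats.mean(dim=0, keepdim=True)
    sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    (embedding,) = sess.run(None, {input_name: feats.unsqueeze(0).numpy()})
    return torch.from_numpy(embedding)  # conditioning vector for generation

# emb = extract_speaker_embedding("prompt.wav")
```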
2. RL-Enhanced Emotion Control
The system uses the GRPO (Group Relative Policy Optimization) algorithm with multiple reward functions:
| Reward Type | Purpose | Impact |
|---|---|---|
| Similarity | Match speaker characteristics | High speaker fidelity |
| CER (Character Error Rate) | Pronunciation accuracy | Reduced from 1.03 to 0.89 |
| Emotion | Natural emotional expression | More expressive speech |
| Laughter | Appropriate laugh insertion | Enhanced naturalness |
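The actual reward computation lives in grpo/reward_func.py; conceptually, the per-candidate signals are folded into a single scalar before policy optimization. A sketch of that idea, with illustrative weights that are not the project's values:

```python
from dataclasses import dataclass

@dataclass
class RewardWeights:
    similarity: float = 1.0
    cer: float = 1.0        # penalized: lower CER is better
    emotion: float = 0.5
    laughter: float = 0.25

def combined_reward(sim: float, cer: float, emotion: float, laughter: float,
                    w: RewardWeights = RewardWeights()) -> float:
    """Fold per-candidate reward signals into one scalar for GRPO."""
    return (w.similarity * sim
            - w.cer * cer            # subtract: pronunciation errors reduce reward
            + w.emotion * emotion
            + w.laughter * laughter)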
3. Phoneme-Level Control (Phoneme-in)
Problem solved: pronunciation ambiguity in polyphones and rare characters that automatic conversion often gets wrong
Example: The Chinese character "行" can be pronounced as xíng or háng depending on context
Solution: Hybrid Phoneme + Text input mechanism
Workflow:
1. Global G2P (Grapheme-to-Phoneme) conversion
2. Dynamic dictionary lookup for polyphones
3. Targeted phoneme replacement
4. Hybrid input generation
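A simplified sketch of the hybrid-input idea, using a made-up annotation format; the repository's actual G2P tables and replacement rules live in configs/G2P_*.json, configs/custom_replace.jsonl, and utils/glm_g2p.py:

```python
# Hypothetical polyphone dictionary: character -> context keyword -> pinyin
POLYPHONES = {
    "行": {"银行": "hang2", "default": "xing2"},
}

def annotate_polyphones(text: str) -> str:
    """Replace ambiguous characters with explicit phoneme annotations."""
    out = []
    for i, ch in enumerate(text):
        if ch in POLYPHONES:
            rules = POLYPHONES[ch]
            context = text[max(0, i - 1): i + 2]
            pinyin = next((p for word, p in rules.items()
                           if word != "default" and word in context),
                          rules["default"])
            out.append(f"{{{ch}|{pinyin}}}")   # e.g. {行|hang2} — illustrative markup only
        else:
            out.append(ch)
    return "".join(out)

print(annotate_polyphones("我去银行办事"))  # 我去银{行|hang2}办事
```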
⚠️ Use Case Specificity
Phoneme-level control is particularly valuable for:
- Educational content and assessments
- Audiobook production
- Language learning applications
- Technical documentation with specialized terminology
4. Streaming Inference Support
- Real-time audio generation
- Suitable for interactive applications
- Low-latency processing
- Ideal for conversational AI and virtual assistants
5. Bilingual Support
- Primary: Chinese language
- Secondary: English
- Mixed text processing capability
- Text normalization for both languages
System Architecture Explained
GLM-TTS employs a sophisticated two-stage architecture:
Stage 1: LLM-Based Token Generation
Model: Llama-based architecture
Input: Text (with optional phoneme annotations)
Output: Speech token sequences
Modes supported:
- Pretrained (PRETRAIN)
- Fine-tuning (SFT)
- LoRA (Low-Rank Adaptation)
Stage 2: Flow Matching for Waveform Synthesis
Components:
- DiT (Diffusion Transformer): Converts tokens to mel-spectrograms
- Vocoder: Generates final audio waveforms
- Vocos vocoder (current)
- 2D Vocos vocoder (coming soon)
- HiFT vocoder (alternative)
Architecture Visualization
```
Text Input → Frontend Processing → LLM (Token Generation)
                                          ↓
Speech Tokens → Flow Matching Model → Mel-Spectrogram
                                          ↓
                        Vocoder → Audio Waveform Output

[Parallel Path]
Prompt Audio → Speaker Embedding Extraction → Conditioning Signal
```
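The same flow expressed as a conceptual Python skeleton; all method bodies are placeholders and the names are illustrative, not the project's real API (the real entry point is glmtts_inference.py):

```python
import numpy as np

class GLMTTSPipeline:
    """Conceptual two-stage skeleton; see llm/glmtts.py, flow/dit.py, flow/flow.py."""

    def synthesize(self, text: str, prompt_wav: str) -> np.ndarray:
        spk = self.extract_speaker_embedding(prompt_wav)                # parallel conditioning path
        tokens = self.llm_generate_tokens(self.normalize(text), spk)    # Stage 1: Llama-based LLM
        mel = self.flow_matching_decode(tokens, spk)                    # Stage 2: DiT flow matching
        return self.vocoder(mel)                                        # Vocos / HiFT waveform

    # Placeholder hooks — not a real interface, only the structural shape.
    def normalize(self, text): raise NotImplementedError
    def extract_speaker_embedding(self, wav): raise NotImplementedError
    def llm_generate_tokens(self, text, spk): raise NotImplementedError
    def flow_matching_decode(self, tokens, spk): raise NotImplementedError
    def vocoder(self, mel): raise NotImplementedError
```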
📊 Technical Specifications
- VRAM requirement: ~8GB for inference
- Supported Python versions: 3.10 - 3.12
- Model size: Multiple components totaling several GB
- Inference speed: Supports real-time streaming
How Does Reinforcement Learning Improve TTS?
Traditional TTS systems often produce flat, emotionless speech. GLM-TTS addresses this through a multi-reward RL framework:
The GRPO Training Process
1. Generation Phase
   - Model generates multiple speech candidates for the same text
   - Each candidate is synthesized through the full pipeline
2. Reward Computation
   - Distributed reward server evaluates each candidate
   - Multiple reward functions run in parallel
   - Token-level rewards provide fine-grained feedback
3. Policy Optimization
   - GRPO algorithm compares candidates within each group
   - Updates LLM policy to favor higher-reward generations
   - Balances multiple objectives simultaneously
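A minimal sketch of the group-relative advantage computation at the heart of GRPO; the project's implementation is in grpo/grpo_utils.py and grpo/train_ds_grpo.py, and this shows only the core idea:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within a group of candidates generated from the same text:
    candidates above the group mean get a positive advantage, those below a negative one."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four candidate utterances for one prompt, scored by the reward server.
rewards = torch.tensor([0.62, 0.71, 0.55, 0.80])
print(group_relative_advantages(rewards))
# The policy update then up-weights token sequences with positive advantage.
```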
Measurable Improvements
| Metric | Base Model | RL-Optimized | Improvement |
|---|---|---|---|
| CER | 1.03 | 0.89 | 13.6% reduction |
| Similarity | 76.1 | 76.4 | +0.3 points |
| Expressiveness | Baseline | Enhanced | Qualitative |
✅ Best Practice
The RL-optimized model (GLM-TTS_RL) is recommended for production use when emotional expressiveness is critical, while the base model may be sufficient for straightforward narration tasks.
Performance Benchmarks
Seed-TTS-Eval Chinese Test Set Results
Evaluated without phoneme flag to maintain consistency with original benchmarks:
| Model | CER ↓ | SIM ↑ | Open Source | Notes |
|---|---|---|---|---|
| GLM-TTS_RL | 0.89 | 76.4 | ✅ Yes | Best open-source CER |
| VoxCPM | 0.93 | 77.2 | ✅ Yes | Strong similarity |
| GLM-TTS Base | 1.03 | 76.1 | ✅ Yes | Pre-RL baseline |
| IndexTTS2 | 1.03 | 76.5 | ✅ Yes | Comparable CER |
| DiTAR | 1.02 | 75.3 | ❌ No | Closed source |
| CosyVoice3 | 1.12 | 78.1 | ❌ No | Higher similarity |
| Seed-TTS | 1.12 | 79.6 | ❌ No | Best similarity |
| MiniMax | 0.83 | 78.3 | ❌ No | Best overall CER |
| F5-TTS | 1.53 | 76.0 | ✅ Yes | Open alternative |
| CosyVoice2 | 1.38 | 75.7 | ✅ Yes | Open alternative |
Key Findings
- GLM-TTS_RL leads all open-source models in pronunciation accuracy (CER)
- Only 0.06 points behind the best commercial model (MiniMax)
- Maintains competitive speaker similarity scores
- Significantly outperforms other open-source alternatives
Installation and Quick Start
Prerequisites
- Python 3.10, 3.11, or 3.12
- ~8GB VRAM for inference
- Git and pip installed
- CUDA-compatible GPU recommended (CPU inference possible but slower)
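Before installing, a quick way to check your GPU against the ~8GB VRAM guideline (PyTorch is installed by the requirements file anyway):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 8:
        print("Warning: below the ~8 GB recommended for the full GLM-TTS pipeline.")
else:
    print("No CUDA GPU detected — CPU inference is possible but much slower.")
```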
Step 1: Clone Repository
```bash
git clone https://github.com/zai-org/GLM-TTS.git
cd GLM-TTS
```
Step 2: Install Dependencies
pip install -r requirements.txt
⚠️ Common Installation Issue
Users on Linux may encounter problems with WeTextProcessing/cython/pynini. Solution:
```bash
# Comment out WeTextProcessing in requirements.txt, then:
pip install -r requirements.txt
pip install WeTextProcessing
pip install soxr
```
Step 3: Download Pre-trained Models
Option A: HuggingFace
```bash
mkdir -p ckpt
pip install -U huggingface_hub
huggingface-cli download zai-org/GLM-TTS --local-dir ckpt
```
Option B: ModelScope (China)
```bash
mkdir -p ckpt
pip install -U modelscope
modelscope download --model ZhipuAI/GLM-TTS --local_dir ckpt
```
Step 4: Run Inference
Command Line:
```bash
python glmtts_inference.py \
    --data=example_zh \
    --exp_name=_test \
    --use_cache
# Add --phoneme flag for phoneme-level control
```
Interactive Web Interface:
python tools/gradio_app.py
Step 5 (Optional): Install RL Components
For training or advanced features:
```bash
cd grpo/modules
git clone https://github.com/s3prl/s3prl
git clone https://github.com/omine-me/LaughterSegmentation
# Download wavlm_large_finetune.pth to grpo/ckpt/
```
Use Cases and Applications
1. Content Creation
- Audiobook production: Phoneme control for accurate pronunciation
- Podcast generation: Natural, expressive narration
- Video voiceovers: Quick voice cloning for character consistency
2. Educational Technology
- Language learning: Accurate pronunciation modeling
- E-learning platforms: Engaging, emotional narration
- Assessment tools: Pronunciation evaluation reference
3. Accessibility
- Screen readers: More natural voice output
- Assistive communication: Personalized voice synthesis
- Text-to-speech for visually impaired users
4. Entertainment
- Game character voices: Zero-shot voice cloning for NPCs
- Virtual influencers: Consistent voice identity
- Interactive storytelling: Emotional voice adaptation
5. Enterprise Applications
- Customer service bots: Natural conversation flow
- IVR systems: Professional voice synthesis
- Internal training materials: Consistent narration
Comparison with Other TTS Models
GLM-TTS vs. CosyVoice2
| Aspect | GLM-TTS | CosyVoice2 |
|---|---|---|
| CER | 0.89 (RL) / 1.03 (base) | 1.38 |
| Architecture | LLM + Flow | Different approach |
| RL Optimization | ✅ Yes | ❌ No |
| Open Source | ✅ Full | ✅ Full |
| Phoneme Control | ✅ Hybrid input | Limited |
GLM-TTS vs. F5-TTS
| Aspect | GLM-TTS | F5-TTS |
|---|---|---|
| CER | 0.89 | 1.53 |
| Memory Usage | ~8GB VRAM | Lower (competitor advantage) |
| Emotional Expression | RL-enhanced | Standard |
| Streaming | ✅ Yes | ✅ Yes |
| Language Support | CN/EN | Varies |
GLM-TTS vs. Commercial Models (Seed-TTS, MiniMax)
Advantages of GLM-TTS:
- ✅ Fully open-source
- ✅ Self-hostable
- ✅ No API costs
- ✅ Privacy control
- ✅ Customizable
Advantages of Commercial Models:
- Slightly better CER (MiniMax: 0.83 vs GLM-TTS: 0.89)
- Higher similarity scores (Seed-TTS: 79.6 vs GLM-TTS: 76.4)
- Managed infrastructure
- No local hardware requirements
💡 Decision Framework
Choose GLM-TTS if you need:
- Full control over the model
- Privacy for sensitive content
- Cost savings at scale
- Customization capabilities
Choose commercial models if you need:
- Absolute best quality
- Zero infrastructure management
- Immediate deployment
Common Issues and Solutions
Issue 1: Installation Failures on Linux
Symptom: Errors with WeTextProcessing, cython, or pynini during pip install -r requirements.txt
Solution:
```bash
# Edit requirements.txt to comment out WeTextProcessing
pip install -r requirements.txt
pip install WeTextProcessing
pip install soxr
```
Confirmed working on: Linux/WSL with conda Python 3.12
Issue 2: Online Demo Returns 404
Symptom: The link to audio.z.ai demo is not accessible
Status: Demo infrastructure not yet deployed (as of December 11, 2025)
Workaround: Use local Gradio interface:
python tools/gradio_app.py
Issue 3: Contractions Expanded in Output
Symptom: "I'm" becomes "I am", "don't" becomes "do not" in generated audio
Cause: Model trained to expand contractions for clarity
Workaround:
- Pre-process text to expand contractions manually
- Or accept this behavior as designed (similar to Star Trek's Data character)
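If you prefer the written script to match the spoken output exactly, a small pre-processing pass can expand contractions before synthesis (an illustrative, non-exhaustive mapping):

```python
import re

CONTRACTIONS = {
    "i'm": "I am", "don't": "do not", "can't": "cannot",
    "it's": "it is", "won't": "will not", "we're": "we are",
}

def expand_contractions(text: str) -> str:
    """Expand common English contractions so the script matches the spoken output."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm sure it doesn't matter"))  # "I am sure it doesn't matter"
```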
Issue 4: Chinese Accent in English Output
Symptom: English speech has noticeable Chinese accent
Cause: Model primarily trained on Chinese data with English as secondary language
Expected behavior: The accent is comparable to that of a native Chinese speaker who has lived in an English-speaking country for a few years
Mitigation:
- Use English-native prompt audio
- Consider fine-tuning on English-heavy datasets
- Or use specialized English TTS models for accent-critical applications
Issue 5: Special Characters Cause Output Issues
Symptom: A single underscore (_) or other special characters can derail the rest of the generated output
Cause: Frontend text processing limitations
Solution:
- Pre-process text to remove or replace special characters
- Use the text normalization utilities in cosyvoice/cli/frontend.py
- Report specific cases to the GitHub repository
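Until the frontend handles these cases, a simple pre-processing pass can strip or replace characters known to cause trouble; the character set below is illustrative, not exhaustive:

```python
import re

def sanitize_for_tts(text: str) -> str:
    """Remove or replace characters that the text frontend may mishandle."""
    text = text.replace("_", " ")                       # underscores reported to break output
    text = re.sub(r"[*#~`<>\[\]{}|\\^]", " ", text)     # markup-ish symbols
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    return text

print(sanitize_for_tts("chapter_01 *draft* [v2]"))  # "chapter 01 draft v2"
```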
Issue 6: High VRAM Usage
Symptom: ~8GB VRAM required, limiting accessibility
Context: This is expected for the full model pipeline
Alternatives for lower VRAM:
- Use quantized models (when available)
- Consider lighter alternatives like Kokoro or F5-TTS
- Use CPU inference (slower but possible)
Issue 7: Suspicious File Warning on HuggingFace
Symptom: "This model has 1 file scanned as suspicious" - pickle imports detected on generator_jit.ckpt
Explanation: PyTorch pickle files can contain arbitrary code
Status: Team needs to convert pickles to safetensors format
Risk mitigation:
- Download from official sources only
- Review code before running
- Use in isolated environments
- Wait for safetensors conversion
Troubleshooting: No Streaming Code in Repository
Question from community: "It says it can be used for realtime streaming. I don't see code for that in the repo. Anyone know how to do that?"
Current status:
- Streaming capability mentioned in documentation
- Implementation details are in flow/flow.py (streaming Flow model)
- Specific streaming inference examples are not yet provided
Recommendation:
- Check flow/flow.py for the streaming implementation
- Monitor GitHub issues for community solutions
- Consider contributing streaming examples to the project
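For orientation, the usual consumption pattern for a streaming TTS backend looks like the following; stream_synthesize is a hypothetical generator standing in for whatever the streaming path in flow/flow.py ultimately exposes, not an actual GLM-TTS API:

```python
import numpy as np
import soundfile as sf

def stream_synthesize(text: str, prompt_wav: str):
    """Hypothetical placeholder: a real streaming backend would yield audio
    chunks (e.g. float32 numpy arrays) as the flow model produces them."""
    raise NotImplementedError("wire this to the streaming path in flow/flow.py")

def save_streamed(text: str, prompt_wav: str, out_path: str, sr: int = 24000):
    chunks = []
    for chunk in stream_synthesize(text, prompt_wav):
        chunks.append(chunk)            # in a live app: play or send each chunk immediately
    sf.write(out_path, np.concatenate(chunks), sr)
```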
🤔 Frequently Asked Questions (FAQ)
Q: What languages does GLM-TTS support?
A: GLM-TTS primarily supports Chinese with secondary support for English. It can handle mixed Chinese-English text. For other languages, the model does not have native support, though some users have experimented with phoneme input using espeak-ng to output IPA (International Phonetic Alphabet). However, the tokenizer is optimized for Pinyin (Chinese phonemes), so results for other languages may be unpredictable.
Q: How much VRAM do I need to run GLM-TTS?
A: Approximately 8GB VRAM is required for inference with the full model pipeline. This includes:
- LLM for token generation
- Flow model for mel-spectrogram conversion
- Vocoder for waveform synthesis
For lower VRAM systems, consider using CPU inference (slower) or waiting for quantized model releases.
Q: Can I fine-tune GLM-TTS for a specific voice or language?
A: Yes, the model supports multiple training modes:
- LoRA (Low-Rank Adaptation): Efficient fine-tuning for specific voices
- SFT (Supervised Fine-Tuning): Full model fine-tuning
- Pretrained mode: Use as-is without fine-tuning
Configuration files are provided in the configs/ directory. However, detailed fine-tuning tutorials are not yet available in the documentation.
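As a rough illustration of what LoRA adaptation of the Llama-based token LLM could look like with the peft library; this is generic PEFT usage with a hypothetical checkpoint path, not the project's training script, so follow the configs/ directory and repository docs for the supported workflow:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical path: load the token-generating LLM as a causal LM checkpoint.
base_model = AutoModelForCausalLM.from_pretrained("ckpt/llm")

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Llama attention projections
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```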
Q: How does GLM-TTS compare to Elevenlabs?
A: Quality: Elevenlabs still leads in overall naturalness and emotional range, but GLM-TTS is competitive, especially with RL optimization.
Language support: Elevenlabs supports 29+ languages, while GLM-TTS focuses on Chinese and English.
Cost: GLM-TTS is free and open-source; Elevenlabs is a paid service.
Privacy: GLM-TTS can be self-hosted for complete data control.
Customization: GLM-TTS offers full model access for customization.
Q: What's the difference between GLM-TTS and GLM-TTS_RL?
A:
- GLM-TTS (Base): The pre-trained model without reinforcement learning optimization
  - CER: 1.03
  - Similarity: 76.1
  - Standard emotional expressiveness
- GLM-TTS_RL: The same model after multi-reward RL optimization
  - CER: 0.89 (13.6% improvement)
  - Similarity: 76.4
  - Enhanced emotional expressiveness and prosody
Recommendation: Use GLM-TTS_RL for production applications where quality is critical.
Q: Is GLM-TTS suitable for real-time applications?
A: Yes, GLM-TTS supports streaming inference, making it suitable for:
- Interactive voice assistants
- Real-time conversation systems
- Live narration applications
However, actual latency depends on hardware capabilities. With adequate GPU resources, real-time performance is achievable.
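A simple way to quantify "real-time" is the real-time factor (RTF = synthesis time / audio duration): anything below 1.0 keeps up with playback. A small measuring harness, with synthesize standing in for whatever inference call you use:

```python
import time

def real_time_factor(synthesize, text: str, sample_rate: int = 24000) -> float:
    """RTF < 1.0 means the system generates audio faster than it plays back."""
    start = time.perf_counter()
    audio = synthesize(text)               # hypothetical: returns a 1-D sample array
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate
    return elapsed / duration

# rtf = real_time_factor(my_tts, "A quick latency check.")
# print(f"RTF: {rtf:.2f}")
```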
Q: How do I control pronunciation of specific words?
A: Use the Phoneme-in mechanism:
- Enable phoneme mode with the --phoneme flag
- Use the hybrid input format: mix text with phoneme annotations
- Configure custom pronunciations in configs/custom_replace.jsonl (see the sketch after the list below)
- The system will use your specified phonemes for marked words while processing the rest normally
This is particularly useful for:
- Polyphones (words with multiple pronunciations)
- Rare characters
- Technical terminology
- Proper nouns
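Conceptually, each line of a JSONL override file maps a surface form to the phoneme sequence you want. The sketch below writes such entries to a scratch file; the field names are hypothetical, so inspect the configs/custom_replace.jsonl shipped with the repository for the real schema before adapting this:

```python
import json

# Hypothetical pronunciation overrides — field names are illustrative only.
overrides = [
    {"text": "行长", "phoneme": "hang2 zhang3"},   # bank president, not xing2 zhang3
    {"text": "重庆", "phoneme": "chong2 qing4"},   # the city Chongqing
]
with open("my_pronunciation_overrides.jsonl", "w", encoding="utf-8") as f:
    for entry in overrides:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```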
Q: Can I use GLM-TTS commercially?
A: The model is open-source and released on GitHub and HuggingFace. Check the repository's LICENSE file for specific terms. Generally, open-source models allow commercial use, but:
- Verify the license terms
- Note that prompt audio examples in the repository are marked "for research use only"
- Ensure your use case complies with any restrictions
Q: What's coming next for GLM-TTS?
A: According to the project roadmap:
- 2D Vocos vocoder update (in progress)
- RL-optimized model weights (coming soon)
- Potential for additional language support
- Community contributions for streaming examples
- Improved documentation and tutorials
Q: How can I contribute to the project?
A: The project welcomes contributions:
- Report issues on GitHub
- Submit pull requests for bug fixes or features
- Share your use cases and results
- Contribute to documentation
- Help with language support expansion
Repository: https://github.com/zai-org/GLM-TTS
Community Reception and Feedback
Positive Reactions
From the Reddit discussion on r/LocalLLaMA:
"How many models are you guys gonna release! This is insane in a good way!" - Community excitement about ZAI's rapid release pace
"Kudos GLM team, keep it up guys." - Appreciation for open-source contributions
Concerns and Requests
- Language support: Multiple users requested support beyond Chinese and English
- Installation complexity: Several users spent hours troubleshooting dependencies
- Documentation gaps: Lack of clear examples and demos initially
- Model abandonment fears: Community hopes the project remains actively maintained, citing other abandoned TTS projects
Comparison with Other Models
Community members actively discussed GLM-TTS in context of:
- Qwen2.5-Omni: Another multimodal model with TTS capabilities
- Chatterbox: Praised for multilingual support
- VoxCPM: Noted for LoRA fine-tuning capabilities
- Kokoro and F5-TTS: Compared for memory efficiency
Best Practices for Using GLM-TTS
1. Prompt Audio Selection
✅ Do:
- Use clean, high-quality audio (16kHz or higher)
- Choose 3-10 seconds of clear speech
- Select audio with consistent volume
- Prefer single-speaker recordings
❌ Don't:
- Use audio with background noise
- Use multi-speaker recordings
- Use music or non-speech audio
- Use heavily compressed audio
2. Text Preparation
✅ Do:
- Normalize text (remove special characters)
- Use proper punctuation for prosody
- Expand abbreviations
- Use phoneme annotations for ambiguous words
❌ Don't:
- Include markdown or HTML formatting
- Use excessive special characters
- Rely on contractions if you need them preserved
- Mix too many languages in one sentence
3. Performance Optimization
- Use caching: Enable the --use_cache flag to avoid reprocessing
- Batch processing: Process multiple texts together when possible
- GPU selection: Use CUDA-compatible GPU for best performance
- Model selection: Use base model for simple narration, RL model for expressive content
4. Quality Assurance
- Listen to outputs: Always review generated audio
- Test edge cases: Verify pronunciation of numbers, dates, abbreviations
- Compare speakers: Test with different prompt audio to find best match
- Iterate on text: Adjust punctuation and phrasing for better prosody
Technical Deep Dive: Project Structure
Understanding the codebase organization:
```
GLM-TTS/
├── glmtts_inference.py          # Main entry point
├── configs/                     # Configuration files
│   ├── spk_prompt_dict.yaml     # Speaker prompts
│   ├── G2P_*.json               # Phoneme conversion
│   └── custom_replace.jsonl     # Custom rules
├── llm/
│   └── glmtts.py                # LLM implementation
├── flow/
│   ├── dit.py                   # Diffusion Transformer
│   ├── flow.py                  # Streaming Flow model
│   └── modules.py               # Flow components
├── grpo/                        # Reinforcement Learning
│   ├── grpo_utils.py            # GRPO algorithm
│   ├── reward_func.py           # Reward functions
│   ├── reward_server.py         # Distributed rewards
│   └── train_ds_grpo.py         # Training script
├── cosyvoice/
│   └── cli/frontend.py          # Text/audio preprocessing
├── frontend/
│   ├── campplus.onnx            # Speaker embedding
│   └── cosyvoice_frontend.yaml  # Frontend config
└── tools/
    ├── gradio_app.py            # Web interface
    └── ffmpeg_speech_control.py # Audio processing
```
Key Components to Explore
- For inference customization: glmtts_inference.py
- For phoneme control: utils/glm_g2p.py and configs/G2P_*.json
- For RL training: grpo/train_ds_grpo.py
- For frontend modifications: cosyvoice/cli/frontend.py
- For streaming: flow/flow.py
Conclusion and Recommendations
Key Takeaways
- GLM-TTS sets a new standard for open-source TTS with its 0.89 CER, outperforming all other open-source alternatives
- Reinforcement learning makes a measurable difference in both quality metrics and emotional expressiveness
- Zero-shot voice cloning works effectively with just 3-10 seconds of prompt audio
- The project is actively developed with a clear roadmap and responsive community
Who Should Use GLM-TTS?
Ideal for:
- Developers building voice applications in Chinese or English
- Content creators needing high-quality voice synthesis
- Researchers exploring TTS and RL techniques
- Organizations requiring self-hosted, privacy-preserving TTS
- Projects where pronunciation accuracy is critical
Consider alternatives if:
- You need support for languages beyond Chinese/English
- You have very limited VRAM (<8GB)
- You need the absolute highest quality (consider commercial options)
- You want a more mature, extensively documented solution
Next Steps
- Try the demo: Install locally and test with your use case
- Join the community: Follow the GitHub repository for updates
- Experiment with RL model: Compare base vs. RL-optimized versions
- Explore phoneme control: Test pronunciation accuracy for your domain
- Contribute back: Share your findings, report issues, or submit improvements
Resources
- GitHub Repository: https://github.com/zai-org/GLM-TTS
- HuggingFace Model: https://huggingface.co/zai-org/GLM-TTS
- ModelScope (China): https://modelscope.cn/models/ZhipuAI/GLM-TTS
- Official Demo (coming soon): https://audio.z.ai/
- Community Discussion: r/LocalLLaMA on Reddit
Citation
If you use GLM-TTS in your research or projects, please cite:
```bibtex
@misc{glmtts2025,
  title     = {GLM-TTS: Controllable & Emotion-Expressive Zero-shot TTS with Multi-Reward Reinforcement Learning},
  author    = {CogAudio Group Members},
  year      = {2025},
  publisher = {Zhipu AI Inc}
}
```
Last Updated: December 11, 2025
Model Version: GLM-TTS v1.0 (Base and RL-optimized)
Status: Active development with upcoming 2D Vocos vocoder update
💡 Stay Updated
Star the GitHub repository to receive notifications about new releases, including the upcoming RL-optimized weights and 2D Vocos vocoder improvements.