Xiaomi MiMo-V2-Flash: Complete Guide to the 309B Parameter MoE Model (2025)
🎯 Core Highlights (TL;DR)
- Efficient Architecture: 309B total parameters with only 15B active parameters using Mixture-of-Experts (MoE) design
- Breakthrough Performance: Claims to match or exceed Claude Sonnet 4.5 and GPT-5 on coding benchmarks like SWE-Bench Multilingual
- Innovative Technology: Hybrid Sliding Window Attention reduces KV cache by 6x, Multi-Token Prediction triples inference speed
- Open Source: Fully available on Hugging Face with weights and technical documentation
- Mixed Reception: Community testing shows inconsistent results - some praise efficiency, others report instruction-following issues
Table of Contents
- What is MiMo-V2-Flash?
- Technical Architecture Deep Dive
- Performance Benchmarks Analysis
- How to Deploy and Run
- Community Feedback and Real-World Testing
- Comparison with Competitors
- Best Practices and Recommendations
- FAQ
What is MiMo-V2-Flash?
MiMo-V2-Flash is Xiaomi's latest open-source large language model, released in late 2025. It represents a significant entry into the competitive open LLM space, positioning itself as a high-performance yet efficient alternative to models like DeepSeek V3.2 and Claude Sonnet 4.5.
Key Specifications
| Specification | Details |
|---|---|
| Total Parameters | 309B |
| Active Parameters | 15B (per forward pass) |
| Architecture Type | Mixture-of-Experts (MoE) |
| Context Window | Up to 256K tokens |
| Training Data | 27T tokens with FP8 mixed precision |
| License | Open source (available on Hugging Face) |
| Variants | Base model and Post-trained model |
💡 What Makes It "Flash"?
The "Flash" designation refers to inference speed, not model size. Despite 309B total parameters, the MoE architecture activates only 15B per request, enabling faster generation while maintaining quality.
Technical Architecture Deep Dive
Hybrid Sliding Window Attention (SWA)
MiMo-V2-Flash's most innovative feature is its attention mechanism:
Architecture Design:
- 5:1 Ratio: 5 Sliding Window Attention (SWA) layers followed by 1 Global Attention (GA) layer
- Aggressive Window Size: Only 128 tokens (compared to typical 2048-4096)
- Attention Sink Bias: Learnable bias mechanism maintains long-context performance
- KV Cache Reduction: Nearly 6x reduction in memory requirements (see the estimate below)
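To see roughly where the "nearly 6x" figure comes from, here is a back-of-envelope estimate. It assumes the 48-layer depth mentioned later in this guide, the 5:1 SWA/GA layout described above, and identical per-token KV size in every layer; it is a sketch, not the model's actual memory accounting.

```python
def kv_reduction(seq_len: int, n_layers: int = 48, window: int = 128, swa_per_ga: int = 5) -> float:
    """Estimate the KV-cache reduction of a 5:1 SWA/GA stack vs. an all-global stack.

    Assumptions (not from the official implementation): 48 transformer layers,
    each SWA layer caches at most `window` tokens, each GA layer caches the
    full sequence, and per-token KV size is identical across layers.
    """
    group = swa_per_ga + 1                       # 5 SWA layers + 1 GA layer per group
    ga_layers = n_layers // group                # one global-attention layer per group
    swa_layers = n_layers - ga_layers
    hybrid = swa_layers * min(seq_len, window) + ga_layers * seq_len
    baseline = n_layers * seq_len                # all layers keep full-length KV
    return baseline / hybrid

for s in (4_096, 32_768, 262_144):
    print(f"{s:>7} tokens -> ~{kv_reduction(s):.1f}x smaller KV cache")
```

The saving approaches 6x as the context grows, because the five SWA layers in each group cache a near-constant 128 tokens regardless of sequence length.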
Multi-Token Prediction (MTP)
Unlike traditional speculative decoding, MiMo integrates MTP natively:
- Lightweight Design: Only 0.33B parameters per block
- Dense FFN Architecture: Uses dense feed-forward networks instead of MoE for prediction heads
- 3x Speed Boost: Triples output generation speed during inference
- RL Training Benefits: Accelerates rollout in reinforcement learning training
⚠️ Technical Note
The MTP module uses Sliding Window Attention instead of Global Attention to keep parameter count minimal, which is crucial for maintaining the "Flash" speed advantage.
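To illustrate the principle, the sketch below shows a generic draft-and-verify decoding step in the spirit of MTP: a lightweight head proposes a few future tokens and the main model verifies them in a single forward pass. The interfaces (draft_next_k, verify_argmax) are hypothetical stand-ins, not MiMo's actual API; the real model fuses this logic into its serving kernels.

```python
from typing import Callable, List

def mtp_decode_step(
    prefix: List[int],
    draft_next_k: Callable[[List[int], int], List[int]],   # lightweight draft head (hypothetical interface)
    verify_argmax: Callable[[List[int]], List[int]],        # main model's greedy prediction after each position
    k: int = 3,
) -> List[int]:
    """One draft-and-verify step in the spirit of Multi-Token Prediction.

    The draft head proposes `k` future tokens; the main model scores the prefix
    plus the draft in one forward pass, and we keep the longest run of draft
    tokens that matches the main model's own greedy choices.
    """
    draft = draft_next_k(prefix, k)
    # Keep the predictions for the last k+1 positions:
    # targets[i] is the main model's choice after prefix + draft[:i].
    targets = verify_argmax(prefix + draft)[-(k + 1):]
    accepted: List[int] = []
    for proposed, target in zip(draft, targets):
        if proposed != target:
            accepted.append(target)              # replace the first mismatch with the verified token
            break
        accepted.append(proposed)                # draft token confirmed by the main model
    else:
        accepted.append(targets[k])              # all drafts accepted: take one bonus verified token
    return accepted                              # 1..k+1 tokens emitted per main-model forward pass
```

Each step emits between 1 and k+1 tokens per main-model forward pass; the claimed ~3x throughput gain corresponds to a high acceptance rate on the drafted tokens.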
Training Infrastructure
Pre-training Specifications:
- 27 trillion tokens processed
- FP8 mixed precision training
- Native 32K sequence length
- Extended to 256K context window support
Post-training Innovation:
- Multi-Teacher On-Policy Distillation (MOPD): Dense token-level guidance from domain-specific expert models (sketched below)
- Large-Scale Agentic RL: Over 100,000 verifiable GitHub issue tasks
- Multimodal Verifier: Vision-based code verification using video recordings
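The article describes MOPD only at a high level, but the core idea of on-policy, token-level distillation can be sketched as follows: the student generates a rollout, each teacher scores it token by token, and the student is pulled toward the (here, simply averaged) teacher distributions. The averaging scheme, tensor shapes, and loss form are assumptions for illustration, not Xiaomi's published recipe.

```python
import torch
import torch.nn.functional as F

def mopd_loss(student_logits: torch.Tensor, teacher_logits_list: list) -> torch.Tensor:
    """Token-level distillation loss on a student-generated (on-policy) rollout.

    Shapes: [batch, seq_len, vocab]. Averaging the teacher distributions is an
    assumption for illustration; the article only states that domain-specific
    teachers provide dense, token-level guidance on the student's own samples.
    """
    teacher_probs = torch.stack(
        [F.softmax(t, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)                                  # average the teacher distributions per token
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Forward KL(teacher || student), computed on every generated token.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

# Usage with dummy tensors (real training would use rollouts sampled from the student):
student = torch.randn(2, 16, 32000, requires_grad=True)
teachers = [torch.randn(2, 16, 32000) for _ in range(2)]
loss = mopd_loss(student, teachers)
loss.backward()
```

Because the loss is computed on the student's own generations rather than a fixed dataset, there is no exposure bias from static training data, which is the property the MOPD description highlights.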
Performance Benchmarks Analysis
Base Model Performance
The MiMo-V2-Flash-Base shows competitive results across standard benchmarks:
| Category | Benchmark | MiMo-V2-Flash | Kimi-K2 Base | DeepSeek-V3.2 |
|---|---|---|---|---|
| General | MMLU (5-shot) | 86.7 | 87.8 | 87.8 |
| General | MMLU-Pro (5-shot) | 73.2 | 69.2 | 62.1 |
| General | GPQA-Diamond | 55.1 | 48.1 | 52.0 |
| Math | GSM8K (8-shot) | 92.3 | 92.1 | 91.1 |
| Math | MATH (4-shot) | 71.0 | 70.2 | 62.5 |
| Math | AIME 24&25 | 35.3 | 31.6 | 24.8 |
| Code | HumanEval+ | 70.7 | 84.8 | 67.7 |
| Code | BigCodeBench | 70.1 | 61.7 | 62.9 |
| Code | LiveCodeBench v6 | 30.8 | 26.3 | 24.9 |
Post-Training Model Performance
After MOPD and Agentic RL training, the model shows impressive gains:
| Benchmark | MiMo-V2-Flash | Claude Sonnet 4.5 | GPT-5 High | DeepSeek-V3.2 |
|---|---|---|---|---|
| AIME 2025 | 94.1 | 87.0 | 94.6 | 93.1 |
| LiveCodeBench-v6 | 80.6 | 64.0 | 84.5 | 83.3 |
| SWE-Bench Verified | 73.4 | 77.2 | 74.9 | 73.1 |
| SWE-Bench Multilingual | 71.7 | 68.0 | 55.3 | 70.2 |
| LongBench V2 | 60.6 | 61.8 | - | 58.4 |
✅ Key Strengths
- Exceptional performance on multilingual coding tasks
- Strong mathematical reasoning (AIME 2025: 94.1%)
- Competitive long-context handling up to 256K tokens
- Superior efficiency: 15B active vs 37B+ in competitors
⚠️ Potential Concerns
- Some benchmarks show suspiciously high scores for model size
- Community testing reveals inconsistencies with official benchmarks
- Possible benchmark contamination or overfitting concerns raised
How to Deploy and Run
System Requirements
Minimum Hardware for Q4 Quantization:
- 2x RTX 5060 Ti (16GB each) = 32GB VRAM
- 128GB System RAM
- Expected speed: ~8 tokens/second
Recommended Hardware:
- 8x H100 or A100 GPUs for full FP8 precision
- High-bandwidth interconnect (NVLink/InfiniBand)
💡 Quantization Recommendations
- Q3/IQ3_XS: Fits the minimum 32GB VRAM + 128GB RAM setup (experts offloaded to system RAM), minimal quality loss
- Q4: Borderline for that setup, may require further offloading or optimization (see the estimate below)
- FP8: Requires 160GB+ VRAM for full model
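As a sanity check on these recommendations, the quick estimate below computes weights-only storage at a few common bit-widths. The bits-per-weight values are rough GGUF-style approximations, not official figures; the point is that the full quantized model is expected to live mostly in system RAM, while the roughly 15B-parameter active slice is what needs to be hot in VRAM per token.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (ignores KV cache, activations, and runtime overhead)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

total, active = 309, 15  # total vs. active parameters, in billions
for name, bpw in [("FP8", 8.0), ("Q4", 4.5), ("Q3/IQ3_XS", 3.3)]:
    print(f"{name:>10}: total ~{weight_gb(total, bpw):5.0f} GB, "
          f"active experts ~{weight_gb(active, bpw):4.1f} GB")
```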
Installation with SGLang (Recommended)
```bash
# Install SGLang
pip install sglang

# Launch server with optimized settings
python3 -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2-Flash \
  --served-model-name mimo-v2-flash \
  --pp-size 1 \
  --dp-size 2 \
  --enable-dp-attention \
  --tp-size 8 \
  --moe-a2a-backend deepep \
  --page-size 1 \
  --host 0.0.0.0 \
  --port 9001 \
  --trust-remote-code \
  --mem-fraction-static 0.75 \
  --max-running-requests 128 \
  --chunked-prefill-size 16384 \
  --reasoning-parser qwen3 \
  --tool-call-parser mimo \
  --context-length 262144 \
  --attention-backend fa3 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --enable-mtp
```
API Usage Example
```bash
curl -i http://localhost:9001/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are MiMo, an AI assistant developed by Xiaomi. Today'\''s date: 2025-12-17. Your knowledge cutoff date is December 2024."
      },
      {
        "role": "user",
        "content": "Write a Python function to calculate Fibonacci numbers"
      }
    ],
    "model": "mimo-v2-flash",
    "max_tokens": 4096,
    "temperature": 0.8,
    "top_p": 0.95,
    "stream": true,
    "chat_template_kwargs": {
      "enable_thinking": true
    }
  }'
```
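Since the SGLang server exposes an OpenAI-compatible endpoint, the same request can be issued from Python. This is a minimal sketch assuming the launch command above (port 9001, served model name mimo-v2-flash) and the official openai client; the chat_template_kwargs pass-through via extra_body mirrors the curl payload.

```python
# Same request as the curl example, via the OpenAI-compatible endpoint exposed
# by SGLang. Requires `pip install openai`; port and model name are assumptions
# taken from the launch command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9001/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="mimo-v2-flash",
    messages=[
        {"role": "system", "content": "You are MiMo, an AI assistant developed by Xiaomi. "
                                      "Today's date: 2025-12-17. Your knowledge cutoff date is December 2024."},
        {"role": "user", "content": "Write a Python function to calculate Fibonacci numbers"},
    ],
    max_tokens=4096,
    temperature=0.8,
    top_p=0.95,
    stream=True,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta and delta.content:
        print(delta.content, end="", flush=True)
```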
Critical Configuration Parameters
| Parameter | Recommended Value | Purpose |
|---|---|---|
| temperature | 0.8 (math/writing), 0.3 (coding/agents) | Controls randomness |
| top_p | 0.95 | Nucleus sampling threshold |
| enable_thinking | true | Activates reasoning mode |
| context_length | 262144 | Maximum context window |
| enable-mtp | true | Enables Multi-Token Prediction |
⚠️ Important: System Prompt Required
The model performs significantly better with the official system prompt. Always include identity and date information for optimal results.
Alternative Deployment Options
OpenRouter Integration:
- Free tier available directly from Xiaomi
- No local hardware required
- API-compatible with OpenAI format
llama.cpp Compatibility:
- Not officially supported at launch
- Community working on GGUF conversion
- Rare architecture may cause compatibility issues
Community Feedback and Real-World Testing
Positive Experiences
From Reddit user feedback:
"From my testing it is as good as Sonnet 4.5 easily" - u/power97992 (Portuguese comment translated)
Efficiency Praise:
- "Holy shit 309B total but only 15B active? That's actually insane efficiency if the benchmarks hold up" - u/Constant_Leader_6558
- Successfully runs on consumer hardware with quantization
- Fast inference speed maintained even with long contexts
Critical Observations
Instruction Following Issues:
"I really wished it was as good as they claimed it to be... it was all over the place, not able to follow instructions. Tool call was unreliable. Sometimes a simple 'hello how are you', gave me code in return." - u/ashim_k_saha
Benchmark Skepticism:
- "The SWE-Bench performance is suspiciously good for a model of this size" - u/r4in311
- "A pure 300B of junk. Bad instruction following, bad reasoning, the ultimate result of benchmaxxing" - u/Just_Lifeguard_5033
Comparative Performance:
- Most users report it's comparable to MiniMax 2
- Generally considered worse than DeepSeek V3.2 Speciale
- Mixed results on creative writing vs technical tasks
Aider Benchmark Results
A community-run Aider benchmark gives a view of the model's practical coding performance in a standardized test environment, though the specific scores were only shared as an image in the Reddit thread.
💡 Community Consensus
The model shows promise but exhibits inconsistency. Performance varies significantly based on task type, with stronger results in mathematical reasoning than general instruction following.
Comparison with Competitors
MiMo-V2-Flash vs DeepSeek V3.2
| Aspect | MiMo-V2-Flash | DeepSeek V3.2 |
|---|---|---|
| Total Parameters | 309B | 671B |
| Active Parameters | 15B | 37B |
| Efficiency | ✅ Higher (15B of 309B active) | Moderate (37B of 671B active) |
| Code Benchmarks | Mixed (claims better) | Proven strong |
| Community Trust | ⚠️ Skeptical | ✅ Established |
| Instruction Following | ⚠️ Reported issues | ✅ Reliable |
| Open Source | ✅ Full weights | ✅ Full weights |
MiMo-V2-Flash vs Claude Sonnet 4.5
| Aspect | MiMo-V2-Flash | Claude Sonnet 4.5 |
|---|---|---|
| Accessibility | ✅ Open source | ❌ Closed API |
| Cost | Free (self-hosted) | $3/$15 per MTok |
| SWE-Bench Multilingual | 71.7 (claimed) | 68.0 |
| SWE-Bench Verified | 73.4 | 77.2 |
| Real-world Coding | ⚠️ Inconsistent reports | ✅ Highly reliable |
| Context Window | 256K | 200K |
MiMo-V2-Flash vs Kimi K2
| Aspect | MiMo-V2-Flash | Kimi K2 |
|---|---|---|
| Active Parameters | 15B | 32B |
| AIME 2025 | 94.1 | 94.5 |
| Long Context | Good (256K) | Excellent |
| Chinese Performance | 87.4 (CMMLU) | 90.9 (CMMLU) |
| Thinking Mode | Available | Advanced |
Best Practices and Recommendations
When to Use MiMo-V2-Flash
✅ Recommended For:
- Mathematical reasoning tasks (strong AIME/MATH performance)
- Multilingual code generation
- Long-context processing (up to 256K tokens)
- Resource-constrained environments (15B active is efficient)
- Experimentation with MoE architectures
- Research on hybrid attention mechanisms
❌ Not Recommended For:
- Mission-critical production applications (inconsistent reliability)
- Complex multi-turn conversations (instruction following issues)
- Tool calling workflows (reported unreliability)
- Tasks requiring consistent output formatting
Optimization Tips
1. Temperature Tuning:
```json
{
  "math_tasks": {"temperature": 0.8, "top_p": 0.95},
  "coding_tasks": {"temperature": 0.3, "top_p": 0.95},
  "creative_writing": {"temperature": 0.8, "top_p": 0.95}
}
```
2. System Prompt Engineering:
- Always include the official system prompt
- Specify current date and knowledge cutoff
- Use clear, structured instructions
- Break complex tasks into steps
3. Context Management:
- Leverage the 256K context window for long documents
- Use chunked prefill (16384 tokens) for better performance
- Monitor KV cache usage with the hybrid attention system
4. Multi-Turn Conversations:
- Persist reasoning_content in message history (see the sketch below)
- Be explicit about conversation context
- Validate outputs before chaining requests
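A minimal multi-turn sketch of that advice is shown below: each assistant turn is appended to the history, and the reasoning trace is carried along when the server returns one. The reasoning_content field name comes from the guidance above, but the exact response schema may differ between servers, so treat this as an assumption rather than the official contract.

```python
# Multi-turn conversation that keeps `reasoning_content` in the history, per the
# guidance above. Port, model name, and field location are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9001/v1", api_key="EMPTY")
history = [{"role": "system", "content": "You are MiMo, an AI assistant developed by Xiaomi."}]

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="mimo-v2-flash", messages=history,
                                          temperature=0.3, top_p=0.95)
    msg = resp.choices[0].message
    turn = {"role": "assistant", "content": msg.content}
    # Persist the reasoning trace if the server returns one alongside the answer.
    reasoning = getattr(msg, "reasoning_content", None)
    if reasoning:
        turn["reasoning_content"] = reasoning
    history.append(turn)
    return msg.content

print(ask("Outline a plan to refactor a Flask app into blueprints."))
print(ask("Now write the blueprint for the auth routes."))
```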
Hardware Optimization
For Consumer Hardware (32GB VRAM):
- Use IQ3_XS or Q3 quantization
- Enable MTP for faster generation
- Limit concurrent requests to 2-4
- Use CPU offloading for KV cache if needed
For Enterprise Deployment:
- Deploy with 8x A100/H100 GPUs
- Use FP8 precision for optimal quality
- Enable data parallelism (dp-size 2+)
- Implement request-level prefix caching
🤔 Frequently Asked Questions
Q: Can I run MiMo-V2-Flash on consumer hardware?
A: Yes, with quantization. A setup with 2x RTX 5060 Ti (16GB each) and 128GB RAM can run Q3/Q4 quantized versions at approximately 8 tokens/second. However, llama.cpp support is not guaranteed due to the model's unique architecture.
Q: Is MiMo-V2-Flash better than DeepSeek V3.2?
A: Mixed results. Official benchmarks show competitive or superior performance, but community testing suggests DeepSeek V3.2 (especially the Speciale variant) is more reliable for general use. MiMo excels in specific areas like multilingual coding and mathematical reasoning but struggles with instruction following.
Q: Why are the benchmark scores controversial?
A: The community has raised concerns because:
- Scores seem unusually high for a 15B active parameter model
- Real-world testing doesn't consistently match benchmark claims
- Some users report basic instruction-following failures
- Potential benchmark contamination or overfitting suspected
Q: What does "Flash" mean in the model name?
A: "Flash" refers to inference speed, not model size. The MoE architecture activates only 15B of 309B total parameters per request, combined with Multi-Token Prediction that triples generation speed, making it faster than traditional dense models of similar quality.
Q: How does the Hybrid Sliding Window Attention work?
A: The model uses a 5:1 ratio of local (128-token window) to global attention layers. This reduces KV cache memory by ~6x while maintaining long-context performance through learnable attention sink bias. It's an aggressive optimization that trades some global context awareness for efficiency.
Q: Is there llama.cpp support?
A: Not officially at launch. The model uses a rare configuration (48 layers, short SWA window, MoE architecture) that may not be immediately compatible with llama.cpp. The community is working on GGUF conversion, but compatibility isn't guaranteed.
Q: What's the recommended temperature setting?
A:
- 0.8 for math, writing, and web development
- 0.3 for agentic tasks (coding, tool use)
- Always use top_p=0.95
Q: Can I use it for free?
A: Yes, in two ways:
- Self-hosting: Download weights from Hugging Face and run locally
- OpenRouter: Free tier available, hosted by Xiaomi
Q: What's Multi-Teacher On-Policy Distillation (MOPD)?
A: MOPD is Xiaomi's training innovation where:
- Multiple expert models provide token-level guidance
- The student model learns from its own generations (on-policy)
- Eliminates exposure bias from fixed datasets
- Provides natural resistance to reward hacking
Q: How does it compare to GPT-4 or Claude for coding?
A: Official benchmarks claim competitive or superior performance on specific coding tasks (especially SWE-Bench Multilingual). However, community consensus suggests it's less reliable than Claude Sonnet 4.5 or GPT-4 for production coding work, with better results on mathematical/algorithmic problems than general software engineering.
Conclusion and Recommendations
MiMo-V2-Flash represents an ambitious entry into the open-source LLM space, showcasing innovative architectural choices and impressive benchmark numbers. However, the gap between claimed performance and community testing results warrants careful evaluation.
Final Verdict
Strengths:
- ✅ Exceptional efficiency (15B active parameters)
- ✅ Strong mathematical reasoning capabilities
- ✅ Innovative hybrid attention architecture
- ✅ Fully open source with detailed documentation
- ✅ Competitive performance on specific benchmarks
Weaknesses:
- ⚠️ Inconsistent instruction following
- ⚠️ Unreliable tool calling
- ⚠️ Benchmark claims don't fully match real-world testing
- ⚠️ Limited ecosystem support (no llama.cpp yet)
Who Should Use It?
Ideal Users:
- Researchers exploring MoE architectures
- Developers with specific math/coding tasks
- Teams with resources to fine-tune and validate
- Enthusiasts experimenting with efficient LLMs
Should Wait:
- Production applications requiring reliability
- Teams without validation resources
- Users expecting Claude/GPT-4 level consistency
Next Steps
- Try the Free API: Test via OpenRouter before committing to self-hosting
- Run Specific Benchmarks: Validate performance on your actual use cases
- Monitor Community Updates: Architecture may improve with community contributions
- Compare Alternatives: Evaluate against DeepSeek V3.2, Qwen, and other open models
- Join the Community: Participate in Xiaomi's WeChat groups or GitHub discussions
Resources
- Official Repository: Hugging Face - XiaomiMiMo/MiMo-V2-Flash
- Technical Report: Available in the model repository
- Community Discussion: r/LocalLLaMA on Reddit
- Contact: mimo@xiaomi.com
Last Updated: December 2025 | Model Version: MiMo-V2-Flash | Status: Open Source