
DeepSeek OCR 2: Complete Guide to Running & Fine-tuning in 2026

🎯 Core Highlights (TL;DR)

  • Revolutionary Architecture: DeepSeek OCR 2 introduces DeepEncoder V2 with human-like visual reading order, achieving SOTA performance on document understanding
  • Lightweight & Powerful: Only 3B parameters but outperforms larger models on complex layouts, tables, and mixed text-structure documents
  • Easy Deployment: Run locally via vLLM, Transformers, or Unsloth with comprehensive fine-tuning support
  • Proven Results: Fine-tuning for a new language (Persian) yields an 88.6% overall improvement and a 57-86% reduction in Character Error Rate (CER)
  • Open Source: Fully available on Hugging Face with detailed documentation and community support

Table of Contents

  1. What is DeepSeek OCR 2?
  2. Key Features & Architecture
  3. DeepSeek OCR 2 vs Other OCR Solutions
  4. How to Run DeepSeek OCR 2
  5. Fine-tuning Guide
  6. Performance Benchmarks
  7. Community Feedback & Real-world Usage
  8. FAQ
  9. Conclusion & Next Steps

What is DeepSeek OCR 2?

DeepSeek OCR 2 is a state-of-the-art 3B-parameter vision-language model released on January 27, 2026, by DeepSeek AI. Unlike traditional OCR systems that merely extract text, DeepSeek OCR 2 focuses on image-to-text with stronger visual reasoning, enabling comprehensive document understanding.

The Innovation: DeepEncoder V2

The breakthrough lies in DeepEncoder V2, which fundamentally changes how AI "sees" documents:

Traditional Vision LLMs:

  • Scan images in fixed grid patterns (top-left → bottom-right)
  • Process visual information sequentially without context
  • Struggle with complex layouts and multi-column documents

DeepSeek OCR 2 with DeepEncoder V2:

  • Builds global understanding first, then learns human-like reading order
  • Determines what to attend to first, next, and so on
  • Excels at following columns, linking labels to values, and reading tables coherently

💡 Key Insight
DeepEncoder V2 enables the model to 'see' an image in the same logical order as a human, dramatically improving accuracy on complex layouts.

Key Features & Architecture

Core Capabilities

| Feature | Description | Benefit |
| --- | --- | --- |
| Dynamic Resolution | (0-6)×768×768 + 1×1024×1024 | Handles various document sizes efficiently |
| Visual Tokens | (0-6)×144 + 256 tokens | Optimized memory usage |
| Human-like Reading | DeepEncoder V2 architecture | Superior layout understanding |
| Compact Size | Only 3B parameters | Fast inference, low resource requirements |
| Multi-format Support | Images, PDFs, documents | Versatile application scenarios |
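
The visual-token budget in the table above can be estimated directly from the dynamic-resolution scheme. The sketch below assumes n local 768×768 crops (0 to 6) at 144 tokens each plus one 1024×1024 global view at 256 tokens; the exact tiling decision is made internally by the model.

# Rough visual-token budget implied by the dynamic-resolution scheme above.
# Assumption: n local 768x768 crops (0-6) at 144 tokens each, plus one
# 1024x1024 global view at 256 tokens; the model chooses the tiling itself.
def estimate_visual_tokens(num_crops: int) -> int:
    if not 0 <= num_crops <= 6:
        raise ValueError("num_crops must be between 0 and 6")
    return num_crops * 144 + 256

for n in range(7):
    print(f"{n} crops -> {estimate_visual_tokens(n)} visual tokens")  # 256 up to 1120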

Supported Modes

DeepSeek OCR 2 supports multiple operation modes:

  • Document Mode: <image>\n<|grounding|>Convert the document to markdown.
  • Free OCR: <image>\nFree OCR. (without layout preservation)
  • Figure Parsing: <image>\nParse the figure.
  • General Vision: <image>\nDescribe this image in detail.
  • Recognition: <image>\nLocate <|ref|>xxxx<|/ref|> in the image.

⚠️ Important Note
For best results with structured documents, use the <|grounding|> tag to enable layout-aware processing.
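
For convenience, the documented prompts can be collected into a small mapping and selected by mode name. This is only a sketch; the dictionary name and keys are illustrative, and just the prompt strings come from the list above.

# Prompt strings for the documented modes (dictionary and keys are illustrative).
MODE_PROMPTS = {
    "document": "<image>\n<|grounding|>Convert the document to markdown.",
    "free_ocr": "<image>\nFree OCR.",
    "figure": "<image>\nParse the figure.",
    "describe": "<image>\nDescribe this image in detail.",
    "locate": "<image>\nLocate <|ref|>xxxx<|/ref|> in the image.",
}

prompt = MODE_PROMPTS["document"]  # layout-aware conversion for structured documents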

DeepSeek OCR 2 vs Other OCR Solutions

Comprehensive Comparison

| Solution | Type | Strengths | Limitations | Best For |
| --- | --- | --- | --- | --- |
| DeepSeek OCR 2 | Open-source VLM | Human-like reading order, SOTA accuracy, fine-tunable | Requires GPU for optimal performance | Complex documents, research, custom applications |
| MistralOCR | Closed-source API | Extremely fast, excellent structure preservation | Not open-source, API-dependent | Production pipelines, public documents |
| PaddleOCR-VL | Open-source | Strong performance, comprehensive pipeline | Complex setup, steep learning curve | Enterprise deployments |
| Gemini Flash | Multimodal LLM | Phenomenal accuracy, semantic understanding | API costs, privacy concerns | General-purpose OCR with reasoning |
| GPT-4o/Claude | Multimodal LLM | Excellent reasoning, conversational | Expensive, may hallucinate | Interactive document analysis |

Community Insights

Based on Reddit r/LocalLLaMA discussions:

MistralOCR User Pipeline:

MistralOCR (structure extraction) 
  → Qwen3-VL (semantic descriptions) 
  → Devstral (markdown cleanup) 
  → Kimi-K2 (summarization) 
  → Qwen3 (embeddings) 
  → pgvector (storage)

💡 Expert Opinion
"MistralOCR does better than SOTA multimodal LLMs like GPT-4o/Claude because it can maintain structure and include media in the output." - LocalLLaMA community member

When to Choose DeepSeek OCR 2:

  • ✅ Need full control and customization
  • ✅ Working with sensitive/private documents
  • ✅ Require fine-tuning for specific languages or domains
  • ✅ Want to run completely offline
  • ✅ Building research or academic projects

When to Consider Alternatives:

  • 🔄 Need fastest possible inference (→ MistralOCR API)
  • 🔄 Require semantic image understanding (→ Gemini/GPT-4o)
  • 🔄 Working only with public documents (→ Cloud APIs)

How to Run DeepSeek OCR 2

System Requirements

Minimum Requirements:

  • Python 3.12.9+
  • CUDA 11.8+ (NVIDIA GPU)
  • 8GB+ VRAM (for 4-bit quantization)
  • 16GB+ VRAM (for full precision)

Tested Environment:

torch==2.6.0
transformers==4.46.3
tokenizers==0.20.3
flash-attn==2.7.3
einops, addict, easydict

Method 1: vLLM (Best for Production)

Advantages: Fastest inference, batch processing, production-ready

Installation

uv venv
source .venv/bin/activate

# Install vLLM nightly build (until v0.11.1 release)
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Usage Code

from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

# Create model instance
llm = LLM(
    model="unsloth/DeepSeek-OCR-2",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor]
)

# Prepare batched input
image_1 = Image.open("document1.png").convert("RGB")
image_2 = Image.open("document2.png").convert("RGB")

prompt = "<image>\n<|grounding|>Convert the document to markdown."

model_input = [
    {"prompt": prompt, "multi_modal_data": {"image": image_1}},
    {"prompt": prompt, "multi_modal_data": {"image": image_2}}
]

sampling_param = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td>
    ),
    skip_special_tokens=False,
)

# Generate output
model_outputs = llm.generate(model_input, sampling_param)

for output in model_outputs:
    print(output.outputs[0].text)

💡 Pro Tip
The NGramPerReqLogitsProcessor prevents repetition issues (similar to Whisper's failure mode) and improves output quality.
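
The actual processor ships with vLLM (vllm.model_executor.models.deepseek_ocr). The sketch below only illustrates the idea of window-based n-gram blocking and is not vLLM's implementation: if the current token prefix already appeared earlier in a recent window, the token that previously followed it is banned, unless it is whitelisted (for example table tags such as <td>/</td>, which legitimately repeat).

# Conceptual sketch of n-gram repetition blocking (NOT vLLM's code).
# Ban the token that followed an earlier occurrence of the current
# (ngram_size - 1)-token prefix within the recent window, unless whitelisted.
def banned_next_tokens(tokens, ngram_size=30, window_size=90, whitelist=frozenset()):
    banned = set()
    if len(tokens) < ngram_size:
        return banned
    prefix = tuple(tokens[-(ngram_size - 1):])
    window = tokens[-window_size:]
    for i in range(len(window) - ngram_size + 1):
        if tuple(window[i:i + ngram_size - 1]) == prefix:
            candidate = window[i + ngram_size - 1]
            if candidate not in whitelist:
                banned.add(candidate)
    return banned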

Method 2: Hugging Face Transformers (Most Flexible)

Advantages: Full control, easy debugging, research-friendly

Installation

pip install torch==2.6.0 transformers==4.46.3 tokenizers==0.20.3
pip install einops addict easydict
pip install flash-attn==2.7.3 --no-build-isolation

Usage Code

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'

model_name = 'deepseek-ai/DeepSeek-OCR-2'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation='flash_attention_2',
    trust_remote_code=True,
    use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Run inference
prompt = "<image>\n<|grounding|>Convert the document to markdown."
image_file = 'your_image.jpg'
output_path = 'output_directory'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True
)

Method 3: Unsloth (Best for Fine-tuning)

Advantages: 1.4x faster training, 40% less VRAM, 5x longer context

Installation

pip install --upgrade unsloth

# Force update if already installed
pip install --upgrade --force-reinstall --no-deps --no-cache-dir unsloth unsloth_zoo

Usage Code

from unsloth import FastVisionModel
import torch
from transformers import AutoModel
import os

os.environ["UNSLOTH_WARN_UNINITIALIZED"] = '0'

from huggingface_hub import snapshot_download
snapshot_download("unsloth/DeepSeek-OCR-2", local_dir="deepseek_ocr")

model, tokenizer = FastVisionModel.from_pretrained(
    "./deepseek_ocr",
    load_in_4bit=False,  # Set True for 4-bit quantization
    auto_model=AutoModel,
    trust_remote_code=True,
    unsloth_force_compile=True,
    use_gradient_checkpointing="unsloth",
)

prompt = "<image>\nFree OCR."
image_file = 'your_image.jpg'
output_path = 'output_directory'

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True
)

⚠️ ROCm/Vulkan Support
As of January 2026, DeepSeek OCR 2 primarily supports NVIDIA GPUs. AMD ROCm and Vulkan support is under community development.

Fine-tuning Guide

Why Fine-tune DeepSeek OCR 2?

Fine-tuning enables:

  • Language Adaptation: Support for non-English languages (e.g., Persian, Chinese, Arabic)
  • Domain Specialization: Medical records, legal documents, handwritten notes
  • Format Optimization: Custom output formats, specific markdown styles
  • Accuracy Improvement: 57-86% reduction in Character Error Rate (CER)

Performance Improvements

Persian Language Fine-tuning Results:

| Metric | Before Fine-tuning | After Fine-tuning | Improvement |
| --- | --- | --- | --- |
| OCR 1 CER | 1.4866 | 0.6409 | -57% |
| OCR 2 CER | 4.1863 | 0.6018 | -86% |
| Overall | - | - | 88.6% improvement |
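
CER here is the standard character-level metric: the edit distance between the model output and the reference text, divided by the reference length (it can exceed 1.0 when the output is much longer than the reference, as with repetition failures). Below is a minimal sketch for sanity-checking your own runs; it is not the evaluation script behind the Persian numbers above.

# Character Error Rate = Levenshtein distance / reference length.
def cer(reference: str, hypothesis: str) -> float:
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, 1)

print(cer("invoice total 120.00", "invoice total 12O.00"))  # 0.05 (1 error / 20 chars)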

Step-by-Step Fine-tuning Process

1. Prepare Your Dataset

# Dataset format example
dataset = [
    {
        "image": "path/to/image1.jpg",
        "text": "Expected OCR output...",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown."
    },
    # More examples...
]
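
If you train with Hugging Face tooling, a list in this format can be wrapped in a datasets.Dataset, with the image paths decoded lazily via the Image feature. This is a sketch under that assumption; the field names mirror the example above and may need to match whatever collator your trainer uses.

from datasets import Dataset, Image

# Wrap examples (format shown above) in a Dataset; cast the path column so
# items decode to PIL images on access. Field names follow the example.
examples = [
    {
        "image": "path/to/image1.jpg",
        "text": "Expected OCR output...",
        "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
    },
]
train_dataset = Dataset.from_list(examples).cast_column("image", Image())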

2. Use Unsloth Free Colab Notebook

Access the official notebook: Unsloth DeepSeek OCR Fine-tuning

3. Configure Training Parameters

from unsloth import FastVisionModel
from trl import SFTTrainer
from transformers import TrainingArguments

# Load model for training
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/DeepSeek-OCR-2",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Configure LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./deepseek_ocr_finetuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)

4. Train and Evaluate

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    args=training_args,
)
trainer.train()

model.save_pretrained("./final_model")

💡 Best Practice
Start with 100-500 high-quality examples. More data isn't always better—focus on diversity and accuracy.

Fine-tuning Benefits Summary

| Aspect | Benefit |
| --- | --- |
| Training Speed | 1.4x faster than standard methods |
| Memory Usage | 40% less VRAM required |
| Context Length | 5x longer sequences supported |
| Accuracy | No degradation vs full fine-tuning |
| Cost | Free Colab notebook available |

Performance Benchmarks

OmniDocBench v1.5 Results

Table 1: Comprehensive Document Reading Evaluation

| Model | Visual Tokens | Reading Order | Overall Score | Complex Layouts | Tables | Math |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSeek OCR 2 | 256-1120 | ✅ Human-like | SOTA | Excellent | Excellent | Excellent |
| DeepSeek OCR 1 | 256-1120 | ❌ Grid-based | High | Good | Good | Good |
| PaddleOCR-VL | Variable | ❌ Grid-based | High | Good | Very Good | Good |
| GPT-4o | High | ❌ Grid-based | High | Good | Good | Very Good |
| Claude 3.5 | High | ❌ Grid-based | High | Very Good | Good | Good |

Real-world Performance Insights

Community Testing Results:

  1. Document Skew Handling:

    • ✅ Handles 90°/180°/270° rotations reliably
    • ⚠️ Minor tilts/skews may reduce accuracy (preprocessing recommended)
  2. Repetition Issues:

    • Similar to Whisper, may repeat text in failure modes
    • Mitigated by NGramPerReqLogitsProcessor in vLLM
  3. Speed Comparison:

    • MistralOCR API: Fastest (cloud-based)
    • DeepSeek OCR 2 (vLLM): Fast (local, batched)
    • DeepSeek OCR 2 (Transformers): Moderate (local, flexible)
    • Local alternatives: Several orders of magnitude slower

💡 Expert Insight
"For pure text OCR, most recent models are nearly flawless. DeepSeek OCR 2 excels at complex math formatting and not hallucinating content." - LocalLLaMA community

Community Feedback & Real-world Usage

Positive Experiences

From r/LocalLLaMA users:

"My experience with [DeepSeek OCR 2] has been phenomenal. Though it's amazing, I also noticed that there is a failure mode which causes the model to repeat itself (like Whisper), not sure of the cause but something to take note of." - u/Pvt_Twinkietoes

"It is truly an amazing model and very grateful they open sourced it." - Community member

Production Pipelines

Advanced OCR Pipeline Example:

Input Document
    ↓
DeepSeek OCR 2 (structure + text extraction)
    ↓
Qwen3-VL (semantic figure descriptions)
    ↓
Devstral (markdown standardization)
    ↓
Kimi-K2 (summarization)
    ↓
Qwen3 (embeddings generation)
    ↓
pgvector (vector storage)

Use Case Scenarios

| Use Case | Recommended Setup | Why |
| --- | --- | --- |
| Research Papers | DeepSeek OCR 2 + Local | Privacy, math accuracy, custom formatting |
| Business Documents | MistralOCR API | Speed, reliability, structure preservation |
| Medical Records | DeepSeek OCR 2 Fine-tuned | Privacy, domain adaptation, compliance |
| Handwritten Notes | Gemini Flash / GPT-4o | Superior handwriting recognition |
| Multilingual Docs | DeepSeek OCR 2 Fine-tuned | Language adaptation, offline capability |
| High-volume Processing | vLLM + DeepSeek OCR 2 | Batch processing, cost-effective |

🤔 FAQ

Q: How does DeepSeek OCR 2 compare to GPT-4o for OCR tasks?

A: DeepSeek OCR 2 excels at structure preservation and doesn't hallucinate, while GPT-4o is better for semantic understanding and handwritten text. For pure document OCR with layout preservation, DeepSeek OCR 2 is often superior and runs locally. For interactive analysis requiring reasoning, GPT-4o is better.

Q: Can I run DeepSeek OCR 2 on AMD GPUs or Apple Silicon?

A: As of January 2026, official support is for NVIDIA GPUs with CUDA. ROCm (AMD) and Vulkan support are under community development. For Apple Silicon, you may need to use CPU inference (significantly slower) or wait for Metal backend support.

Q: What's the difference between "Free OCR" and "grounding" modes?

A:

  • Free OCR (<image>\nFree OCR.): Extracts text without preserving layout
  • Grounding mode (<image>\n<|grounding|>Convert the document to markdown.): Preserves document structure, tables, and formatting

Use grounding mode for documents where layout matters.

Q: How much VRAM do I need?

A:

  • 4-bit quantization: 8GB VRAM (RTX 3070 or better)
  • Full precision (bfloat16): 16GB VRAM (RTX 4080 or better)
  • Fine-tuning: 24GB+ VRAM recommended (or use Unsloth optimizations)
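
These figures are roughly consistent with simple weight arithmetic: about 2 bytes per parameter in bfloat16 and about 0.5 bytes in 4-bit, with the rest of the budget going to activations, KV cache, and framework overhead.

# Back-of-the-envelope weight memory for a 3B-parameter model.
params = 3e9
print(f"bf16 weights : {params * 2 / 1e9:.1f} GB")    # ~6.0 GB, fits in 16 GB with headroom
print(f"4-bit weights: {params * 0.5 / 1e9:.1f} GB")  # ~1.5 GB, fits in 8 GB with headroom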

Q: Is DeepSeek OCR 2 better than PaddleOCR?

A: According to benchmarks, DeepSeek OCR 2 achieves higher accuracy on complex layouts. However, PaddleOCR has a more mature ecosystem and production pipeline. DeepSeek OCR 2 is easier to get started with but PaddleOCR may be better for large-scale deployments with existing infrastructure.

Q: Can I use DeepSeek OCR 2 for commercial projects?

A: Yes, DeepSeek OCR 2 is open-source. Check the license on the Hugging Face repository for specific terms.

Q: How do I handle PDF files?

A: DeepSeek OCR 2 processes images. For PDFs:

  1. Convert PDF pages to images (using pdf2image or similar)
  2. Process each image with DeepSeek OCR 2
  3. Combine results

See the GitHub repository for PDF processing utilities.
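
A minimal sketch of that loop using the pdf2image package (which requires poppler to be installed); it assumes model and tokenizer are already loaded as in the Transformers example earlier, and the file names are illustrative.

from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)

# 1) Render PDF pages to images, 2) run OCR per page, 3) collect the results.
# Assumes `model` and `tokenizer` are loaded as in the Transformers example.
pages = convert_from_path("report.pdf", dpi=200)
prompt = "<image>\n<|grounding|>Convert the document to markdown."

results = []
for i, page in enumerate(pages):
    page_path = f"page_{i:03d}.png"
    page.convert("RGB").save(page_path)
    res = model.infer(
        tokenizer,
        prompt=prompt,
        image_file=page_path,
        output_path="output_directory",
        base_size=1024,
        image_size=768,
        crop_mode=True,
        save_results=True,
    )
    results.append(res)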

Q: What's the best way to improve accuracy for my specific documents?

A:

  1. Preprocessing: Ensure images are properly oriented and high-resolution
  2. Prompt engineering: Use appropriate prompts (<|grounding|> for structured docs)
  3. Fine-tuning: Create 100-500 examples of your document type and fine-tune
  4. Post-processing: Use LLMs to clean up minor errors
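
For step 1 (preprocessing), here is a small Pillow sketch that honours EXIF orientation and upscales small scans before OCR; the 1024 px minimum side is an illustrative threshold, not an official recommendation.

from PIL import Image, ImageOps

# Fix EXIF orientation and upscale small scans before OCR.
def preprocess(path: str, min_side: int = 1024) -> Image.Image:
    img = ImageOps.exif_transpose(Image.open(path)).convert("RGB")
    if min(img.size) < min_side:
        scale = min_side / min(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.Resampling.LANCZOS)
    return img

preprocess("your_image.jpg").save("preprocessed.png")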

Conclusion & Next Steps

Key Takeaways

  • DeepSeek OCR 2 represents a significant leap in document understanding with its human-like visual reading order via DeepEncoder V2
  • Multiple deployment options (vLLM, Transformers, Unsloth) make it accessible for various use cases
  • Fine-tuning enables 57-86% reductions in Character Error Rate for specialized domains and languages
  • Its open-source nature provides full control, privacy, and customization
  • It delivers competitive, SOTA-level results on complex layouts, tables, and mathematical content

For Researchers & Developers:

  1. ✅ Start with the Hugging Face Transformers implementation for flexibility
  2. ✅ Test on your specific document types
  3. ✅ If accuracy is insufficient, prepare a fine-tuning dataset
  4. ✅ Use Unsloth's free Colab notebook for efficient fine-tuning
  5. ✅ Deploy with vLLM for production workloads

For Production Deployments:

  1. ✅ Benchmark DeepSeek OCR 2 vs MistralOCR for your use case
  2. ✅ Consider privacy requirements (local vs API)
  3. ✅ Set up vLLM for batch processing
  4. ✅ Implement preprocessing (rotation detection, image enhancement)
  5. ✅ Build post-processing pipeline for consistency

For Privacy-sensitive Applications:

  1. ✅ Deploy DeepSeek OCR 2 locally with Transformers or vLLM
  2. ✅ Fine-tune on your specific document types
  3. ✅ Implement secure document handling workflows
  4. ✅ Consider air-gapped deployment for maximum security

Resources

Final Thoughts

DeepSeek OCR 2 democratizes advanced document understanding technology, making it accessible to researchers, developers, and organizations of all sizes. Its combination of cutting-edge architecture, practical performance, and open-source availability positions it as a top choice for 2026 OCR projects.

Whether you're processing research papers, digitizing historical documents, or building production OCR pipelines, DeepSeek OCR 2 offers the flexibility and performance needed for modern document understanding tasks.

💡 Start Today
The fastest way to get started is using the Hugging Face Transformers implementation. Install the dependencies, download the model, and run your first OCR in under 10 minutes.


Last Updated: January 2026
Model Version: DeepSeek-OCR-2 (3B parameters)
License: Check Hugging Face repository for details
