2025 Complete Guide: How to Build End-to-End OCR with HunyuanOCR

🎯 Key Takeaways (TL;DR)

  • A single 1B multimodal architecture covers detection, recognition, parsing, translation, and more in one unified OCR pipeline.
  • Dual inference paths (vLLM + Transformers) plus well-crafted prompts make rapid production deployment straightforward.
  • In-house benchmarks show consistent gains over traditional OCR and general-purpose VLMs across spotting, document parsing, and information extraction.

Table of Contents

  1. What Is HunyuanOCR?
  2. Why Is HunyuanOCR So Strong?
  3. How to Deploy HunyuanOCR Quickly?
  4. How to Design Business-Ready Prompts?
  5. What Performance Evidence Exists?
  6. How Does the Inference Flow Work?
  7. FAQ
  8. Summary & Action Plan

What Is HunyuanOCR?

HunyuanOCR is Tencent Hunyuan’s end-to-end OCR-specific vision-language model (VLM). Built on a native multimodal architecture with only 1B parameters, it reaches state-of-the-art results on text spotting, complex document parsing, open-field information extraction, subtitle extraction, and image translation.

Best Practice
Whenever you must process multilingual, multimodal content with complex layouts in one shot, prioritize a “single-prompt + single-inference” end-to-end model to cut pipeline latency drastically.

HunyuanOCR Overview

Why Is HunyuanOCR So Strong?

Lightweight, Full-Modal Coverage

  • 1B native multimodal design: achieves SOTA quality with a self-developed training strategy while keeping inference cost low.
  • Task completeness: detection, recognition, parsing, info extraction, subtitles, and translation all handled within one model.
  • Language breadth: supports 100+ languages across documents, street views, handwriting, tickets, etc.

True End-to-End Experience

  • Single prompt → single inference: avoids cascading OCR error accumulation.
  • Flexible output: coordinates, LaTeX, HTML, Mermaid, Markdown, JSON—choose whatever structure you need.
  • Video-friendly: extracts bilingual subtitles directly for downstream translation or editing.

Multi-Scenario Performance

💡 Pro Tip
Tailor prompts to your business format (HTML tables, JSON fields, bilingual subtitles) to unleash structured outputs from the end-to-end pipeline.

How to Deploy HunyuanOCR Quickly?

System Requirements

  • OS: Linux
  • Python: 3.12+
  • CUDA: 12.8
  • PyTorch: 2.7.1
  • GPU: NVIDIA CUDA GPU with 80GB memory
  • Disk: 6GB

vLLM Deployment (Recommended)

  1. pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
  2. Load tencent/HunyuanOCR plus AutoProcessor.
  3. Build messages containing image + instruction, then call apply_chat_template for the prompt.
  4. Configure SamplingParams(temperature=0, max_tokens=16384).
  5. Invoke llm.generate and run post-processing (e.g., clean_repeated_substrings).
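
The five steps above map onto a short script. Below is a minimal sketch assuming vLLM's standard multimodal interface; the exact message schema, image-passing keys, and post-processing helpers may differ slightly from the official README scripts.

```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

MODEL_PATH = "tencent/HunyuanOCR"

# Step 2: load the model in vLLM and the matching processor.
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
llm = LLM(model=MODEL_PATH, trust_remote_code=True)

# Step 3: build an image + instruction message and render the chat prompt.
image = Image.open("invoice.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the text in the image."},
    ],
}]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Step 4: greedy decoding with a large token budget for long documents.
sampling = SamplingParams(temperature=0, max_tokens=16384)

# Step 5: generate, then post-process (e.g. with the README's
# clean_repeated_substrings helper).
outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling,
)
print(outputs[0].outputs[0].text)
```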

Transformers Deployment

  1. Install the pinned branch: pip install git+https://github.com/huggingface/transformers@82a06d...
  2. Use HunYuanVLForConditionalGeneration with AutoProcessor.
  3. Call model.generate(..., max_new_tokens=16384, do_sample=False).
  4. Note: This path currently trails vLLM in performance (official fix in progress).
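
A comparable sketch for the Transformers path, assuming the pinned branch exposes HunYuanVLForConditionalGeneration at the top level and that the processor accepts text/images keyword arguments; adjust to the README scripts if the signatures differ.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, HunYuanVLForConditionalGeneration

MODEL_PATH = "tencent/HunyuanOCR"

processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = HunYuanVLForConditionalGeneration.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,   # README default
    device_map="auto",            # see the Heads-up below for multi-GPU setups
    trust_remote_code=True,
)

image = Image.open("invoice.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Extract the text in the image."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Deterministic decoding, mirroring the vLLM settings above.
output_ids = model.generate(**inputs, max_new_tokens=16384, do_sample=False)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```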

⚠️ Heads-up
README scripts default to bfloat16 and device_map="auto". In multi-GPU setups, ensure memory sharding is deliberate to avoid distributed OOM.
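
Two hedged ways to make the sharding deliberate (illustrative settings, not official recommendations): pin the 1B model to a single device, or cap per-device memory explicitly.

```python
import torch
from transformers import HunYuanVLForConditionalGeneration

# Option A: keep the whole model on cuda:0 (a 1B model fits comfortably on one card).
model = HunYuanVLForConditionalGeneration.from_pretrained(
    "tencent/HunyuanOCR",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    trust_remote_code=True,
)

# Option B: let accelerate shard, but cap the memory budget per device.
model = HunYuanVLForConditionalGeneration.from_pretrained(
    "tencent/HunyuanOCR",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "70GiB", 1: "70GiB"},
    trust_remote_code=True,
)
```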

How to Design Business-Ready Prompts?

Task Prompt Cheat Sheet

| Task | English Prompt | Chinese Prompt |
| --- | --- | --- |
| Spotting | Detect and recognize text in the image, and output the text coordinates in a formatted manner. | 检测并识别图片中的文字,将文本坐标格式化输出。 |
| Document Parsing | Identify formulas (LaTeX), tables (HTML), flowcharts (Mermaid), and parse body text in reading order. | 识别图片中的公式/表格/图表并按要求输出。 |
| General Parsing | Extract the text in the image. | 提取图中的文字。 |
| Information Extraction | Extract specified fields in JSON; extract subtitles. | 提取字段并按 JSON 返回;提取字幕。 |
| Translation | First extract text, then translate; formulas → LaTeX, tables → HTML. | 先提取文字,再翻译;公式用 LaTeX,表格用 HTML。 |

Prompting Principles

  1. Structure first: explicitly request JSON/HTML/Markdown to reduce post-processing.
  2. Field enumeration: list all keys for information extraction to avoid missing items.
  3. Language constraints: specify target language for translation/subtitle tasks.
  4. Redundancy cleanup: apply substring dedupe helpers on long outputs.
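
For example, a field-enumeration prompt that applies principles 1–3 might look like this (the field names are hypothetical; tailor them to your documents):

```python
# Hypothetical information-extraction prompt: explicit JSON structure,
# every expected key enumerated, and an output-language constraint.
prompt = (
    "Extract the following fields from the receipt and return JSON only, "
    'using exactly these keys: {"merchant": "", "date": "", "currency": "", '
    '"total": "", "items": []}. Return dates in ISO 8601 and keep item names '
    "in their original language."
)
```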

Information Extraction Sample

What Performance Evidence Exists?

Text Spotting & Recognition (In-house Benchmark)

| Model Type | Method | Overall | Art | Doc | Game | Hand | Ads | Receipt | Screen | Scene | Video |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Traditional | PaddleOCR | 53.38 | 32.83 | 70.23 | 51.59 | 56.39 | 57.38 | 50.59 | 63.38 | 44.68 | 53.35 |
| Traditional | BaiduOCR | 61.90 | 38.50 | 78.95 | 59.24 | 59.06 | 66.70 | 63.66 | 68.18 | 55.53 | 67.38 |
| General VLM | Qwen3VL-2B-Instruct | 29.68 | 29.43 | 19.37 | 20.85 | 50.57 | 35.14 | 24.42 | 12.13 | 34.90 | 40.10 |
| General VLM | Qwen3VL-235B-Instruct | 53.62 | 46.15 | 43.78 | 48.00 | 68.90 | 64.01 | 47.53 | 45.91 | 54.56 | 63.79 |
| General VLM | Seed-1.6-Vision | 59.23 | 45.36 | 55.04 | 59.68 | 67.46 | 65.99 | 55.68 | 59.85 | 53.66 | 70.33 |
| OCR VLM | HunyuanOCR | 70.92 | 56.76 | 73.63 | 73.54 | 77.10 | 75.34 | 63.51 | 76.58 | 64.56 | 77.31 |

Document Parsing (OmniDocBench + Multilingual Benchmarks)

| Type | Method | Size | Omni Overall | Omni Text | Omni Formula | Omni Table | Wild Overall | Wild Text | Wild Formula | Wild Table | DocML |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| General VLM | Gemini-2.5-Pro | - | 88.03 | 0.075 | 85.92 | 85.71 | 80.59 | 0.118 | 75.03 | 78.56 | 82.64 |
| General VLM | Qwen3-VL-235B | 235B | 89.15 | 0.069 | 88.14 | 86.21 | 79.69 | 0.090 | 80.67 | 68.31 | 81.40 |
| Modular VLM | MonkeyOCR-pro-3B | 3B | 88.85 | 0.075 | 87.50 | 86.78 | 70.00 | 0.211 | 63.27 | 67.83 | 56.50 |
| Modular VLM | MinerU2.5 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 70.91 | 0.218 | 64.37 | 70.15 | 52.05 |
| Modular VLM | PaddleOCR-VL | 0.9B | 92.86 | 0.035 | 91.22 | 90.89 | 72.19 | 0.232 | 65.54 | 74.24 | 57.42 |
| End-to-End VLM | Mistral-OCR | - | 78.83 | 0.164 | 82.84 | 70.03 | - | - | - | - | 64.71 |
| End-to-End VLM | DeepSeek-OCR | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 74.23 | 0.178 | 70.07 | 70.41 | 57.22 |
| End-to-End VLM | dots.ocr | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 78.01 | 0.121 | 74.23 | 71.89 | 77.50 |
| End-to-End VLM | HunyuanOCR | 1B | 94.10 | 0.042 | 94.73 | 91.81 | 85.21 | 0.081 | 82.09 | 81.64 | 91.03 |

Information Extraction & VQA

| Model | Cards | Receipts | Video Subtitles | OCRBench |
| --- | --- | --- | --- | --- |
| DeepSeek-OCR | 10.04 | 40.54 | 5.41 | 430 |
| PP-ChatOCR | 57.02 | 50.26 | 3.10 | - |
| Qwen3-VL-2B | 67.62 | 64.62 | 3.75 | 858 |
| Seed-1.6-Vision | 70.12 | 67.50 | 60.45 | 881 |
| Qwen3-VL-235B | 75.59 | 78.40 | 50.74 | 920 |
| Gemini-2.5-Pro | 80.59 | 80.66 | 53.65 | 872 |
| HunyuanOCR | 92.29 | 92.53 | 92.87 | 860 |

Image Translation

| Method | Size | Other2En | Other2Zh | DoTA (en2zh) |
| --- | --- | --- | --- | --- |
| Gemini-2.5-Flash | - | 79.26 | 80.06 | 85.60 |
| Qwen3-VL-235B | 235B | 73.67 | 77.20 | 80.01 |
| Qwen3-VL-8B | 8B | 75.09 | 75.63 | 79.86 |
| Qwen3-VL-4B | 4B | 70.38 | 70.29 | 78.45 |
| Qwen3-VL-2B | 2B | 66.30 | 66.77 | 73.49 |
| PP-DocTranslation | - | 52.63 | 52.43 | 82.09 |
| HunyuanOCR | 1B | 73.38 | 73.62 | 83.48 |

💡 Pro Tip
For multilingual invoices, IDs, or subtitles, HunyuanOCR’s leadership on Cards/Receipts/Subtitles makes it a strong first choice.

How Does the Inference Flow Work?

📊 Implementation Flow

Best Practice
Add guardrails at the end (empty-output detection, JSON schema validation) to shield downstream systems from malformed results.
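
A minimal guardrail sketch (the function name and schema are illustrative, not from the README):

```python
import json

REQUIRED_KEYS = {"merchant", "date", "currency", "total", "items"}  # example schema

def validate_ocr_json(raw: str) -> dict:
    """Reject empty output and enforce the expected top-level keys."""
    if not raw or not raw.strip():
        raise ValueError("empty model output")
    data = json.loads(raw)  # raises a ValueError subclass if the model drifted off JSON
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data
```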

🤔 FAQ

Q: How much GPU memory is required?

A: 80GB is recommended for 16K-token decoding. For smaller GPUs, reduce max_tokens, downsample images, or enable tensor parallelism.
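
For instance, a configuration for a pair of smaller GPUs might look like this (illustrative starting points, not official recommendations):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tencent/HunyuanOCR",
    tensor_parallel_size=2,        # split weights and KV cache across two GPUs
    gpu_memory_utilization=0.85,   # leave headroom for the vision encoder
    trust_remote_code=True,
)
sampling = SamplingParams(temperature=0, max_tokens=4096)  # smaller decode budget than 16384
```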

Q: What’s the gap between vLLM and Transformers?

A: vLLM delivers better throughput and latency today and is the preferred path. Transformers currently lags but is ideal for custom ops or debugging until the fix lands upstream.

Q: How do I guarantee structured outputs?

A: Define the exact schema in the prompt, validate responses (regex/JSON schema), and apply helper functions like clean_repeated_substrings from the README.
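
The README's clean_repeated_substrings is the canonical cleanup helper; if you need a stand-in while wiring things up, a rough approximation that trims a trailing exact repeat could look like this (hypothetical, not the official implementation):

```python
def strip_trailing_repeat(text: str, min_len: int = 8) -> str:
    """Drop the last copy of a suffix that exactly repeats the text right before it."""
    n = len(text)
    for size in range(n // 2, min_len - 1, -1):
        if text[n - 2 * size : n - size] == text[n - size :]:
            return text[: n - size]
    return text
```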

Summary & Action Plan

  • For multilingual, multi-format OCR workloads, evaluate HunyuanOCR’s single-model pipeline first to cut architectural complexity.
  • Start with the vLLM recipe for fast PoC using the provided prompts and scripts, then iterate on prompt engineering and post-processing to meet production specs.
  • Dive deeper via the HunyuanOCR Technical Report, Hugging Face demo, or by reproducing the visual examples from the README.

Complex Document Parsing Sample



Tags:
HunyuanOCR
OCR
AI
Machine Learning
Last updated: November 25, 2025