OCR / VLM

Table of Contents (ToC)

Open-Source OCR Models and Toolkits - Document AI
Other References

Open-Source OCR Models and Toolkits - Document AI

Chandra: Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.
SmolDocling: An Ultra-Compact, Open-Source Vision-Language Model for Document Conversion
- Paper
- GitHub
- HF - Demo
- Model Card
OlmOCR: A toolkit for converting PDFs and other image-based document formats into clean, readable plain text. It supports equations, tables, handwriting, and complex formatting, and produces text in a natural reading order, even in the presence of figures, multi-column layouts, and insets.
- Paper
Dots.ocr: A layout-aware OCR model designed for dense documents like scientific papers and multilingual forms. It handles layout detection and content recognition within a single vision-language model.
DeepSeek OCR: A vision-language (VL) model designed for general multimodal understanding. It can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied-intelligence tasks in complex scenarios.
Docling: A toolkit for document parsing and conversion, including OCR for scanned content. It parses multiple document formats, incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, and images (PNG, TIFF, JPEG, …), and supports page layout, reading order, table structure, code, formulas, and image classification.
TrOCR: Microsoft’s transformer-based OCR model for printed and handwritten text. Available on Hugging Face with multilingual support.
DocTR: A PyTorch-based OCR pipeline with detection and recognition modules for invoices, forms, and structured documents.
PaddleOCR: Lightweight and multilingual, optimized for mobile and embedded systems. Features include multilingual document parsing via a 0.9B VLM, universal scene text recognition, and intelligent information extraction.
LayoutLMv3: Combines text, layout, and visual features for structured understanding of receipts, invoices, and forms.
Nanonets-OCR2: Image-to-markdown OCR model. It transforms documents into structured markdown with semantic tagging, LaTeX equation recognition, and intelligent content parsing, making it well suited to LLM-ready workflows and complex document understanding.
Qwen3-VL: Alibaba’s multimodal model with strong OCR and layout capabilities. Useful for mixed-content documents.
LLaVA-1.6: Vision-language assistant with OCR capabilities, enabling document Q&A and visual reasoning.
Gemma-3 Vision: Google’s open multimodal model tuned for document tasks and layout-aware OCR.
Mistral-OCR: A fast, lightweight model optimized for printed and handwritten text.
- Mistral - Mistral OCR
- Mistral doc
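
As a rough illustration of how one of the entries above can be called, the sketch below runs TrOCR through the Hugging Face `transformers` library. It assumes `transformers`, `torch`, and `Pillow` are installed and uses the `microsoft/trocr-base-printed` checkpoint. The helper names `load_trocr` and `recognize_line` are hypothetical, chosen here for illustration. Note that TrOCR expects an image cropped to a single line of text, so a full page would first need a separate text-detection step.

```python
from typing import Any, Tuple


def load_trocr(checkpoint: str = "microsoft/trocr-base-printed") -> Tuple[Any, Any]:
    """Load a TrOCR processor and model from the Hugging Face Hub.

    The import is done lazily so this module can be imported even when
    transformers is not installed. The checkpoint name is an assumption;
    swap in a handwritten variant if that suits the data better.
    """
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    return processor, model


def recognize_line(image: Any, processor: Any, model: Any) -> str:
    """Recognize the text in one single-line image (hypothetical helper).

    `image` is expected to be an RGB PIL image cropped to one text line.
    """
    # Resize and normalize the image into the tensor the encoder expects.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Autoregressively decode token ids from the image features.
    generated_ids = model.generate(pixel_values)
    # Convert token ids back to text, dropping special tokens.
    text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return text.strip()


# Example usage (downloads the checkpoint on first run; "line.png" is a
# placeholder path, not a file shipped with any of the projects above):
#
#   from PIL import Image
#   processor, model = load_trocr()
#   image = Image.open("line.png").convert("RGB")
#   print(recognize_line(image, processor, model))
```

Passing the processor and model in as arguments keeps the recognition step independent of how the weights are loaded, so the same function can be reused with a different TrOCR checkpoint.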

Other References

LandingAI