OCR / VLM
Open-Source OCR Models and Toolkits - Document AI
- Chandra: Chandra is a highly accurate OCR model that converts images and PDFs into structured HTML/Markdown/JSON while preserving layout information.
- SmolDocling: An ultra-compact, open-source vision-language model for document conversion.
- OlmOCR: A toolkit for converting PDFs and other image-based document formats into clean, readable plain text. It supports equations, tables, handwriting, and complex formatting, and it produces text in a natural reading order, even in the presence of figures, multi-column layouts, and insets.
- Dots.ocr: A layout-aware OCR model designed for dense documents such as scientific papers and multilingual forms; it uses LayoutLM-style models for structured understanding.
- DeepSeek OCR: A vision-language (VL) model designed for general multimodal understanding. It can process logical diagrams, web pages, formulas, scientific literature, natural images, and embodied intelligence in complex scenarios.
- Docling: A toolkit for document linguistics and OCR, tailored for historical and multilingual texts. It parses multiple document formats, including PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, and images (PNG, TIFF, JPEG, …). It supports page layout, reading order, table structure, code, formulas, and image classification.
- TrOCR: Microsoft’s transformer-based OCR model for printed and handwritten text. Available on Hugging Face with multilingual support.
- DocTR: A PyTorch-based OCR pipeline with detection and recognition modules for invoices, forms, and structured documents.
- PaddleOCR: Lightweight and multilingual, optimized for mobile and embedded systems. Features include multilingual document parsing via a 0.9B VLM, universal scene text recognition, and intelligent information extraction.
- LayoutLMv3: Combines text, layout, and visual features for structured understanding of receipts, invoices, and forms.
- Nanonets-OCR2: Image-to-markdown OCR model. It transforms documents into structured markdown with semantic tagging, LaTeX equation recognition, and intelligent content parsing, making it well suited to LLM-ready workflows and complex document understanding.
- Qwen3-VL: Alibaba’s multimodal model with strong OCR and layout capabilities. Useful for mixed-content documents.
- LLaVA-1.6: Vision-language assistant with OCR capabilities, enabling document Q&A and visual reasoning.
- Gemma-3 Vision: Google’s open multimodal model tuned for document tasks and layout-aware OCR.
- Mistral-OCR: A fast, lightweight model optimized for printed and handwritten text.
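
As an illustration of how one of the models listed above can be called, here is a minimal sketch of running TrOCR through the Hugging Face `transformers` library. It assumes `transformers`, `torch`, and `Pillow` are installed, that the `microsoft/trocr-base-printed` checkpoint is the one wanted, and that the input is a cropped image containing a single line of text. The helper name `recognize_line` is hypothetical and not part of any library.

```python
# Assumed checkpoint for printed text; handwritten variants also exist on the Hub.
DEFAULT_CHECKPOINT = "microsoft/trocr-base-printed"


def recognize_line(image_path: str, checkpoint: str = DEFAULT_CHECKPOINT) -> str:
    """Return the text read from a single-line image (hypothetical helper)."""
    # Heavy dependencies are imported lazily so this module loads without them.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    # Downloads the processor and model weights on first use.
    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

    # The processor expects an RGB image and returns normalized pixel tensors.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressively decode token ids, then map them back to a string.
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

A call such as `print(recognize_line("line.png"))`, where `line.png` is a placeholder file name, would print the recognized text. For full pages, a layout or text-detection step would normally run first to crop individual lines, since this sketch handles one line per call.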
Other References