Summaries on Multimodal LLMs for Text-rich Image Understanding

We summarize multimodal understanding papers, covering LLaVAR, TRINS, LaRA, LLaVA-Read, MMR, and SV-RAG, which focus on enhancing text-rich image comprehension.

LLaVAR: Visual Instruction Tuning for Text-Rich Images

LLaVAR is the first MLLM built for text-rich images, and it handles both text-rich and natural image understanding. LLaVAR extends the LLaVA architecture, targeting text-rich image understanding through data augmentation rather than architectural changes. Starting from 422K likely-textual images in LAION, we extracted OCR text and used GPT-4 to generate 16K multi-turn Q&A conversations, which were added to the instruction-tuning set. This dataset significantly improved performance on text-based VQA benchmarks, with accuracy gains of up to 20%.
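
The data-collection recipe can be pictured with a minimal sketch: run OCR on a likely-textual image, then prompt GPT-4 with the OCR text and the original caption to write a multi-turn conversation. The OCR engine (pytesseract here), the prompt wording, and the helper function are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of an LLaVAR-style data pipeline (illustrative, not the paper's code):
# OCR a likely-textual image, then ask GPT-4 to write multi-turn Q&A grounded in it.
import pytesseract
from PIL import Image
from openai import OpenAI

client = OpenAI()

def generate_instruction_data(image_path: str, caption: str) -> str:
    ocr_text = pytesseract.image_to_string(Image.open(image_path))  # OCR engine is an assumption
    prompt = (
        "You are given OCR text and a caption of an image.\n"
        f"OCR: {ocr_text}\nCaption: {caption}\n"
        "Write a multi-turn conversation (user questions, assistant answers) "
        "that requires reading the text in the image."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```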

LLaVAR’s experiments emphasized the importance of resolution for reading: small text is often lost with 224×224 encoders. To overcome this, we stacked CLIP encoders to simulate higher resolution and used external OCR/captioning tools to pre-process images. Text summaries were fed to the LLM alongside visual tokens, a hybrid setup that improved reading without expanding the token budget. The study influenced later designs by showing how OCR tools and data-driven tuning alone can substantially boost an MLLM’s ability to “read” without changing its backbone.
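
One way to simulate higher resolution with a fixed-resolution CLIP encoder is to encode a global view plus local crops and concatenate their features before projection. The sketch below assumes a 2×2 crop grid and the standard Hugging Face CLIP vision model; the actual stacking scheme in LLaVAR may differ.

```python
# Hedged sketch: "stack" a fixed-resolution CLIP encoder over image crops
# to approximate higher input resolution.
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

def encode_tiled(image: Image.Image, grid: int = 2) -> torch.Tensor:
    w, h = image.size
    views = [image]  # global low-resolution view
    for i in range(grid):
        for j in range(grid):
            views.append(image.crop((j * w // grid, i * h // grid,
                                     (j + 1) * w // grid, (i + 1) * h // grid)))
    inputs = processor(images=views, return_tensors="pt")
    with torch.no_grad():
        feats = encoder(**inputs).last_hidden_state  # (num_views, tokens, dim)
    # Concatenate view features; a projector would map these into LLM input tokens.
    return feats.flatten(0, 1)
```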

TRINS and LaRA: Instruction Dataset + Tool-aware Model

LLaVAR uses LLMs to generate text-rich instruction data, but the quality is not always satisfactory, since the LLMs only see OCR words and short original captions. To tackle this problem, TRINS constructs a large-scale dataset focused on text-rich images, combining human-written captions and LLM-generated QA pairs across 50K images. We spent significant effort on data sourcing from the LAION-Highres subset, using multiple models and heuristic rules.
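
To make the heuristic sourcing concrete, here is an illustrative filter in the same spirit: keep images that are large enough and contain a minimum amount of readable text. The thresholds and the use of pytesseract are assumptions for the sketch, not the rules used to build TRINS.

```python
# Illustrative heuristic filter for sourcing text-rich images (thresholds are assumptions).
import pytesseract
from PIL import Image

MIN_SIDE = 768       # assumed resolution threshold
MIN_OCR_WORDS = 10   # assumed text-density threshold

def keep_image(image_path: str) -> bool:
    img = Image.open(image_path)
    if min(img.size) < MIN_SIDE:
        return False
    words = [w for w in pytesseract.image_to_string(img).split() if w.strip()]
    return len(words) >= MIN_OCR_WORDS
```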

LaRA, built on LLaVA, uses PaddleOCR to extract text from images and injects it into the text prompt. This “OCR-as-input” strategy gives the model direct access to text that would otherwise be missed due to vision-resolution limits. The OCR results are merged with the instruction, and training freezes the vision encoder while tuning the LLM and projection layer. LaRA achieved SOTA on benchmarks like TextVQA and DocVQA, and it performs well even without OCR inputs. This success showed that combining high-quality data with lightweight text-integration strategies can deliver strong reading capabilities without rearchitecting the entire model.
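
A minimal sketch of the OCR-as-input idea: PaddleOCR text is merged into the instruction, and only the projector and language model are trained. The module attribute names in `set_trainable` are assumptions about a LLaVA-style model, not LaRA's released code.

```python
# Sketch: merge PaddleOCR output into the prompt; freeze the vision tower during tuning.
from paddleocr import PaddleOCR

ocr_engine = PaddleOCR(lang="en")

def build_prompt(image_path: str, instruction: str) -> str:
    result = ocr_engine.ocr(image_path)
    words = [line[1][0] for line in result[0]] if result and result[0] else []
    return f"OCR tokens: {' '.join(words)}\nInstruction: {instruction}"

def set_trainable(model):
    # Freeze the vision encoder; tune the projector and the language model.
    for p in model.vision_tower.parameters():      # attribute names are assumptions
        p.requires_grad = False
    for p in model.mm_projector.parameters():
        p.requires_grad = True
    for p in model.language_model.parameters():
        p.requires_grad = True
```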

TRINS is also useful for text-rich image generation and evaluation. We used this dataset to obtain SoTA MLLMs on the OCRBench leaderboard twice, in 2023 and 2024.

LLaVA-Read: Dual Visual Encoders and OCR + Layout Awareness

LLaVA-Read addresses two key MLLM limitations, low text resolution and a lack of layout awareness, by using three encoders: a low-resolution ViT-based CLIP encoder, a high-resolution Conv-based CLIP encoder, and a high-resolution OCR pipeline that serves as a visual text encoder. Visual features are merged with intermediate-layer fusion, keeping the token count constant while folding high-resolution details into the visual tokens. In the visual text encoder, OCR outputs are tokenized with special spatial markers and appended to the LLM input.
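
The two mechanisms can be sketched as follows: (1) fuse high-resolution convolutional features into the low-resolution ViT token sequence without increasing the token count, and (2) serialize OCR output with spatial markers. The shapes, the additive fusion, and the marker format are assumptions for illustration, not LLaVA-Read's exact implementation.

```python
# Hedged sketch of intermediate-layer fusion and spatially-marked OCR serialization.
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    def __init__(self, dim_vit: int = 1024, dim_conv: int = 1536):
        super().__init__()
        self.proj = nn.Linear(dim_conv, dim_vit)

    def forward(self, vit_tokens: torch.Tensor, conv_feats: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (B, N, dim_vit); conv_feats: (B, N, dim_conv), pooled to the
        # same N so the fused sequence keeps the original token budget.
        return vit_tokens + self.proj(conv_feats)

def serialize_ocr(ocr_results: list[tuple[str, tuple[int, int, int, int]]]) -> str:
    # Each OCR item is (text, (x1, y1, x2, y2)); coordinates become spatial markers.
    return " ".join(f"<box>{x1},{y1},{x2},{y2}</box> {text}"
                    for text, (x1, y1, x2, y2) in ocr_results)
```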

This dual-path setup enables the model to attend to both rich visual context and structured text information. A layout-aware training phase ensures better alignment across modalities. The model outperforms prior methods on complex benchmarks requiring both text comprehension and spatial reasoning. The hybrid design proved effective at selectively leveraging OCR for long text while still using visual cues for short labels and layout-sensitive tasks.

MMR: Benchmarking MLLM Reading Comprehension in Images

MMR was introduced to expose gaps in the reading capabilities of MLLMs, especially on tasks requiring more than simple OCR. It covers 11 task types, from spatial reasoning to font identification and text grounding, with 550 human-written Q&A pairs.
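
Scoring such a benchmark per task type is straightforward; the sketch below assumes each record carries `task`, `question`, `answer`, and `image` fields and a `model.answer(image, question)` callable, and uses simple exact-match scoring for illustration. MMR's official evaluation protocol may differ.

```python
# Sketch of per-task accuracy aggregation over a reading-comprehension benchmark.
from collections import defaultdict

def evaluate(model, records: list[dict]) -> dict[str, float]:
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        pred = model.answer(r["image"], r["question"])   # hypothetical interface
        total[r["task"]] += 1
        if pred.strip().lower() == r["answer"].strip().lower():
            correct[r["task"]] += 1
    return {task: correct[task] / total[task] for task in total}
```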

Evaluations showed that many models performed poorly on visual text grounding, layout reasoning, and comparing multiple text blocks. Even top-performing MLLMs struggled. MMR highlights how existing benchmarks underestimated the difficulty of text-rich reasoning and provides a granular framework for evaluating model improvements. Internally, it has become a key diagnostic tool for gauging the “reading IQ” of MLLMs before deployment in document-oriented applications.

SV-RAG: Efficient Long-Doc QA with Visual Retrieval

SV-RAG addresses the challenge of answering questions over multi-page documents. Instead of using a separate retriever, it trains LoRA adapters within an MLLM to handle both retrieval and answering. A shared MLLM backbone is “switched” between retrieval and QA using adapter weights, enabling end-to-end visual RAG.
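
Adapter switching on a shared backbone can be sketched with Hugging Face PEFT: load two LoRA adapters and activate one at a time. The checkpoint paths and adapter names below are placeholders, not SV-RAG's released artifacts.

```python
# Sketch: switch one backbone between retrieval and QA roles via LoRA adapters.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("path/to/mllm-backbone")       # placeholder
model = PeftModel.from_pretrained(base, "path/to/retrieval-lora",          # placeholder
                                  adapter_name="retrieval")
model.load_adapter("path/to/qa-lora", adapter_name="qa")                   # placeholder

model.set_adapter("retrieval")   # rank pages with the retrieval adapter
# ... score candidate pages ...
model.set_adapter("qa")          # answer over the top-ranked pages with the QA adapter
```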

The retriever uses ColBERT-style late interaction across visual tokens to rank relevant pages. Critically, pages are treated as images, letting the model leverage both layout and visual cues. SV-RAG achieved large performance gains and an 8× speedup on long-document tasks like SlideVQA by avoiding exhaustive page-level inference. The approach scales well, adds minimal parameters, and enables vision-aware retrieval with the same model backbone, paving the way for efficient multimodal document understanding pipelines.
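
Late interaction itself is simple to express: each query token takes its maximum similarity over a page's visual token embeddings, and the per-token maxima are summed to score the page. The sketch below assumes precomputed, per-token embeddings; it illustrates the scoring rule rather than SV-RAG's full retrieval stack.

```python
# Sketch of ColBERT-style late interaction (MaxSim) for ranking document pages.
import torch
import torch.nn.functional as F

def late_interaction_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    # query_emb: (Nq, D) query token embeddings; page_emb: (Np, D) visual token embeddings.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(page_emb, dim=-1)
    sim = q @ p.T                         # (Nq, Np) token-level similarities
    return sim.max(dim=-1).values.sum()   # MaxSim per query token, summed over the query

def rank_pages(query_emb: torch.Tensor, page_embs: list[torch.Tensor]) -> torch.Tensor:
    scores = torch.stack([late_interaction_score(query_emb, p) for p in page_embs])
    return scores.argsort(descending=True)   # page indices, most relevant first
```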