Blog posts

2025

Summaries on Multimodal LLMs for Text-rich Image Understanding

4 minute read

Published: April 27, 2025

We summarize multimodal understanding papers and delve into models like LLaVAR, TRINS, LaRA, LLaVA-Read, and SV-RAG, which focus on enhancing text-rich image comprehension.

Summaries and Thoughts on Multimodal Alignment and Generation

5 minute read

Published: April 27, 2025

This post covers models such as LAFITE, CAFE, ARTIST, and LLaVA-Reward, which aim to improve text-to-image generation through better generalization, improved multimodal alignment, and enhanced text rendering.

Ruiyi Zhang
