Summaries and Thoughts on Multimodal Alignment and Generation

This post covers models such as LAFITE, CAFE, ARTIST, and LLaVA-Reward, which aim to improve text-to-image generation through better generalization, stronger multimodal alignment, and enhanced text rendering.

LAFITE: Language-Free Text-to-Image Generation

Conventional text-to-image training requires vast image-caption pairs. LAFITE eliminates this dependency by training on image-only data, addressing the high cost of annotation and enabling model scalability in low-resource domains. LAFITE uses CLIP to extract image features and generates pseudo-text features through noise perturbation. These are injected into a StyleGAN2 generator, and contrastive losses ensure alignment between image outputs and pseudo-text embeddings—without using real captions. LAFITE achieves competitive performance on MS-COCO and outperforms DALL-E using only 1% of its model size and training data. This approach opens new paths for building efficient, domain-specific models from unlabeled image collections.
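To make the mechanism concrete, here is a minimal sketch of how a pseudo-text feature can be built from a CLIP image embedding by noise perturbation, in the spirit of LAFITE's language-free setup. The function name and the noise_level value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pseudo_text_feature(img_feat: torch.Tensor, noise_level: float = 0.1) -> torch.Tensor:
    """Perturb a CLIP image embedding so it can stand in for a text embedding.

    img_feat: (batch, dim) features from CLIP's image encoder.
    noise_level: relative magnitude of the Gaussian perturbation (assumed value).
    """
    img_feat = F.normalize(img_feat, dim=-1)
    noise = F.normalize(torch.randn_like(img_feat), dim=-1)
    # Add unit-norm noise scaled by noise_level, then re-normalize so the
    # pseudo-text feature stays on the unit hypersphere like CLIP text features.
    return F.normalize(img_feat + noise_level * noise, dim=-1)

feats = torch.randn(4, 512)                   # stand-in for CLIP image features
print(pseudo_text_feature(feats).shape)       # torch.Size([4, 512])
```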

Customization Assistant for Text-to-Image Generation

Users often want to generate images containing new or personalized concepts (e.g., pets, faces) without retraining models. Existing solutions require fine-tuning, which is slow and resource intensive. This work proposes an interactive, efficient alternative. The proposed system, CAFE, combines a multimodal LLM and a diffusion model. It takes a user-provided image and text prompt, infers the intended concept using the LLM, and generates a conditioned image. It supports multi-turn dialogue and provides natural language explanations. CAFE enables real-time, fine-tuning-free personalization and outperforms existing non-finetuned baselines on benchmarks like DreamBench. Its conversational interface enhances usability and aligns well with user intent, moving text-to-image generation toward more intuitive applications.
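Below is a hedged, structural sketch of the fine-tuning-free loop described above: an MLLM interprets the reference image and dialogue, and a diffusion model decodes the inferred conditioning. The names cafe_mllm, cafe_diffusion, and TurnResult are hypothetical stand-ins for illustration, not CAFE's actual API.

```python
from dataclasses import dataclass

@dataclass
class TurnResult:
    explanation: str   # natural-language explanation returned to the user
    condition: list    # conditioning embedding inferred for the diffusion model

def cafe_mllm(image_path: str, prompt: str, history: list) -> TurnResult:
    """Stub: the MLLM reads the reference image and dialogue history, infers the
    intended concept, and emits an explanation plus a conditioning vector."""
    return TurnResult(explanation=f"Generating your concept from {image_path}: {prompt}",
                      condition=[0.0] * 768)   # placeholder embedding

def cafe_diffusion(condition: list) -> str:
    """Stub: a diffusion model decodes the conditioning vector into an image."""
    return "generated_image.png"

# One multi-turn exchange; note that no fine-tuning happens between turns.
history = []
for user_prompt in ["Put my dog on a beach", "Make it sunset"]:
    turn = cafe_mllm("my_dog.jpg", user_prompt, history)
    image = cafe_diffusion(turn.condition)
    history.append((user_prompt, turn.explanation, image))
    print(turn.explanation, "->", image)
```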

ARTIST: Improving the Generation of Text-Rich Images

Existing diffusion models generate realistic visuals but struggle to render legible text, limiting applications like graphic design or signage. ARTIST addresses this by enabling accurate text rendering in generated images without sacrificing overall quality. ARTIST proposes a two-stage diffusion pipeline: a textual diffusion model is trained on synthetic data to learn text structure, while a visual diffusion model integrates this text knowledge via feature injection. A large language model (LLM) further guides the system by identifying textual content in prompts. ARTIST achieves significantly improved readability of text in generated images and outperforms prior models by up to 15% on dedicated benchmarks. The architecture’s disentangled design enables focused improvements and practical deployment in text-sensitive domains.
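The following toy module sketches the feature-injection idea: activations from the textual diffusion branch are projected and added into the visual branch. The dimensions and the projection-plus-addition fusion rule are assumptions; ARTIST's actual connector design may differ.

```python
import torch
import torch.nn as nn

class InjectionBlock(nn.Module):
    """Toy sketch of feature injection: features from the 'textual' diffusion
    branch are projected and added to the visual branch's activations."""

    def __init__(self, visual_dim: int = 320, text_branch_dim: int = 320):
        super().__init__()
        self.proj = nn.Linear(text_branch_dim, visual_dim)

    def forward(self, visual_feat: torch.Tensor, text_branch_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat:      (B, N, visual_dim) activations in the visual diffusion model
        # text_branch_feat: (B, M, text_branch_dim) activations from the textual diffusion model
        injected = self.proj(text_branch_feat).mean(dim=1, keepdim=True)  # pool, then broadcast
        return visual_feat + injected

block = InjectionBlock()
out = block(torch.randn(2, 64, 320), torch.randn(2, 16, 320))
print(out.shape)  # torch.Size([2, 64, 320])
```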

LLaVA-Reward: Multimodal Reward Modeling for T2I Evaluation

Evaluating text-to-image (T2I) outputs across multiple criteria—alignment, safety, fidelity—is labor-intensive and inefficient. Existing models rely heavily on prompts or token scoring, limiting scalability. LLaVA-Reward addresses this by using hidden states of pretrained MLLMs to provide multi-perspective evaluations efficiently. LLaVA-Reward augments a lightweight MLLM (e.g., Phi-3.5-vision) with LoRA adapters and a novel Skip-connection Cross-Attention (SkipCA) module. It processes image-text pairs through a visual encoder and predicts scalar reward scores using the EOS token’s hidden state, with preference learning trained via pairwise ranking loss (Bradley-Terry). LLaVA-Reward delivers state-of-the-art performance on MJ-Bench, TIFA160, and UnsafeBench benchmarks, outperforming CLIP-based and VQA-based models while offering better inference-time efficiency. It also improves image generation quality through diffusion inference-time scaling. The model is adaptable to different perspectives using LoRA, making it practical for scalable reward modeling in T2I tasks.
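As a concrete reference, here is a minimal sketch of the Bradley-Terry pairwise ranking objective on scalar rewards, assuming each reward comes from a head applied to the EOS token's hidden state; the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: maximize P(chosen > rejected) = sigmoid(r_c - r_r).

    reward_chosen / reward_rejected: (batch,) scalar rewards, e.g., a linear head
    applied to the EOS token's hidden state for each image-text pair.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random scores for a batch of 4 preference pairs.
r_c, r_r = torch.randn(4), torch.randn(4)
print(bradley_terry_loss(r_c, r_r))
```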

Thoughts on Multimodal Large Language Models

Post-Training Limits Their Potential

Chameleon and LLaMA-Fusion both build on top of pre-trained, text-only language models. These base LLMs are later adapted for multimodal tasks through fine-tuning. While this leverages strong language capabilities, it introduces limitations:

  • Inherited Constraints: The base LLMs weren’t designed for images. Their architecture and data were optimized for text, so multimodal adaptation feels like a retrofit, limiting seamless integration of visual inputs.
  • Performance Ceiling: Unified MLLMs often underperform specialized systems. Vision models like CLIP or generation models like Stable Diffusion excel at their specific tasks, while MLLMs must compromise. As a result, they typically lag behind SoTA models in both understanding and generation.

Bottlenecks in Chameleon and LLaMA-Fusion

  • Chameleon’s VQ-VAE Bottleneck:
    • Information Loss: The compression discards fine visual details, hurting generation quality (a toy quantization sketch follows this list).
    • Scalability Limits: Expanding the codebook to capture more nuance requires heavy compute, capping performance.
  • LLaMA-Fusion’s Pretrained Model Constraint:
    • Misaligned multimodal tokenizer: Ideally, different modality encoders would share the same embedding space, but LLaMA-Fusion cannot fully bridge the gap between modalities.
    • Limited Adaptability: Keeping LLM weights frozen preserves text skills but prevents deeper cross-modal alignment.
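The toy snippet below illustrates the quantization step behind Chameleon's bottleneck: each continuous latent is snapped to its nearest codebook entry, so any detail not representable by the codebook is lost, and shrinking that loss means growing the codebook. The shapes and codebook sizes here are arbitrary choices for illustration, not Chameleon's actual configuration.

```python
import torch

def vq_quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each continuous latent to its nearest codebook entry (L2 distance).

    z: (N, D) encoder outputs; codebook: (K, D) learned codes.
    Whatever the nearest code cannot express is discarded -- the information
    loss discussed above.
    """
    dists = torch.cdist(z, codebook)   # (N, K) pairwise distances
    idx = dists.argmin(dim=-1)         # nearest-code index per latent
    return codebook[idx]               # quantized latents

# A larger codebook reduces quantization error, but it also inflates the
# discrete vocabulary the LLM must model -- the scalability limit above.
z = torch.randn(1000, 64)
for k in (256, 8192):
    codebook = torch.randn(k, 64)
    err = (vq_quantize(z, codebook) - z).pow(2).mean()
    print(k, float(err))
```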

Multimodal Pretraining + Diffusion Head

  • Multimodal Pretraining with Better Visual Tokenizers
    • Train the LLM from scratch on both text and images. Instead of VQ-VAE, use more expressive visual tokenizers, such as continuous embeddings (e.g., from Vision Transformers), which preserve richer visual detail. Pretraining on large multimodal datasets lets the model learn aligned representations from the ground up. Such a unified tokenizer tackles a fundamental multimodal alignment problem, for which novel architectures and training paradigms are still being explored.
    • Illume+ (https://illume-unified-mllm.github.io) provides an intermediate solution, where two visual encoders are used to balance semantic and pixel-level info.
  • Add a Diffusion Head for Generation
    • Once pretrained, attach a diffusion model to generate images: diffusion excels at detail and realism, outperforming VQ-based generation, and offers better controllability, producing outputs that more accurately reflect text prompts. This setup combines deep understanding from the MLLM with high-fidelity generation from the diffusion head, as sketched below.
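Here is a minimal sketch of that setup, assuming the MLLM exposes hidden states that condition a diffusion head through cross-attention. The DiffusionHead class, its dimensions, and the single-block design are illustrative assumptions rather than any specific paper's architecture.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Toy denoiser block conditioned on MLLM hidden states via cross-attention."""

    def __init__(self, latent_dim: int = 64, cond_dim: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(latent_dim, num_heads=4,
                                          kdim=cond_dim, vdim=cond_dim,
                                          batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(latent_dim, latent_dim * 4),
                                 nn.GELU(),
                                 nn.Linear(latent_dim * 4, latent_dim))

    def forward(self, noisy_latents: torch.Tensor, mllm_states: torch.Tensor) -> torch.Tensor:
        # noisy_latents: (B, N, latent_dim) image latents at some diffusion timestep
        # mllm_states:   (B, T, cond_dim) hidden states from the multimodal LLM
        ctx, _ = self.attn(noisy_latents, mllm_states, mllm_states)
        return self.mlp(noisy_latents + ctx)   # predicted noise (or velocity)

head = DiffusionHead()
pred = head(torch.randn(2, 16, 64), torch.randn(2, 77, 1024))
print(pred.shape)  # torch.Size([2, 16, 64])
```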

Classical Models vs. Modern Approaches and Feasible Solutions

  • Classical Methods (e.g., T5 + Diffusion):
    • They use text-trained encoders to drive generation, often leading to poor alignment and weak control—images may not match prompts well.
  • Modern Unified Models (OpenAI GPT-4o and Google Gemini):
    • Treat all inputs—text and vision—as tokens in a single sequence. This end-to-end architecture learns cross-modal dependencies natively, improving both understanding and controllability. It’s compute-heavy, but it works.

Summary

Current MLLMs are retrofitted and constrained, limited by tokenization (Chameleon) or rigidity (LLaMA-Fusion). Classical generation models struggle with alignment and world knowledge. Unified, token-level models like GPT-4o or Gemini point the way forward. A feasible path is to pretrain a multimodal LLM, then attach a diffusion head for top-tier generation quality and strong controllability.