Welcome!
Ruiyi Zhang is currently a Research Scientist at Adobe Research. His recent work contributes to the development of foundational models (such as LAFITE, LLaVAR, and AutoDAN). He has been developing GenAI features for Adobe Acrobat AI Assistant and Firefly.
His research interests include multimodal learning, natural language processing, and reinforcement learning. More specifically, he currently focuses on the following subfields:
Multimodal LLM for text-rich image understanding and reasoning
Structured information such as stylized texts, diagrams, tables, and charts presents significant challenges for text-rich image understanding and reasoning. My recent papers have focused on enhancing the "reading ability" of MLLMs:
- LLaVAR - Visual text comprehension
- LLaVA-Read - Complex layout understanding
- TextLap - Layout planning via coordinates generation
- TRIG - Visual text grounding for multimodal QA attributions
- SV-RAG - Long-context handling for documents with hundreds of pages
Building large-scale synthetic multimodal data
Building large-scale and high-quality training data is essential for multimodal models. I have been working on a text-rich image instruction-tuning dataset (TRINS), evaluation benchmarks (MMR), and data filtering mechanisms (LLaVAR-2). In addition, we investigated how to efficiently build image editing and customization datasets (Toffee).
Multimodal generative models aligned with user preference
Multimodal alignment is a fundamental research problem for better generation controllability. LAFITE first tackled this problem by exploiting unCLIP to generate pseudo text features from given images. ARTIST and CAFE aim to integrate LLMs' knowledge into the creation process for better generalization ability.
Multimodal Agents learning from user interactions
I am currently interested in two problems: (1) How to enable multimodal agents to efficiently interact with real environments and learn from them (tool usage, memory, and reward modeling). (2) How to enable LLMs to better understand their action space and generate accurate actions.
I am also broadly interested in LLM Agents, Alignment & Safety, Retrieval-Augmented Generation (RAG) and Diffusion Model applications.
Earlier Research Interests
Ruiyi Zhang’s Ph.D. research initially focused on uncertainty estimation, specifically employing variational methods to approximate complex distributions, with key contributions highlighted in his ICML 2019 invited talk [Slides]. His research then shifted to designing interactive machine learning algorithms that can be broadly applied to real-world problems, such as text generation and vision-language recommendation. His thesis is on [Uncertainty Estimation in Deep Reinforcement Learning].
Experience
Ruiyi Zhang obtained his Ph.D. from the Department of Computer Science at Duke University, where he was advised by Professor Lawrence Carin. He received his B.Sc. degree from Nanjing University in 2016, where he was a member of the LAMDA Group, led by Dr. Zhi-Hua Zhou.
He worked as a research intern at Google Brain (Mountain View, Summer 2019), Samsung Research America (Mountain View, Spring 2019 & 2020), Adobe Research (San Jose, Summer 2018), and Alibaba AntAI (Hangzhou, Summer 2016).
Collaborations
I am always happy to collaborate with enthusiastic and talented students on topics related to multimodal large language models and reinforcement learning. If you are interested in working with me, feel free to reach out.
Previous interns/collaborators