The500Feed.Live

Everything going on in AI - updated daily from 500+ sources

📄 Research · May 13, 2026

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

When humans describe a visual scene, they do not process the entire image uniformly; instead, they selectively fixate on regions relevant to their intended description. In contrast, current multimodal large language models (MLLMs) attend to all visual tokens at each generation step, leading to dilut...
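The excerpt contrasts uniform attention over all visual tokens with selective, gaze-guided attention. A minimal illustrative sketch of that idea (not the paper's actual method; the function and the gaze prior are hypothetical) is to bias attention logits over visual tokens with a gaze-derived saliency prior, so fixated regions receive more weight:

```python
import numpy as np

def gaze_weighted_attention(scores, gaze_prior):
    """Bias attention over visual tokens with a gaze saliency prior.

    scores: raw attention logits over visual tokens, shape (n_tokens,)
    gaze_prior: nonnegative saliency per token, shape (n_tokens,)
    Returns a probability distribution over the visual tokens.
    """
    # Adding the log-prior to the logits is equivalent to multiplying
    # the softmax weights by the prior, then renormalizing.
    biased = scores + np.log(gaze_prior + 1e-8)
    weights = np.exp(biased - biased.max())  # stable softmax
    return weights / weights.sum()

# Uniform logits: a plain softmax would spread attention evenly,
# which is the "diluted" behavior the abstract describes.
scores = np.array([1.0, 1.0, 1.0, 1.0])
gaze = np.array([0.7, 0.1, 0.1, 0.1])  # gaze fixates on token 0
w = gaze_weighted_attention(scores, gaze)
print(w.argmax())  # the fixated token dominates
```

Here the gaze prior concentrates attention on the fixated token while the remaining tokens share the residual mass, which is the selective behavior the abstract attributes to human viewers.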


Source

http://arxiv.org/abs/2605.13080v1