Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in t...

Source: http://arxiv.org/abs/2605.27101v1