GAVEL: Grounded Caption Error Verification and Localization

Vision-language models (VLMs) often produce hallucinated or inconsistent outputs, where text and images are not properly aligned. Addressing this issue requires not only detecting misalignment but also explaining the discrepancy and localizing its visual evidence. We introduce GAVEL (Grounded Captio...

Source: http://arxiv.org/abs/2606.26923v1