Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding

Vision-Language Models (VLMs) parse documents end-to-end but frequently break down on layouts unlike those seen in training. We attribute this to a two-hop bottleneck: before the decoder can extract content (Hop 2), it must first classify and localize the enclosing layout entity (Hop 1), and when th...
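The two-hop dependency described above can be illustrated with a small arithmetic sketch. It assumes only what the text states, namely that Hop 2 cannot succeed unless Hop 1 has succeeded, so end-to-end accuracy is the product of Hop 1 accuracy and Hop 2 accuracy conditioned on a correct Hop 1. The function name and all numbers are hypothetical and chosen for illustration; they are not results or methods from the paper.

```python
def end_to_end_accuracy(p_hop1: float, p_hop2_given_hop1: float) -> float:
    """Probability that both hops succeed.

    Assumes Hop 2 (content extraction) can only succeed when Hop 1
    (layout classification and localization) has already succeeded.
    """
    for p in (p_hop1, p_hop2_given_hop1):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_hop1 * p_hop2_given_hop1


# Hypothetical accuracies, NOT figures from the paper.
# Familiar layouts: Hop 1 is usually right.
in_dist = end_to_end_accuracy(p_hop1=0.95, p_hop2_given_hop1=0.90)

# Unfamiliar layouts: Hop 1 degrades while Hop 2 is held fixed.
out_dist = end_to_end_accuracy(p_hop1=0.60, p_hop2_given_hop1=0.90)

print(f"in-distribution:     {in_dist:.3f}")
print(f"out-of-distribution: {out_dist:.3f}")
```

Under this assumption, end-to-end accuracy can never exceed Hop 1 accuracy, which is why a failure to identify the enclosing layout entity caps extraction quality no matter how well the decoder reads content.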

Source

http://arxiv.org/abs/2605.19866v1