The500Feed.Live

Everything going on in AI - updated daily from 500+ sources

📄 Research · June 24, 2026

Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models

Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (op...
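To make the idea of order sensitivity concrete, here is a minimal sketch of one way to measure it. It is not the paper's Facet-Probe audit, whose details are truncated above. It assumes a hypothetical `model(question, evidence)` callable that returns an answer string, and the name `order_flip_rate` is likewise an illustration. The function compares the answer on the canonical ordering with the answers on reorderings of the same evidence.

```python
import itertools
from typing import Callable, List, Sequence

# Hypothetical interface: a model takes a question and an ordered list of
# evidence items (e.g. image captions or options) and returns an answer.
Model = Callable[[str, List[str]], str]


def order_flip_rate(model: Model, question: str, evidence: Sequence[str],
                    max_orderings: int = 24) -> float:
    """Fraction of reorderings whose answer differs from the canonical one."""
    canonical = model(question, list(evidence))
    # permutations() yields the original order first, so skip index 0.
    reorderings = itertools.islice(
        itertools.permutations(evidence), 1, max_orderings + 1
    )
    answers = [model(question, list(p)) for p in reorderings]
    if not answers:  # zero or one evidence item: nothing to shuffle
        return 0.0
    flips = sum(a != canonical for a in answers)
    return flips / len(answers)


# Toy stand-ins for a real MLLM (assumptions for illustration only).
def first_item_model(question: str, evidence: List[str]) -> str:
    return evidence[0]           # answer depends on position: order-sensitive


def sorted_model(question: str, evidence: List[str]) -> str:
    return sorted(evidence)[0]   # answer ignores position: order-invariant


if __name__ == "__main__":
    items = ["cat", "dog", "bird"]
    print(order_flip_rate(first_item_model, "Which item?", items))
    print(order_flip_rate(sorted_model, "Which item?", items))
```

A flip rate of zero means the answer never changed under the tested reorderings. For long evidence lists, enumerating permutations in order covers only a narrow slice of the orderings, so random sampling would be a more representative choice.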


Source

http://arxiv.org/abs/2606.26079v1