📄 Research · June 24, 2026
Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models
Standard benchmarks for multimodal large language models (MLLMs) score each item on one canonical ordering and miss whether order-irrelevant shuffling changes the answer, a baseline reliability property called for by emerging AI evaluation guidelines. We introduce Facet-Probe, a five-facet audit (op...
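The order-sensitivity idea in the abstract can be illustrated with a minimal sketch. This is not Facet-Probe itself, whose five facets are not described in the excerpt above; it only shows one generic way to check whether reordering order-irrelevant inputs changes an answer. The names `flip_rate`, `first_item_model`, and `order_invariant_model` are hypothetical, and the toy "models" are plain functions standing in for a real MLLM call.

```python
from itertools import permutations
from typing import Callable, Hashable, Sequence


def flip_rate(
    answer_fn: Callable[[Sequence[str]], Hashable],
    items: Sequence[str],
) -> float:
    """Fraction of orderings whose answer differs from the canonical one.

    Assumption: the order of `items` is irrelevant to the correct answer,
    so any change in the output counts as order sensitivity.
    Exhaustive enumeration is only practical for small item sets.
    """
    canonical = answer_fn(tuple(items))
    orderings = list(permutations(items))
    flips = sum(1 for ordering in orderings if answer_fn(ordering) != canonical)
    return flips / len(orderings)


# Hypothetical stand-ins for a model call (not a real MLLM API).
def first_item_model(items: Sequence[str]) -> str:
    # Position-biased: always answers with whatever comes first.
    return items[0]


def order_invariant_model(items: Sequence[str]) -> str:
    # Order-invariant: the answer depends only on the set of items.
    return min(items)


if __name__ == "__main__":
    evidence = ["a", "b", "c", "d"]
    print(flip_rate(first_item_model, evidence))       # 0.75
    print(flip_rate(order_invariant_model, evidence))  # 0.0
```

With four items there are 24 orderings; the position-biased stand-in keeps its canonical answer only in the 6 orderings that start with the same item, which gives the 0.75 flip rate, while the order-invariant stand-in never changes its answer.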