AI News Archive: May 21, 2026 — Part 19

Sourced from 500+ daily AI sources, scored by relevance.

Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping
Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies p...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22578v1
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22570v1
Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain
Training Deep Neural Networks for tracking individual cells in biomedical videos requires a large amount of annotated data. The annotation of videos for cell tracking is very time consuming and often requires domain expertise; this explains the limited availability of public annotated data to addres...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22563v1
GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-ho...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22558v1
Visual Usability Checker
Validate your design decisions instantly with AI insights
🧰 Tools May 21, 2026 https://www.producthunt.com/products/attention-insight?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding
Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we ...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22550v1
Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement
Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by historical similar cases and...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22547v1
Predicting 30-Day Heart Failure Readmissions Using Machine Learning: Insights From the Kansas Health Information Network (KHIN)
Background: Heart failure (HF) is a major contributor to inpatient hospital utilization, with persistently high 30-day readmission rates. Existing prediction tools are frequently restricted to primary-diagnosis HF admissions, potentially excluding clinically relevant HF-related hospitalizations. Objectives: To develop and validate machine learning (ML)-based risk prediction models for 30-day readmissions among patients with HF using the Kansas Health Information Network, a statewide health information exchange. Methods: This retrospective cohort study analyzed HF hospitalizations using predictors including demographics, comorbidities, laboratory results, medications, clinical quality metrics for diabetes and kidney disease management, and prior healthcare utilization. Five ML models, including regularized logistic regression, random forest, extreme gradient boosting, categorical boosting, and deep neural network, were trained using stratified 5-fold
📄 Research May 21, 2026 https://www.medrxiv.org/content/10.64898/2026.05.18.26353537v1?rss=1
Interpretable Symptom-Based Machine Learning for Parkinson's Disease Prediction: A Feasibility Study
Background: Parkinson's disease (PD) has a prolonged prodromal phase during which non-motor symptoms (NMS) may emerge years before the appearance of classical motor signs. This makes NMS a promising and clinically accessible source of information for early risk stratification. Objective: In this study, we investigated whether NMS alone can serve as reliable predictors of PD risk using clinical data from the Parkinson's Progression Markers Initiative (PPMI) cohort. Methods: We developed a stacked ensemble machine learning framework that integrates feature-level modelling, a global multivariate model, and a patient-similarity component to capture complementary patterns within NMS profiles. The model was trained using leakage-controlled patient-level validation and evaluated on an independent held-out test set. Results: The final ensemble achieved strong predictive performance, with an area under the ROC curve of 0.955, sensitivity of 0.929, and specificity of 0.900. Explainability analys
📄 Research May 21, 2026 https://www.medrxiv.org/content/10.64898/2026.05.15.26352866v1?rss=1
Automated Macrolinguistic Discourse Analysis for Transdiagnostic Detection of Language Impairments
Macrolinguistic discourse analysis offers valuable insight into how patients with neurogenic communication disorders organize and produce informative speech, yet it remains a largely manual and labor-intensive process. We report an automated pipeline for macrolinguistic discourse analysis for individuals with aphasia and dementia that integrates automatic speech recognition (ASR), utterance segmentation, sentence-level embeddings, centroid-based main-concept matching, and rule-based coherence error classification. These algorithms were applied to Cinderella story retellings from 309 participants (113 controls, 102 post-stroke aphasia (PWA), and 94 dementia). The algorithm reliably identified main concepts (83% accuracy against human labels) and derived interpretable features such as semantic distance to a main concept centroid, main concept coverage, and coherence error rates. Crucially, diagnostic classification results showed that logistic-regression classifiers trained on 10 macroli
📄 Research May 21, 2026 https://www.medrxiv.org/content/10.64898/2026.05.19.26353614v1?rss=1
Economic costing of evaluating, deploying and monitoring an artificial intelligence-based reconstruction for acceleration of rectal MRI examinations
Objectives: AI-based reconstructions can reduce MRI acquisition times and/or improve image quality. Guidelines recommend clinical evaluations and post-deployment monitoring of these novel methods; however, there has been little investigation of the clinical resources required for such assessments. The aim of this study was to evaluate the healthcare resource utilisation and potential savings associated with AI-based reconstructions in rectal MRI. Methods: A retrospective economic costing analysis was conducted from the NHS healthcare perspective. Resource utilisation data were extracted from the Electronic Patient Records for 9 healthy volunteer scans and 104 rectal MRI examinations evaluating an AI-based reconstruction. The resource profile included the MRI scan and the staff time required for data acquisition and analysis. Results: The clinical evaluation of the AI-based reconstruction cost £15,023. Deployment of the AI-based reconstruction reduced the length of an MRI rectum s
📄 Research May 21, 2026 https://www.medrxiv.org/content/10.64898/2026.05.18.26353474v1?rss=1
Developing Provider-Co-Created Prototypes Addressing Equity-Related Barriers in Liver Transplantation for Hepatocellular Carcinoma
Background: Black patients and individuals with low socioeconomic status (SES) face significant disparities in accessing curative therapies for hepatocellular carcinoma (HCC), including liver transplantation. This study aimed to develop provider-co-created intervention prototypes in response to patient-identified barriers and recommendations. Methods: A human-centered design session with hepatology and transplant providers at a large academic medical center was conducted. Prior to the session, participants were presented with barriers and preliminary solutions identified through an earlier human-centered design session with Black and low-SES patients. Using structured ideation methods, including brainwriting, challenge mapping, and concept voting, providers co-created intervention prototypes. Final concepts were synthesized from patient insights, provider input, and design methods using affinity diagramming and concept modeling. Results: Nine providers participated in the session. They
📄 Research May 21, 2026 https://www.medrxiv.org/content/10.64898/2026.05.15.26353301v1?rss=1
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit an...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22816v1
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22812v1
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial r...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22777v1
Spectral Tail Auxiliary Learning for AI-Generated Image Detection
As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increasingly challenging. Many existing methods exploit frequency-domain cues for detection, typically described as frequency-domain artifacts or h...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22751v1
WorldKV: Efficient World Memory with World Retrieval and Compression
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks re...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22718v1
Swift Sampling: Selecting Temporal Surprises via Taylor Series
While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame s...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22678v1
Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment
Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22677v1
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent v...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22671v1
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Ro...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22668v1
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization ...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22658v1
From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder
Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consiste...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22649v1
GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT
Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on ...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22619v1
Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following
Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a ...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22607v1
Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure
Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still re...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22591v1
SceneAligner: 3D-Grounded Floorplan Localization in the Wild
Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22581v1
AlliHat
Claude AI in your Safari sidebar
🧰 Tools May 21, 2026 https://www.producthunt.com/products/allihat?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation
Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimension...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22572v1
FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22552v1
Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking
Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22538v1
LACO: Adaptive Latent Communication for Collaborative Driving
Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordinatio...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22504v1
Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline
Fine-grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long-tailed distribution and strong variation in image acquisition conditions. We propose a training-free two-stage fram...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22492v1
Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling
Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22484v1
Matching with Deliberation: Test-Time Evolutionary Hierarchical Multi-Agents for Zero-Shot Compositional Image Retrieval
Zero-Shot Compositional Image Retrieval (ZS-CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitutes the core challenge of the task. Existing methods often suffer from Perception M...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22478v1
MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation
Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals usi...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22469v1
Making the Discrete Continuous: Synthetic RAW Augmentations for Fine-Grained Evaluation of Person Detection Performance in Low Light
Real-world deployment of AI vision models is both fueled and limited by the data available for training and testing. Real datasets are sparse and uneven: long-tailed or unbalanced distributions hinder generalization, and the low number of samples in low density regions makes it hard to run evaluatio...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22455v1
Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
While large vision-language-action (VLA) models and generative world models (WM) have advanced long-horizon embodied intelligence, their practical deployment remains challenged by uncertainty in learning-based action generation. Low-quality actions may cause physical failures during execution or lea...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22446v1
Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration
Exploration is a prerequisite for learning useful behaviors in sparse-reward, long-horizon tasks, particularly within 3D environments. Curiosity-driven reinforcement learning addresses this via intrinsic rewards derived from the mismatch between the agent's predictive model of the world and reality....
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22814v1
The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning
Robustness, domain adaptation, photometric and occlusion invariance, compositional generalisation, temporal robustness, alignment safety, and classical anisotropic regularisation are usually treated as separate problems with separate method families. This paper argues that much of their shared struc...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22800v1
LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems
Large language model (LLM)-based multi-agent systems increasingly rely on intermediate communication to coordinate complex tasks. While most existing systems communicate through natural language, recent work shows that latent communication, particularly through transformer key-value (KV) caches, can...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22786v1
Cyber-Physical Anomaly Detection in IoT-Enabled Smart Grids Using Machine Learning and Metaheuristic Feature Optimization
Modern smart grids rely on dense measurement infrastructures, communication links, and intelligent field devices. Although this improves supervision and control, it also increases vulnerability to cyber-physical disruptions. Operators must distinguish physical incidents, such as faults or line distu...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22749v1
Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning
Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, pr...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22748v1
Plug-in Losses for Evidential Deep Learning: A Simplified Framework for Uncertainty Estimation that Includes the Softmax Classifier
Real-world sensor-based learning systems require uncertainty estimation that is both reliable and computationally efficient. Evidential Deep Learning (EDL) provides single-pass uncertainty estimation by modeling the class probabilities via Dirichlet distributions, where the Dirichlet parameters are ...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22746v1
SeqLoRA: Bilevel Orthogonal Adaptation for Continual Multi-Concept Generation
Parameter-efficient fine-tuning enables fast personalization of text-to-image diffusion models, but composing multiple custom concepts remains challenging due to representation interference. Existing modular methods either rely on expensive post-hoc fusion or freeze adaptation subspaces, which limit...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22743v1
Proxy-Based Approximation of Shapley and Banzhaf Interactions
Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sam...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22738v1
Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation
Large language model post-training methods such as supervised fine-tuning (SFT), reinforcement learning (RL), and distillation are often analyzed through their loss functions: maximum likelihood, policy gradients, forward KL, reverse KL, or related objective-level variants. We study a complementary ...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22731v1
Reading Task Failure Off the Activations: A Sparse-Feature Audit of GPT-2 Small on Indirect Object Identification
We report a small, reproducible audit of which sparse-autoencoder (SAE) features of GPT-2 small fire differently on failed versus successful trials of the Indirect Object Identification (IOI) task. On 300 prompts, GPT-2 small reaches 79.7% accuracy; 146 of the 24,576 features in the layer-8 residual...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22719v1
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we inv...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22717v1
Abstraction for Offline Goal-Conditioned Reinforcement Learning
Markov Decision Processes (MDPs) often exhibit significant redundancy due to symmetries and shared structure across state-goal pairs in real-world Goal-Conditioned Reinforcement Learning (GCRL). While hierarchical policies have been motivated for horizon reduction via temporal abstraction in offline...
📄 Research May 21, 2026 http://arxiv.org/abs/2605.22711v1