AI News Archive: May 21, 2026 — Part 19
Sourced from 500+ daily AI sources, scored by relevance.
- Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation...
- Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, ...
- Slideshot
Product demo videos, recorded by your AI agent
- Boundary-targeted Membership Inference Attacks on Safety Classifiers
Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, ...
- TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chines...
- GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural...
- Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-...
- Cambrian-P: Pose-Grounded Video Understanding
Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead o...
- Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, diversity of sensor configurations, as well as geographic and long-tail-behavioral coverage. ...
- Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition
Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automate...
- Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models
Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We int...
- What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining
CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision ...
- H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning
Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense hum...
- Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection
Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing mot...
- Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping
Online map estimation is a crucial component of autonomous driving systems that reduces the reliance on costly high-definition maps. State-of-the-art (SOTA) methods commonly predict map elements as ordered sequences of points that form polylines and polygons. The evaluation of these methods relies p...
- VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
Spatio-temporal reasoning is a core capability for Multimodal Large Language Models (MLLMs) operating in the real world. As such, evaluating it precisely has become an essential challenge. However, existing spatio-temporal reasoning benchmark datasets primarily rely on static image sets or passively...
- Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain
Training Deep Neural Networks for tracking individual cells in biomedical videos requires a large amount of annotated data. The annotation of videos for cell tracking is very time-consuming and often requires domain expertise; this explains the limited availability of public annotated data to addres...
- GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-ho...
- Visual Usability Checker
Validate your design decisions instantly with AI insights
- MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding
Two-wheelers account for a disproportionately high share of road fatalities in the Global South. Research on two-wheeler rider behavior, however, lags far behind four-wheelers, where multimodal datasets have driven major advances in Advanced Driver Assistance Systems (ADAS). To address this gap, we ...
- Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement
Deep learning has brought significant progress to medical image classification, yet most existing methods still rely on isolated visual evidence and cannot effectively leverage similar cases or external knowledge. In clinical practice, diagnosis is typically supported by historical similar cases and...
- Predicting 30-Day Heart Failure Readmissions Using Machine Learning: Insights From the Kansas Health Information Network (KHIN)
Background: Heart failure (HF) is a major contributor to inpatient hospital utilization, with persistently high 30-day readmission rates. Existing prediction tools are frequently restricted to primary-diagnosis HF admissions, potentially excluding clinically relevant HF-related hospitalizations. Objectives: To develop and validate machine learning (ML)-based risk prediction models for 30-day readmissions among patients with HF using the Kansas Health Information Network, a statewide health information exchange. Methods: This retrospective cohort study analyzed HF hospitalizations using predictors including demographics, comorbidities, laboratory results, medications, clinical quality metrics for diabetes and kidney disease management, and prior healthcare utilization. Five ML models, including regularized logistic regression, random forest, extreme gradient boosting, categorical boosting, and a deep neural network, were trained using stratified 5-fold
- Interpretable Symptom-Based Machine Learning for Parkinson's Disease Prediction: A Feasibility Study
Background: Parkinson's disease (PD) has a prolonged prodromal phase during which non-motor symptoms (NMS) may emerge years before the appearance of classical motor signs. This makes NMS a promising and clinically accessible source of information for early risk stratification. Objective: In this study, we investigated whether NMS alone can serve as reliable predictors of PD risk using clinical data from the Parkinson's Progression Markers Initiative (PPMI) cohort. Methods: We developed a stacked ensemble machine learning framework that integrates feature-level modelling, a global multivariate model, and a patient-similarity component to capture complementary patterns within NMS profiles. The model was trained using leakage-controlled patient-level validation and evaluated on an independent held-out test set. Results: The final ensemble achieved strong predictive performance, with an area under the ROC curve of 0.955, sensitivity of 0.929, and specificity of 0.900. Explainability analys
- Automated Macrolinguistic Discourse Analysis for Transdiagnostic Detection of Language Impairments
Macrolinguistic discourse analysis offers valuable insight into how patients with neurogenic communication disorders organize and produce informative speech, yet it remains a largely manual and labor-intensive process. We report an automated pipeline for macrolinguistic discourse analysis for individuals with aphasia and dementia that integrates automatic speech recognition (ASR), utterance segmentation, sentence-level embeddings, centroid-based main-concept matching, and rule-based coherence error classification. These algorithms were applied to Cinderella story retellings from 309 participants (113 controls, 102 post-stroke aphasia (PWA), and 94 dementia). The algorithm reliably identified main concepts (83% accuracy against human labels) and derived interpretable features such as semantic distance to a main concept centroid, main concept coverage, and coherence error rates. Crucially, diagnostic classification results showed that logistic-regression classifiers trained on 10 macroli
- Economic costing of evaluating, deploying and monitoring an artificial intelligence-based reconstruction for acceleration of rectal MRI examinations
Objectives: AI-based reconstructions can reduce MRI acquisition times and/or improve image quality. Guidelines recommend clinical evaluations and post-deployment monitoring of these novel methods; however, there has been little investigation of the clinical resources required for such assessments. The aim of this study was to evaluate the healthcare resource utilisation and potential savings associated with AI-based reconstructions in rectal MRI. Methods: A retrospective economic costing analysis was conducted from the NHS healthcare perspective. Resource utilisation data were extracted from the Electronic Patient Records for 9 healthy volunteer scans and 104 rectal MRI examinations evaluating an AI-based reconstruction. The resource profile included the MRI scan and the staff time required for data acquisition and analysis. Results: The clinical evaluation of the AI-based reconstruction cost £15,023. Deployment of the AI-based reconstruction reduced the length of an MRI rectum s
- Developing Provider-Co-Created Prototypes Addressing Equity-Related Barriers in Liver Transplantation for Hepatocellular Carcinoma
Background: Black patients and individuals with low socioeconomic status (SES) face significant disparities in accessing curative therapies for hepatocellular carcinoma (HCC), including liver transplantation. This study aimed to develop provider-co-created intervention prototypes in response to patient-identified barriers and recommendations. Methods: A human-centered design session with hepatology and transplant providers at a large academic medical center was conducted. Prior to the session, participants were presented with barriers and preliminary solutions identified through an earlier human-centered design session with Black and low-SES patients. Using structured ideation methods, including brainwriting, challenge mapping, and concept voting, providers co-created intervention prototypes. Final concepts were synthesized from patient insights, provider input, and design methods using affinity diagramming and concept modeling. Results: Nine providers participated in the session. They
- AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
Vision-and-Language Navigation (VLN) requires an agent to ground language instructions to its own movement within a visual environment. While state-of-the-art methods leverage the reasoning capabilities of Vision-Language Models (VLMs) for end-to-end action prediction, they often lack an explicit an...
- GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
Vision-Language-Action (VLA) models have shown strong potential for general-purpose robot manipulation by unifying perception and action. However, existing VLA systems primarily rely on textual instructions and struggle to resolve spatial ambiguity in complex scenes with multiple similar objects. To...
- DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
Representation Autoencoders (RAEs) leverage frozen vision foundation models (VFMs) as tokenizer encoders, providing robust high-level representations that facilitate fast convergence and high-quality generation in latent diffusion models. However, freezing the VFM inherently constrains its spatial r...
- Spectral Tail Auxiliary Learning for AI-Generated Image Detection
As generative image models evolve rapidly, the perceptual gap between generated and real images continues to narrow, making AI-generated image detection increasingly challenging. Many existing methods exploit frequency-domain cues for detection, typically described as frequency-domain artifacts or h...
- WorldKV: Efficient World Memory with World Retrieval and Compression
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks re...
- Swift Sampling: Selecting Temporal Surprises via Taylor Series
While most frames in long-form video are redundant, the critical information resides in temporal surprises: moments where the actual visual features deviate from their predicted evolution. Inspired by the human brain's predictive coding, we introduce Swift Sampling, an elegant, training-free frame s...
- Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment
Deploying vision models across devices with varying resource constraints, or even on a single device where available compute fluctuates due to battery state, thermal throttling, or latency deadlines, typically requires training and maintaining separate models. Width-adaptive inference addresses this...
- From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
Vision-Language-Action (VLA) models often suffer from performance degradation under distribution shifts, as they struggle to learn generalized behavior representations across varying environments. While existing approaches attempt to construct behavior representations through action-centric latent v...
- SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Ro...
- SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
While large language models provide strong compositional reasoning, existing reasoning segmentation pipelines fail to transparently connect this reasoning to visual perception. Current methods, such as latent query alignment, are end-to-end yet opaque "black boxes". Conversely, textual localization ...
- From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder
Dual-energy X-ray absorptiometry (DXA) is widely used for large-scale skeletal assessment, yet learning controllable and interpretable factor-specific anatomical variation remains challenging. We propose a metadata-conditioned causal hierarchical variational autoencoder (CHVAE) for causally consiste...
- GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT
Grounding radiology report descriptions to 3D CT volumes is essential for verifiable clinical interpretation, yet remains challenging due to the semantic-spatial gap between free-text narratives and volumetric anatomy. Existing report-assisted and vision-language grounding methods typically rely on ...
- Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following
Gaze following requires both scene understanding and gaze reasoning to localize the gaze target of an in-scene person. Recently, vision foundation models (VFMs) have demonstrated strong performance on this task, enabling simpler architectures while outperforming prior methods. However, we observe a ...
- Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure
Frozen Vision Foundation Models (VFMs) with lightweight classification heads are increasingly used in medical imaging because they offer efficient and reproducible deployment. Yet noisy-label learning methods for this frozen-feature regime remain poorly understood, and most existing methods still re...
- SceneAligner: 3D-Grounded Floorplan Localization in the Wild
Many public buildings provide floorplans with a "you are here" indicator to help visitors orient themselves. Floorplan localization seeks to computationally replicate this capability by determining where visual observations were captured within a floorplan. However, existing methods typically assume...
- AlliHat
Claude AI in your Safari sidebar
- SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation
Accurate segmentation of brain tumour sub-regions from multi-parametric MRI is critical for treatment planning yet remains challenging due to morphological variability, class imbalance, and overlapping appearances of tumour regions across imaging sequences. We propose SegGuidedNet, a three-dimension...
- FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
Fashion image retrieval is a cornerstone of modern e-commerce systems. A unified framework that supports diverse query formats and search intentions is highly desired in practice. However, existing approaches focus on narrow retrieval tasks and do not fully capture such diversity. Therefore, in this...
- Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking
Traditional visual object tracking (VOT) methods typically rely on task-specific supervised training, limiting their generalization to unseen objects and challenging scenarios with distractors, occlusion, and nonlinear motion. Recent vision foundation models, exemplified by SAM 2, learn strong video...
- LACO: Adaptive Latent Communication for Collaborative Driving
Collaborative driving aims to improve safety and efficiency by enabling connected vehicles to coordinate under partial observability. Recent approaches have evolved from sharing visual features for perception to exchanging language-based reasoning through foundation models for behavioral coordinatio...
- Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline
Fine-grained semantic segmentation requires both precise localization and discrimination between visually similar classes. In FungiTastic, this problem is further complicated by a long-tailed distribution and strong variation in image acquisition conditions. We propose a training-free two-stage fram...
- Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling
Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting...
- Matching with Deliberation: Test-Time Evolutionary Hierarchical Multi-Agents for Zero-Shot Compositional Image Retrieval
Zero-Shot Compositional Image Retrieval (ZS-CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitutes the core challenge of the task. Existing methods often suffer from Perception M...
- MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation
Evaluating single-concept personalization in text-to-image diffusion requires measuring both concept preservation, which captures identity fidelity to a reference, and prompt following, which captures whether the generated scene matches the prompt. Existing metrics commonly compute these signals usi...