AI News Archive: May 21, 2026 — Part 18

Sourced from 500+ daily AI sources, scored by relevance.

Vector Policy Optimization: Training for Diversity Improves Test-Time Search
Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specifie...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22817v1
Reducing Political Manipulation with Consistency Training
Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which ...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22771v1
Understanding Data Temporality Impact on Large Language Models Pre-training
Large language models (LLMs) are typically trained on shuffled corpora, yielding models whose knowledge is frozen at train time and whose temporal grounding remains poorly understood. In this work, we study the impact of pre-training dynamics on the acquisition of time-sensitive factual knowledge, f...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22769v1
Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case st...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22732v1
AMEL: Accumulated Message Effects on LLM Judgments
Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated messa...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22714v1
Self-Policy Distillation via Capability-Selective Subspace Projection
Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the be...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22675v1
Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performan...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22654v1
Multi-Stage Training for Abusive Comment Detection in Indic Languages
In recent years social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can b...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22380v1
Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government
As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and economic realities. In this paper, we investigate public perceptions...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22650v1
Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsuper...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22620v1
Chinese sensorimotor and embodiment norms for 3,000 lexicalized concepts
Understanding how conceptual knowledge is grounded in bodily experience, and to what extent machine systems can acquire such knowledge without direct sensorimotor experience, are central questions in both cognitive science and embodied artificial intelligence research. Large-scale normative resource...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22616v1
Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents
Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilitie...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22608v1
Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion
Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22579v1
LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance
Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reason...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22567v1
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22536v1
Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning
Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stron...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22511v1
BeLink: Biomedical Entity Linking Meets Generative Re-Ranking
Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when a...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22501v1
In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks
The fundamental challenge of listening in multi-talker environments is a cognitive bottleneck, defined by the Ease of Language Understanding (ELU) model as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely for physical acoustics, failing...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22465v1
From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models
We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. ...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22462v1
Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation
Hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization. Given their scale, using Large Language Models (LLMs) to assist expert counterspeech (CS) writing has gained interest, yet prior work has addressed these phenomena separately. We bridge this gap by stud...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22435v1
DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22411v1
Unified Data Selection for LLM Reasoning
Effectively training Large Language Models (LLMs) for complex, long-CoT reasoning is often bottlenecked by the need for massive high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. To address this...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22389v1
Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning
Large language models are increasingly used as computational tools for modeling human-like behavior. We introduce a behavioral induction framework that modifies model policies through fine-tuning on structured decision-making tasks: using synthetic datasets inspired by maladaptive behavioral pattern...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22356v1
Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation an...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22258v1
Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL
Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a r...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22217v1
Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents
In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding m...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22203v1
Evaluating Commercial AI Chatbots as News Intermediaries
AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February ...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22785v1
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
As wearable and mobile devices become increasingly embedded in daily life, they offer a practical way to continuously sense human motion in the wild. But inertial signals are highly dependent on the sensing setup, including body location, mounting position, sensor orientation, device hardware, and s...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22715v1
Tokenization with Split Trees
We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken into a full binary tree using precomputed byte n-gram counts, independent of any vocabulary. Given a vo...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22705v1
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an e...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22643v1
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
While multi-task learning-based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most approaches focus on architectural designs yet remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discrimi...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22635v1
A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models
This tutorial develops diffusion models from the viewpoint of differential equations. We begin with the conditional Gaussian forward process and show that this path admits both an ordinary differential equation (ODE) representation and a stochastic differential equation (SDE) representation. Averagi...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22586v1
SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations
Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or p...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22564v1
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation
Instruction embedding models have become common among state-of-the-art models, yet they are evaluated using a single prompt per task. This single-point evaluation ignores a main problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical st...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22544v1
Reflecti-Mate: A Conversational Agent for Adaptive Decision-Making Support Through System 1 and System 2 Thinking
Making high-stakes personal decisions involves cognitive, emotional, and intuitive processes, and individuals differ in how they allocate attention across these modes. Integration of these processes has been shown to benefit decision making. Yet, most current decision-support systems focus primarily on s...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22509v1
Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation
Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross-lingual conversational capabilities, yet modeling culturally nuanced and context-dependent communication remains a critical bottleneck. Specifically, existing state-of-the-art models exhibit a severe prag...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22487v1
Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22476v1
Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings
We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages, English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English, ...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22391v1
Slideshot
Product demo videos, recorded by your AI agent
🧰 Tools | May 21, 2026 | https://www.producthunt.com/products/slideshot?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Boundary-targeted Membership Inference Attacks on Safety Classifiers
Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, ...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22373v1
TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation
Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chines...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22355v1
GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis
Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22228v1
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
Video Large Language Models (Video-LLMs) have made rapid progress on temporal video understanding, yet many fail at a basic perceptual primitive: signed image-plane motion direction. On simple videos of a single object moving left, right, up, or down, most Video-LLMs perform near chance, with above-...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22823v1
Cambrian-P: Pose-Grounded Video Understanding
Camera pose matters. The position and orientation of each viewpoint define a shared spatial coordinate frame that relates observations across video frames. Yet this signal is largely absent from multimodal LLMs (MLLMs) for video understanding, which process frames as isolated 2D snapshots, instead o...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22819v1
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
Robust training and validation of Autonomous Driving Systems (ADS) require massive, diverse datasets. Proprietary data collected by Autonomous Vehicle (AV) fleets, while high-fidelity, are limited in scale, in diversity of sensor configurations, and in geographic and long-tail behavioral coverage. ...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22809v1
Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition
Children with rare genetic diseases often exhibit distinctive facial phenotypes, yet developing computer vision systems for early diagnosis remains challenging due to extreme data scarcity, privacy constraints, and limited data sharing in pediatric settings. These challenges not only hinder automate...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22767v1
Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models
Vision-language models learn powerful multimodal embeddings, yet their internal semantics remain opaque. While sparse autoencoders (SAEs) can extract interpretable features, they rely on expanding the representation dimension, which compromises the original geometry and introduces redundancy. We int...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22679v1
What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining
CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision ...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22651v1
H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning
Parametric human models capture global pose but cannot represent the non-rigid surface dynamics of clothing and soft tissue. Generic scene flow estimates dense motion but breaks down on articulated bodies, where pixel-level supervision is also intractable to acquire. We introduce H-Flow, a dense hum...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22629v1
Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection
Object detection from Unmanned Aerial Vehicles (UAVs) is challenged by severe ego-motion, camera jitter, and large scale variations. While modern detectors perform well on static images, their direct application to UAV video often fails, particularly for small objects in dynamic scenes. Existing mot...
📄 Research | May 21, 2026 | http://arxiv.org/abs/2605.22605v1