AI News Archive: May 25, 2026 — Part 13

Sourced from 500+ daily AI sources, scored by relevance.

Efficient Benchmarking Is Just Feature Selection and Multiple Regression
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchma...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25773v1
StreamProfileBench: A Benchmark for Fine-Grained User Profile Inference in Real-World Streaming Scenarios
Large Language Models (LLMs) have reshaped user profiling, yet current evaluations mainly focus on static data snapshots. This paradigm overlooks the reality of personalized systems, where User-Generated Content (UGC) arrives continuously and fine-grained profiles evolve rapidly. To bridge this gap, ...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25758v1
Selective Latent Thinking: Adaptive Compression of LLM Reasoning Chains
Explicit chain-of-thought (CoT) reasoning substantially improves the reasoning ability of large language models (LLMs), but incurs high inference cost due to lengthy autoregressive traces. Existing latent reasoning methods offer a promising alternative, yet they often treat reasoning as uniformly co...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25745v1
PowLU: An Activation Function for Stable Pre-Training of LLMs
In contemporary large language models (LLMs), the swish-gated linear unit (SwiGLU) activation function is widely adopted to regulate the information flow and introduce non-linearity. For large positive inputs, SwiGLU approximates the quadratic function $x^2$, providing strong nonlinearity and expres...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25704v1
Neural Router: Semantic Content Matching for Agentic AI
Large language models (LLMs) can serve as the semantic-matching engine of a content-based publish/subscribe broker for agentic AI across the edge-cloud computing continuum, bridging the vocabulary and modality gaps that defeat keyword and embedding filters. Framed as offline multi-label retrieval ov...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25701v1
Testing the Deliteralization Hypothesis in Human and Machine Translation
The recent shift from dedicated NMT systems to general-purpose LLMs has reshaped machine translation, with LLMs reported to produce more fluent, less literal output than their predecessors. We test whether this shift extends to the deliteralization hypothesis, the long-standing claim from translatio...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25686v1
Orchestria
AI music engine with granular stem control
🧰 Tools | May 25, 2026 | https://www.producthunt.com/products/orchestria?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Automated Benchmark Auditing for AI Agents and Large Language Models
Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchma...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26079v1
Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use
We test the standard RLVR tool-use recipe -- GRPO on Qwen2.5-7B-Instruct -- on a deliberately minimal knowledge-graph tool API: four Freebase navigation verbs over Complex WebQuestions. Under a self-verifiable retrieval reward, the policy's tool-grounded answer rate climbs from 3.8% to 9.6% ov...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26037v1
STORM: Internalized Modeling for Spatial-Temporal Reasoning in Video-Language Models
Many video reasoning tasks require tracking motion, temporal order, and evolving visual states across frames. Existing methods built on large vision-language models (LVLMs) often address this challenge by externalizing reasoning through textual chain-of-thought (CoT), keyframe selection, repeated fr...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26014v1
What Makes a Medical Checker Trainable? Diagnosing Signal Collapse and Reward Hacking in Checker-Guided RAG for Biomedical QA
Medical RAG needs evidence-grounded claims, so plugging a claim-level NLI checker into retrieval-augmented RL is intuitive. We find that the checker's output distribution during training, not its held-out accuracy, decides whether it provides trainable gradient. We compare four NLI c...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25988v1
Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents
While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to pre...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25971v1
Triplet-Block Diffusion RWKV
Causal Transformer language models suffer from strictly sequential decoding and a quadratic per-step attention cost. While linear-time causal models and discrete diffusion models each address these weaknesses, their integration remains inherently inconsistent: diffusion requires bidirectional attent...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25969v1
Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training
We test whether the optimal learning-rate schedule depends on bit-width during from-initialisation quantisation-aware training (QAT) for sub-100M decoder language models. A 720-run factorial grid (Phase 2) over bit-width x warmdown fraction x LR magnitude x model size x seed (FP16/INT8/INT6, 15M-100...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25966v1
Universal Activation Verbalizer: A Unified Framework for Cross-Model Activation Explanation
Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activat...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25903v1
SAMark: A Self-Anchored Text Watermarking with Paragraph-Level Paraphrase Robustness
Semantic-level watermarking (SWM) improves robustness against text modifications by treating sentences as the basic unit. However, robustness to paragraph-level paraphrasing remains difficult because such attacks globally disrupt watermark signals by changing sentence order. In this work, we propose...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25796v1
Trait-Aware Policy Optimization for Autoregressive Multi-Trait Essay Scoring
Multi-trait essay scoring aims to provide fine-grained evaluation of writing quality across multiple dimensions. However, how to effectively post-train autoregressive scoring models remains underexplored. In this paper, we propose Trait-Aware Policy Optimization (TAPO), a post-training framework tai...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25731v1
CMAP: Cross-Modal Adaptive Prompting for Multi-Domain Task-Incremental Learning
Multi-domain task-incremental learning requires a model to sequentially acquire knowledge across visually diverse domains without forgetting prior tasks, and without access to task identity at inference. Parameter-efficient methods built on frozen vision-language models have made strong progress, ye...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25708v1
From Facts to Insights: A Persona-Driven Dual Memory Framework and Dataset for Role-Playing Agents
While role-playing agents excel in short-term interactions, long-term conversations overwhelm context windows, motivating external memory frameworks. Current systems typically rely on persona-agnostic summarization, which records facts without persona-specific interpretation, yielding generic respon...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25693v1
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose ph...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26087v1
Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning
Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of ...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26078v1
Paris 2.0: A Decentralized Diffusion Model for Video Generation
We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic G...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26064v1
Length Generalization with Log-Depth Recurrent Units
Length generalization remains a persistent challenge for neural networks: recurrent models tend to suffer from positional biases, while transformers are constrained by fixed computational depth. Regular languages provide a frequently used testbed for evaluating length generalization, as label predic...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26035v1
Causal methods for LLM development and evaluation
Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of add...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25998v1
Fuzzy PyTorch: Rapid Numerical Variability Evaluation for Deep Learning Models
We introduce Fuzzy PyTorch, a framework for rapid evaluation of numerical variability in deep learning (DL) models. As DL is increasingly applied to diverse tasks, understanding variability from floating-point arithmetic is essential to ensure robust and reliable performance. Tools assessing such va...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25991v1
STaT: Resolving Shape Distortion in Non-Stationary Time Series via Tri-Modal Synergy
Recent research in time series forecasting frequently investigates the integration of textual and visual modalities with numerical models to better navigate non-stationary environments. Despite delivering solid numerical results, existing multi-modal approaches usually encounter a dilemma: prioritiz...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25943v1
Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?
The effectiveness of multi-agent LLM deliberation depends not only on the agents' individual predictions, but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin-Johnsen (FJ) opinion dynamics, a tractable model for analyzing stubbornness, influence, and...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25929v1
Joint Optimization of Training and Inference in Federated Edge Learning via Constrained Multi-Objective Deep Reinforcement Learning
Federated edge learning (FEEL) has recently emerged as a promising paradigm for achieving edge intelligence (EI) via enabling collaborative model training across edge devices while protecting data privacy. In this paper, we put forth an online optimization framework that jointly manages federated tr...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25916v1
Predicting Stock Price Direction on Earnings Announcement Days using Multi-modal Deep Learning
Predicting stock price movements during Earnings Announcements (EAs) is a significant challenge due to market noise and high-impact price discontinuities. In this study, we evaluate whether pre-announcement news sentiment, firm fundamentals, and recent market dynamics jointly predict the directional...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25894v1
Merge-Bench: Resolve Merge Conflicts with Large Language Models
This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the merge resolution that developers committed to the repository. ...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25890v1
Conformalised imprecise inference for robust extrapolation under limited data
Recent advances in uncertainty quantification increasingly emphasise the distinction between aleatory and epistemic uncertainty in machine learning, motivating the need for more unified frameworks. However, despite much progress in producing reliable predictions, existing methods often lack rigorous...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25882v1
Looped Diffusion Language Models
Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models for language modeling, yet the effective design of transformer architectures for MDMs remains underexplored. In this paper, we show that selectively looping the early-middle transformer layers significant...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26106v1
Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay
Models trained on a new task typically degrade on prior tasks, a phenomenon known as forgetting. Traditionally, mitigating forgetting has required replaying stored exemplars from prior tasks, which is often impractical. By contrast, language models can sample from their own training distribution, an...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26097v1
Active Query Synthesis for Preference Learning
Efficient learning of user preferences is crucial for many modern decision making systems but typically requires costly labeled data. Active learning reduces this cost, yet standard methods are computationally expensive due to pool-based evaluation. Further, most methods assume all query feedback is...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26072v1
Accelerating Bayesian inverse design in computational fluid dynamics using neural operators
Bayesian inverse design provides a principled framework for inferring aerodynamic geometries from sparse flow observations while quantifying uncertainty. However, its practical use in computational fluid dynamics (CFD) is severely limited by the cost of repeated high-fidelity simulations required fo...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26059v1
Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio
As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25967v1
Building an Adversarial Malware Dataset by Family and Type: Generation, Evasion, and Poisoning Evaluation
We present a dataset of adversarial malware samples derived from the public RawMal-TF collection of real-world malware binaries. Using a suite of adversarial malware generators, we construct two sets of adversarial PE files: 44,347 family-labelled samples and 33,596 type-labelled samples, achieving ...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25937v1
Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing
Narrowly finetuned language models memorize implanted content verbatim, but auditing what a deployed model has been taught, without access to its weights or training data, remains an open challenge. Recent work shows that activation differences between base and finetuned models carry readable traces...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25902v1
LLMTest
Use the right LLMs in your apps. Set up fallbacks. Be happy.
🧰 Tools | May 25, 2026 | https://www.producthunt.com/products/llmtest-2?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models
Vision-Language-Action (VLA) models are increasingly deployed on real robots, where each predicted action is executed and each failure carries a safety cost. They reach high success rates on clean inputs but collapse under small adversarial perturbations. A $16/255$ PGD attack on OpenVLA-7B drops LI...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25889v1
The Quantization Benefits of Residual-Free Transformers
Large-scale transformer training and deployment are increasingly constrained by the transfer of activations, gradients, and optimizer states across accelerators. Low-bit quantization offers a natural remedy, but transformer activations are often heavy-tailed and outlier-dominated, making simple quan...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25880v1
The Timing Dependencies of Trust: Speed, Accuracy, and cBCI Neuro-Decoupling in Human-AI Teams
The speed and accuracy of an artificial teammate fundamentally alter the failure states of Human-AI integration. While high-speed AI interventions risk inducing reflexive blind compliance, delayed interventions can induce ambiguous cognitive conflict. This study investigates how the fundamental char...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25868v1
UNATE: UNsupervised ATomic Embedding for crystal structures property prediction
Accurately predicting crystal properties is critical for accelerating materials discovery, but it is often limited by scarce labeled data and costly theoretical calculations. To alleviate this, we propose UNATE (Unsupervised Atomic Embedding), a framework that leverages structural information extrac...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25866v1
The Behavioral Credibility Trilemma: When Calibrated Autonomy Becomes Impossible
We prove that no reinforcement learning policy with confidence-gated autonomy can simultaneously achieve maximum helpfulness, optimal calibration, and full autonomy under rational oversight, whenever some tasks exceed the agent's reliable competence: the Behavioral Credibility Trilemma. The impossib...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25739v1
Stein-Encoder: A White-Box Supervised Encoder via Stein Identities in Multi-Modal Studies
In multi-modal biomedical research, integrating high-dimensional genomic data with clinical baselines is essential for precision medicine. However, standard deep neural network approaches often entangle these modalities, obscuring the specific predictive impact of genetic features and leading to pos...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25734v1
Learning Sparse Compositional Functions with Norm-Constrained Neural Networks
The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates based on parameter counts and sample complexity guarantees ...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25608v1
Different Statistical Perspectives for Understanding Generalisation in Graph Neural Networks
Graph Neural Networks (GNNs) are currently the most popular approach for learning and prediction on graph-structured data and are deployed in various fields, from social network analysis to drug discovery. However, there is limited mathematical understanding of the performance of GNNs. We discuss the...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25452v1
Goal-driven Bayesian Optimal Experimental Design for Robust Decision-Making Under Model Uncertainty
Bayesian optimal experimental design (BOED) selects experiments to maximize information gain about model parameters. However, in decision-critical settings, reducing parameter uncertainty does not necessarily improve downstream decisions, as only specific parameter directions relevant to the objecti...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.26093v1
Geometry Adaptive Counterfactual Distribution Learning with Diffusion-Guided Smoothing
We study counterfactual distribution learning for high-dimensional outcomes whose counterfactual law may concentrate near lower-dimensional structure. Standard isotropic smoothing treats all ambient directions equally, leading to unfavorable scaling and unstable local inference. We propose two diffu...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25811v1
StrTransformer: Source-Wise Structured Transformers for Unsupervised Blind Source Recovery
This paper proposes StrTransformer, a source-wise structured Transformer framework for blind source recovery and branch-wise latent modeling. Instead of using an encoder to infer latent variables, StrTransformer directly optimizes the latent source matrix together with an observation-space mixer and...
📄 Research | May 25, 2026 | http://arxiv.org/abs/2605.25648v1