AI News Archive: May 14, 2026 — Part 19

Sourced from 500+ daily AI sources, scored by relevance.

Do Coding Agents Understand Least-Privilege Authorization?
As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive surfaces. To study whether current models...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14859v1
A Deterministic Agentic Workflow for HS Tariff Classification: Multi-Dimensional Rule Reasoning with Interpretable Decisions
Harmonized System (HS) tariff classification is a high-stakes, expert-level task in which a free-form product description must be mapped to a specific six- or eight-digit code under the General Interpretive Rules (GIR), section notes, chapter notes, and Explanatory Notes. The difficulty lies not in ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14857v1
FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery
Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such a...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14854v1
IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification
Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation and verification methods still respectively face the challenges of generation infeasibility and verification insufficiency. To alleviate these limita...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14851v1
Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training
Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throu...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14773v1
TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability
Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14738v1
SceneFunRI: Reasoning the Invisible for Task-Driven Functional Object Localization
In real-world scenes, target objects may reside in regions that are not visible. While humans can often infer the locations of occluded objects from context and commonsense knowledge, this capability remains a major challenge for vision-language models (VLMs). To address this gap, we introduce Scene...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14704v1
NeuroAtlas: Benchmarking Foundation Models for Clinical EEG and Brain-Computer Interfaces
Foundation models (FMs) promise to extract unified representations that generalize across downstream tasks. They have emerged across fields, including electroencephalography (EEG), but it is less clear how effective they are in this particular field. Published evaluations differ in datasets, in the ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14698v1
Open Computer Use
Open-source Computer Use MCP for AI agents
🧰 Tools May 14, 2026 https://www.producthunt.com/products/open-computer-use?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs,...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14678v1
Agentic Design of Compositional Descriptors via Autoresearch for Materials Science Applications
Autoresearch offers a flexible paradigm for automating scientific tasks, in which an AI agent proposes, implements, evaluates, and refines candidate solutions against a quantitative objective. Here, we use composition-based materials-property prediction to test whether such agents can perform a task...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14671v1
How Sensitive Are Radiomic AI Models to Acquisition Parameters?
A main barrier for the deployment of AI radiomic systems in clinical routine is their drop in performance under heterogeneous multicentre acquisition protocols. This work presents a performance-oriented framework for quantifying scan parameter sensitivity of radiomic AI models, while identifying cli...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14667v1
MindGap: A Conversational AI Framework for Upstream Neuroplastic Intervention in Post-Traumatic Stress Disorder
Post-Traumatic Stress Disorder (PTSD) is fundamentally a neuroplastic problem: traumatic contact events encode over-reactive neural pathways through Hebbian long-term potentiation, producing hair-triggered amygdala-HPA stress cascades that fire before conscious awareness can intercept them. Existing ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14660v1
Action-Inspired Generative Models
We introduce Action-Inspired Generative Models (AGMs), a dual-network generative framework motivated by the observation that existing bridge-matching methods assign uniform regression weight to every stochastic transition in the transport landscape, regardless of whether a given bridge sample lies a...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14631v1
An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization
A common critique of neural combinatorial-optimization solvers is that they are less energy-efficient than CPU metaheuristics, given the operational energy cost of training them on GPUs. This paper examines the inferential step from "training is expensive" to "neural solvers are net-inefficient", wh...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14624v1
Do We Really Need External Tools to Mitigate Hallucinations? SIRA: Shared-Prefix Internal Reconstruction of Attribution
Large vision-language models (LVLMs) often hallucinate when language priors dominate weak or ambiguous visual evidence. Existing contrastive decoding methods mitigate this problem by comparing predictions from the original image with those from externally perturbed visual inputs, but such references...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14621v1
Do Composed Image Retrieval Benchmarks Require Multimodal Composition?
Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., com...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14787v1
Persian MusicGen: A Large-Scale Dataset and Culturally-Aware Generative Model for Persian Music
Persian music, with its unique tonalities, modal systems (Dastgah), and rhythmic structures, presents significant challenges for music generation models trained primarily on Western music. We address this gap by curating the first large-scale dataset of Persian songs, comprising over 900 hours high-...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14765v1
Uncertainty Quantification for Large Language Diffusion Models
Large Language Diffusion Models (LLDMs) are emerging as an alternative to autoregressive models, offering faster inference through higher parallelism. Similar to autoregressive LLMs, they remain prone to hallucinations, making reliable uncertainty quantification (UQ) crucial for safe deployment. How...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14570v1
Mining Subscenario Refactoring Opportunities in Behaviour-Driven Software Test Suites: ML Classifiers and LLM-Judge Baselines
Context. Behaviour-Driven Development (BDD) software test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14568v1
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy
Agentic reinforcement learning trains large language models using multi-turn trajectories that interleave long reasoning traces with short environment-facing actions. Common policy-gradient methods, such as PPO and GRPO, treat each token in a trajectory equally, leading to uniform credit assignment....
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14558v1
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective paradigm for improving the reasoning capabilities of large language models. However, RLVR training is often hindered by sparse binary rewards and weak credit assignment, resulting in ambiguous optimization signals and ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14539v1
GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations
Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built aroun...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14498v1
Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict
The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introd...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14473v1
Enjo Help Center
AI auto-builds your help centers that learn from your team
🧰 Tools May 14, 2026 https://www.producthunt.com/products/enjo-help-center?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
When Answers Stray from Questions: Hallucination Detection via Question-Answer Orthogonal Decomposition
Hallucination detection in large language models (LLMs) requires balancing accuracy, efficiency, and robustness to distribution shift. Black-box consistency methods are effective but demand repeated inference; single-pass white-box probes are efficient yet treat answer representations in isolation...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14449v1
Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture
Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ s...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14448v1
A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR
In hybrid automatic speech recognition (ASR) systems, the vocabulary size is unambiguous, typically determined by the number of phones, bi-phones, or tri-phones present in the language. In contrast, end-to-end ASR systems derive their vocabulary, often referred to as tokens, from the text corpus used...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14427v1
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
While LLMs are increasingly used in commercial services, they pose privacy risks such as leakage of sensitive personally identifiable information (PII). For LLMs trained on multilingual corpora, Multilingual Machine Unlearning (MMU) aims to remove information across multiple languages. However, prio...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14404v1
Agentic Recommender System with Hierarchical Belief-State Memory
Memory-augmented LLM agents have advanced personalized recommendation, yet existing approaches universally adopt flat memory representations that conflate ephemeral signals with stable preferences, and none provides a complete lifecycle governing how memory should evolve. We propose MARS (Memory-Aug...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14401v1
Where Should Diffusion Enter a Language Model? Geometry-Guided Hidden-State Replacement
Continuous diffusion language models lag behind autoregressive transformers, partly because diffusion is applied in spaces poorly suited to language denoising and token recovery. We propose DiHAL, a geometry-guided diffusion-transformer hybrid that asks where diffusion should enter a pretrained tran...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14368v1
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax
Extending large language models (LLMs) to low-resource languages often incurs an "alignment tax": improvements in the target language come at the cost of catastrophic forgetting in general capabilities. We argue that this trade-off arises from the rigidity of supervised fine-tuning (SFT), which enfo...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14366v1
LLM-based Detection of Manipulative Political Narratives
We present a new computational framework for detecting and structuring manipulative political narratives, a task that has become more important due to the shift of political discussions to social media. One of the primary challenges is differentiating between manipulative political narratives an...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14354v1
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding
Discrete diffusion language models improve generation efficiency through parallel token prediction, but standard $X_0$ prediction methods introduce factorization errors by approximating the clean token posterior with independent token-wise distributions. This paper proposes Factorization-Error-Free ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14305v1
To See is Not to Learn: Protecting Multimodal Data from Unauthorized Fine-Tuning of Large Vision-Language Model
The rapid advancement of Large Vision-Language Models (LVLMs) is increasingly accompanied by unauthorized scraping and training on multimodal web data, posing severe copyright and privacy risks to data owners. Existing countermeasures, such as machine unlearning and watermarks, are inherent post-hoc...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14291v1
Web Agents Should Adopt the Plan-Then-Execute Paradigm
ReAct has become the default architecture across LLM agents, and many existing web agents follow this paradigm. We argue that it is the wrong default for web agents. Instead, web agents should default to plan-then-execute: commit to a task-specific program before observing runtime web content, then ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14290v1
Auditing Agent Harness Safety
LLM agents increasingly run inside execution harnesses that dispatch tools, allocate resources, and route messages between specialized components. However, a harness can return a correct, benign answer over a trajectory that accesses unauthorized resources or leaks context to the wrong agent. Output...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14271v1
Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems
Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HE...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14259v1
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
Extending the context window of large language models typically requires training on sequences at the target length, incurring quadratic memory and computational costs that make long-context adaptation expensive and difficult to reproduce. We propose EndPrompt, a method that achieves effective conte...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14589v1
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation
Automated code documentation is essential for modern software development, providing the contextual grounding that both human developers and coding agents rely on to navigate large codebases. Existing repository-level approaches process components independently, causing redundant retrieval and confl...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14563v1
DoDocs inc
AI OS automating accounting docs & reconciliation
🧰 Tools May 14, 2026 https://www.producthunt.com/products/invoice-matchpoint-by-dodocs-ai?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space
This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14531v1
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
Holistic evaluation scores capture overall output quality but do not distinguish whether a model reproduced the structural form of a user's request from whether it preserved the user's specific intent. We propose a dimension-level intent fidelity evaluation framework, applied here through a structur...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14517v1
LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets,...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14454v1
SWE-Chain: Benchmarking Coding Agents on Chained Release-Level Package Upgrades
Coding agents powered by large language models are increasingly expected to perform realistic software maintenance tasks beyond isolated issue resolution. Existing benchmarks have shifted toward realistic software evolution, but they rarely capture continuous maintenance at the granularity of packag...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14415v1
Nexus: An Agentic Framework for Time Series Forecasting
Time series forecasting is not just numerical extrapolation, but often requires reasoning with unstructured contextual data such as news or events. While specialized Time Series Foundation Models (TSFMs) excel at forecasting based on numerical patterns, they remain unaware of real-world textual sign...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14389v1
NodeSynth: Socially Aligned Synthetic Data for AI Evaluation
Recent advancements in generative AI facilitate large-scale synthetic data generation for model evaluation. However, without targeted approaches, these datasets often lack the sociotechnical nuance required for sensitive domains. We introduce NodeSynth, an evidence-grounded methodology that generate...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14381v1
Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation
Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generativ...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14380v1
Herculean: An Agentic Benchmark for Financial Intelligence
As AI agents improve, the central question is no longer whether they can solve isolated well-defined financial tasks, but whether they can reliably carry out financial professional work. Existing financial benchmarks offer only a partial view of this ability, as they primarily evaluate static compet...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14355v1
Dynamic Latent Routing
We investigate the temporal concatenation of sub-policies in Markov Decision Processes (MDPs) with time-varying reward functions. We introduce General Dijkstra Search (GDS), and prove that globally optimal goal-reaching policies can be recovered through temporal composition of intermediate optimal su...
📄 Research May 14, 2026 http://arxiv.org/abs/2605.14323v1