AI News Archive: May 25, 2026 — Part 12

Sourced from 500+ daily AI sources, scored by relevance.

Creative Quality Alignment: Expert Tacit Knowledge Transfer via Chain-of-Thought Fine-Tuning
This paper provides an empirical implementation of the creative quality metric proposed in Calibrated Surprise (Zou & Xu, 2026a). The question this paper addresses is: does this mathematical claim hold at the engineering level? To make the answer as general as possible, we deliberately choose the ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25977v1
QUIET: A Multi-Blank Cascaded Story Cloze Benchmark for LLM Creative Generation Capability
Large language models (LLMs) face a dual challenge in creative capability evaluation: existing benchmarks (e.g., Story Cloze Test, HellaSwag) measure models' discriminative ability over narrative continuation using multiple-choice recognition paradigms, rather than directly measuring creative genera...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25955v1
Step-TP: A Grounded, Step-Level Dataset with Chain-of-Thought Reasoning for LLM-Guided Tensor Program Optimization
Despite the strong reasoning capabilities of large language models (LLMs), optimizing the execution efficiency of tensor programs remains challenging due to the need for precise, composable transformation decisions. Recent LLM-guided approaches frame tensor program optimization as an iterative decis...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25954v1
VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding
Despite the remarkable progress achieved by recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on the high compression ratio of a single visual clue and reliance on the heuristic pruning strategy with coarse at...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25952v1
Quantitative Evaluation of the Severity of Posttraumatic Stress Disorder through Transfer Learning from Specific Phobia Data
Posttraumatic stress disorder (PTSD) is a prevalent and debilitating mental health condition with significant personal and societal impacts. Current clinical assessments of PTSD often rely on subjective evaluations, which can be time-consuming, costly, and prone to human bias. This study proposes a ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25933v1
Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning
While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint that applicable law must match the temporal context of each case, as retroactive application of statutes violates core legal principles and leads to ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25920v1
$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden re...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25893v1
Causal Tongue-Tie: LLMs Can Encode Causal Direction, But Their Yes/No Outputs Fail to Express
We find a mismatch between what large language models encode about a causal question and what they answer. On anti-commonsense CLadder items, a fixed linear probe recovers the evidence-supported answer from the model's hidden state (accuracy approximately 0.97), while the spoken Yes/No reverts to th...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25891v1
Explaining Too Much? Understanding How Large Language Model Reasoning Traces Influence Performance and Metacognition
Large Language Model interfaces are increasingly verbose, exposing intermediate reasoning traces alongside final answers. Traces are framed as transparency mechanisms, yet it is unclear how people use them to solve problems. We report a preregistered between-subjects study (N = 559) in which partici...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25856v1
TIAR: Trajectory-Informed Advantage Reweighting for LLM Abstention Learning
This paper investigates large language model (LLM) abstention learning, specifically using a ternary reward, which incentivizes truthfulness in large language models. This paper extends that idea by moving from a ternary reward to Trajectory-Informed advantage reweighting, which dynamically re-weights the ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25850v1
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning
Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. W...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25842v1
TTPrint: Evidence-Grounded TTP Extraction via Diverge-then-Converge Verification
Extracting MITRE ATT&CK techniques from cyber threat intelligence (CTI) reports is an open-set, multi-label problem requiring both high recall (not missing techniques) and high precision (not hallucinating unsupported ones). Existing methods (rule-based, supervised, and LLM-based) struggle to achiev...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25836v1
MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
We present MobileGym, a browser-hosted, lightweight, fully controllable environment for everyday mobile use, targeting interaction fidelity without replicating proprietary backends. It enables two capabilities previously out of reach for everyday apps: verifiable outcome signals through deterministi...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26114v1
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
This paper studies the next major bottleneck in agentic AI as system scaling, not only model scaling: the design of auditable, persistent, modular, and verifiable architectures around foundation models. We refer to this shift as scaling the harness: treating the structured execution layer around a f...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26112v1
Beyond Summaries: Structure-Aware Labeling of Code Changes with Large Language Models
Code review is a critical practice in software engineering, yet the growing scale and frequency of code patches in modern projects, together with the widespread adoption of AI code assistants, make manual review increasingly challenging. Identifying the types of changes within a patch, such as renam...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26100v1
Language Models Need Sleep
Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26099v1
Channel-wise Vector Quantization
We present Channel-wise Vector Quantization (CVQ), a novel image tokenization paradigm that replaces patch-wise tokens with channel-wise tokens. Unlike conventional vector quantization, which assigns a discrete token to each patch feature vector, CVQ quantizes each channel of the feature map. This f...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26089v1
Retrying vs Resampling in AI Control
AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26047v1
DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models
Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applicati...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26038v1
Everything at Every Scale: Scale-Invariant Diffusion with Continuous Super-Resolution
Creating images from noise is image generation; reconstructing fine details from coarse inputs is super-resolution. Despite their practical differences, both can be understood as reversing information loss across scales. We introduce SKILD, a Scale-invariant K-Space ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26032v1
Retrieval-Augmented Detection of Potentially Abusive Clauses in Chilean Terms of Service
Online Terms of Service often function as contracts of adhesion, creating asymmetries that may expose consumers to potentially abusive clauses. In Chile, assessing such clauses is legally challenging because some provisions clearly violate mandatory consumer law, whereas others depend on broader sta...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26019v1
AI-Assisted Systematization for Evaluating GenAI Systems
Evaluating generative AI (GenAI) systems is challenging because many targets of evaluation are broad, contested concepts, such as "reasoning," "fairness," or "creativity." When these concepts are left underspecified, it becomes unclear what should be measured or how evaluation results should be inte...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26001v1
Neural Scalable Symbolic Search Framework for Complex Logical Queries with Multiple Free Variables
Complex Query Answering (CQA) is a fundamental knowledge representation and reasoning task over incomplete knowledge graphs (KGs). Answering existential first-order queries with $k$ free variables (i.e., $\text{EFO}_k$ queries) is a crucial yet challenging problem, as it requires ranking answer tupl...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25985v1
LECTOR: Joint Optimization of Scientific Reasoning Graphs and Introduction Generation
AI Scientists have shown promising progress across multiple stages of the research pipeline, among which automatic scientific paper writing remains a formidable challenge. Writing the Introduction is especially challenging, as it demands not only linguistic fluency but also logical soundness and verifia...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25964v1
Continual Speaker Identity Unlearning with Minimal Interference
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25962v1
Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers
Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of parameters. We argue that scale is a poor substitute for architectural inductive bias in this domain: structured priors deliver outsized parameter efficiency, and the pa...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25949v1
EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory
Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insu...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25944v1
From Latent Space to Training Data: Explainable Specialization in Minimal MLPs
Here we study whether training biases can make hidden neurons specialize in minimal one-hidden-layer MLPs, and whether such specialization improves prototype-based reconstruction of the training dataset from the learned weights. We consider Gaussian-activation MLPs of width equal to dataset size and ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25939v1
Explore Before You Solve: The Speed-Depth Trade-off in Epistemic Agents for ARC-AGI-3
We systematically investigate all 25 public ARC-AGI-3 games and find that every one is reachable through non-intelligent strategies: 10 in a single blind step, 5 after one probing action, 1 via repeated ACTION1 presses, 1 via diverse exploration, and 8 via single repeated actions with sufficient bud...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25931v1
MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images
3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25861v1
From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch
The expansion of data centers (DCs) drives a sustained increase in electricity demand and associated water withdrawals at generation sites. These withdrawals occur at generation sites and are virtually allocated to demand based on network power flows. Consequently, the actual water footprint of a sp...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25854v1
Geometric Evolution Maps: Extracting Stable Concept Probes from Transformer Residual Streams
Concept probes extracted from transformer residual streams are only as reliable as the layer from which they are extracted. The common practice of probing at a fixed late layer or at the peak of a separation score function ignores a fundamental structural feature: concept representations undergo sub...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25848v1
Context-Instrumental Data Distillation for Kubernetes Manifest Generation: Method and Experimental Evaluation
This paper examines the specialization of Small Language Models (SLMs) with up to 4 billion parameters for generating artifacts in domain-specific languages (DSL). Kubernetes manifests are chosen as the target domain. We propose the context-instrumental data distillation method: the source corpus is...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25835v1
Prism: A Plug-in Reproducible Infrastructure for Scalable Multimodal Continual Instruction Tuning
Multimodal Large Language Models (MLLMs) achieve versatility by reformulating diverse tasks into a unified instruction-following framework via instruction tuning. However, real-world deployment requires continuous adaptation to emerging tasks, motivating Multimodal Continual Instruction Tuning (MCIT...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26110v1
WhoSaidIt: Human-LLM Collaborative Annotation for Text-Based Multilingual Speaker-Attribute Classification
Annotating speaker attributes from text is inherently ambiguous, particularly in multilingual settings where demographic and social cues are implicit and culturally variable. We propose a human-large language model (LLM) collaborative re-annotation framework for stabilizing multilingual speaker-attr...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26070v1
Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech
Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino-English code-switching is pervasive and no prior work has addressed NLP-based dem...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26007v1
MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models
Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.26004v1
When Do LLM Agents Treat Surface Noise Differently from Semantic Noise? A 68-Cell Measurement Study with a Held-Out Trace-Level Validation
We document an empirical phenomenon in chain-of-thought and ReAct agents driven by ten large language models from seven architecture families: meaning-bearing perturbations (e.g., paraphrase, synonym) alter final answers more often than presentation perturbations (e.g., formatting, reordering) of co...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25981v1
PolyGnosis 2.0: Enhancing LLM Reasoning via Agentic Harness Engineering for Polymarket and OSINT Insight Extraction
This paper introduces PolyGnosis 2.0, a pioneering multi-agent architecture designed to extract predictive intelligence by synthesizing Polymarket anomaly signals with global Open Source Intelligence (OSINT) streams, specifically Global Database of Events, Language, and Tone (GDELT). We define and t...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25958v1
Thaka at KSAA-2026 Task 2: Regularized Fine-Tuning for Arabic Speech Diacritization
We describe the winning system for Task 2 of the KSAA-2026 Shared Task on Arabic Speech Dictation with Automatic Diacritization. The task requires producing fully diacritized Arabic text from speech audio and undiacritized transcripts, with only 2,327 training samples available and no external data ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25928v1
Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT
Recent automated essay scoring (AES) studies increasingly use pretrained transformer models, but these models are usually pretrained on general-domain English and may under-represent second-language learner writing. This study investigates whether domain-adaptive continued pretraining (DAPT) on the ...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25924v1
Mitigating Provenance-Role Collapse in Long-Term Agents via Typed Memory Representation
Long-term memory is essential for persistent LLM agents, yet prevailing architectures store historical interactions as unstructured, flat text. This unconstrained storage induces provenance-role collapse, a critical failure mode where agents suffer from source-monitoring errors. To resolve this cogn...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25869v1
When Self-Belief Misleads: Active Label Acquisition for Reinforcement Learning with Verifiable Rewards
Large Language Models (LLMs) have achieved remarkable advancements in reasoning capabilities empowered by Reinforcement Learning with Verifiable Rewards (RLVR). Nonetheless, RLVR intrinsically relies on ground-truth labels for reward computation, the acquisition of which is often prohibitively expen...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25864v1
On the Limits of Model Merging for Multilinguality in Pre-Training
Endowing models with consistent multilingual performance can be achieved by mixing pre-training data, or post-training approaches such as language-specific model merging. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the ef...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25846v1
When Search Becomes Memory: Turning Robot Design Trials into Transferable Skills
Large language models (LLMs) are increasingly used as proposal generators for evolutionary robot design, yet most loops remain memoryless: simulator results shape the next population but are not preserved as reusable design knowledge. We present Auto-Robotist, a self-evolving LLM agent that distills...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25832v1
Clarify, Abstain or Answer? Strategising in Conversation with Belief-Augmented Generation
Large language models (LLMs) define a distribution over text, which can be viewed as a probabilistic representation of uncertainty: sampling K responses yields a belief state, that is, the set of responses a model deems plausible. Existing work exploits this representation for narrow tasks like either decoding or sele...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25831v1
Fine-Tuning Over Architectural Complexity: Broad-Coverage PII Detection on PIIBench with DeBERTa
Personally identifiable information (PII) detection systems are frequently trained within narrow source or domain boundaries, limiting coverage when deployed on heterogeneous text. We study model fine-tuning on a corrected multi-source PIIBench preparation spanning 82 retained entity types across te...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25816v1
Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution
Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled wo...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25814v1
Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation
Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotatio...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25781v1
Efficient Benchmarking Is Just Feature Selection and Multiple Regression
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression with feature selection, we find that existing efficient benchma...
📄 Research May 25, 2026 http://arxiv.org/abs/2605.25773v1