AI News Archive: May 5, 2026 — Part 26

Sourced from 500+ daily AI sources, scored by relevance.

The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models
Conceptual analysis -- proposing definitions and refining them through counterexamples -- is central to philosophical methodology. We study whether language models can perform this task through iterated analysis and repair chains: one model instance generates counterexamples to a proposed definition...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03936v1
Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems
Frontier AI systems perform best in settings with clear, stable, and verifiable objectives, such as code generation, mathematical reasoning, games, and unit-test-driven tasks. They remain less reliable in open-ended settings, including scientific assistance, long-horizon agents, high-stakes advice, ...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03900v1
Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework
Individuals frequently form deep attachments to physical objects (e.g., plush toys) that usually cannot sense or respond to their emotions. While AI companions offer responsiveness and personalization, they exist independently of these physical objects and lack an ongoing connection to them. To brid...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03882v1
DMGD: Train-Free Dataset Distillation with Semantic-Distribution Matching in Diffusion Models
Dataset distillation enables efficient training by distilling the information of large-scale datasets into significantly smaller synthetic datasets. Diffusion based paradigms have emerged in recent years, offering novel perspectives for dataset distillation. However, they typically necessitate addit...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03877v1
Quantifying the human visual exposome with vision language models
The visual environment is a fundamental yet unquantified determinant of mental health. While the concept of the environmental exposome is well established, current methods rely on coarse geospatial proxies or biased self reports, failing to capture the first person visual context of daily life. We a...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03863v1
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic data science (ADS) systems are rapidly improving their capability to autonomously analyze, fit, and interpret data, potentially moving towards a future where agents conduct the vast majority of data-science work. However, current ADS systems use statistical tools designed to be interpretable...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03808v1
What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity
To navigate partially observable visual environments, recent VLM agents increasingly internalize world modeling capabilities into their policies via explicit CoT reasoning, enabling them to mentally simulate futures before acting. However, relying solely on passive reasoning over visited states is i...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03782v1
Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks
While Large Vision-Language Models (LVLMs) offer powerful capabilities, they pose privacy risks by unintentionally memorizing sensitive personal information. Current unlearning benchmarks attempt to mitigate this using fictitious identities but overlook a critical stage 1 failure: models fail to eff...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03759v1
A Workflow-Oriented Framework for Asynchronous Human-AI Collaboration in Hybrid and Compute-Intensive HPC Environments
Human involvement is critical in training and deploying AI systems in high-stakes defence and security contexts. However, real-time interaction is impractical in HPC environments due to compute intensity and resource constraints. We present a workflow framework that enables asynchronous human-AI col...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03743v1
MEMTIER: Tiered Memory Architecture and Retrieval Bottleneck Analysis for Long-Running Autonomous AI Agents
Long-running autonomous AI agents suffer from a well-documented memory coherence problem: tool-execution success rates degrade 14 percentage points over 72-hour operation windows due to four compounding failure modes in existing flat-file memory systems. We present MEMTIER, a tripartite memory archi...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03675v1
FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers
Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance-level via segmenting views and encodin...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03669v1
Stage Light is Sequence$^2$: Multi-Light Control via Imitation Learning
Music-inspired Automatic Stage Lighting Control (ASLC) has gained increasing attention in recent years due to the substantial time and financial costs associated with hiring and training professional lighting engineers. However, existing methods suffer from several notable limitations: the low inter...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03660v1
AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics
Video generation models internalize physical realism as their prior. Anime deliberately violates physics: smears, impact frames, chibi shifts; and its thousands of coexisting artistic conventions yield no single "physics of anime" a model can absorb. Physics-biased models therefore flatten the artis...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03652v1
Self-Improvement for Fast, High-Quality Plan Generation
Generative models trained on synthetic plan data are a promising approach to generalized planning. Recent work has focused on finding any valid plan, rather than a high-quality solution. We address the challenge of producing high-quality plans, a computationally hard problem, in sub-exponential time...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03625v1
Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models
Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer block...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03609v1
Pickle
Personal memory layer that works across AI apps
🧰 Tools | May 5, 2026 | https://www.producthunt.com/products/pickle-10?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments
Large Language Models (LLMs) are prone to factual hallucinations, risking their reliability in real-world applications. Existing hallucination detectors mainly extract micro-level intrinsic patterns for uncertainty quantification or elicit macro-level self-judgments through verbalized prompts. Howev...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03971v1
CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes m...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03903v1
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
This preprint presents a systematic, research-oriented practicum that guides the reader through the entire modern NLP pipeline: from tokenisation and vectorisation to fine-tuning of large language models, retrieval-augmented generation, and reinforcement learning from human feedback. Twelve hands-on...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03799v1
TriBench-Ko: Evaluating LLM Risks in Judicial Workflows
Large language models (LLMs) are increasingly integrated into legal workflows. However, existing benchmarks primarily address proxy tasks, such as bar examination performance or classification, which fail to capture the performance and risks inherent in day-to-day judicial processes. To address this...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03792v1
Benchmarking Parameter-Efficient Fine-Tuning of Large Language Models for Low-Resource Tajik Text Generation with the Tajik Web Corpus
This paper is devoted to the adaptation of generative large language models for the Tajik language, a low-resource language with Cyrillic script. To overcome the shortage of digital text resources, the author created and publicly released the Tajik Web Corpus, the largest open-access corpus of Tajik...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03742v1
A Comprehensive Analysis of Tokenization and Self-Supervised Learning in End-to-End Automatic Speech Recognition applied on French Language
The performance of end-to-end automatic speech recognition (ASR) systems enables their increasing integration into numerous applications. While there are various benefits to such speech-to-text systems, the choice of hyperparameters and models plays a crucial role in their performance. Typically, th...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03696v1
A Paradigm for Interpreting Metrics and Identifying Critical Errors in Automatic Speech Recognition
The most commonly used metrics for evaluating automatic speech transcriptions, namely Word Error Rate (WER) and Character Error Rate (CER), have been heavily criticized for their poor correlation to human perception and their inability to take into account linguistic and semantic information. While ...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03671v1
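For readers unfamiliar with the two metrics named in the entry above, here is a minimal sketch of how WER and CER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over words or characters respectively. It is not taken from the paper. The function names and sample sentences are illustrative assumptions, and whitespace tokenization with no text normalization is a simplification.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    reference = "i do not agree"
    print(wer(reference, "i do agree"))         # one deletion over four words: 0.25
    print(wer(reference, "i do not uh agree"))  # one insertion over four words: 0.25
```

Both hypotheses score the same WER, although the first reverses the meaning and the second only adds a filler word. That is the kind of gap between metric and perceived severity that the abstract describes.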
Annotation Quality in Aspect-Based Sentiment Analysis: A Case Study Comparing Experts, Students, Crowdworkers, and Large Language Model
Aspect-Based Sentiment Analysis (ABSA) enables fine-grained opinion analysis by identifying sentiments toward specific aspects or targets within a text. While ABSA has been widely studied for English, research on other languages such as German remains limited, largely due to the lack of high-quality...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03624v1
BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA
This paper presents the joint participation of the BIT.UA and AAUBS groups in the ArchEHR-QA 2026 shared task, which focuses on clinical question answering and evidence grounding in a low-resource setting. Due to the absence of training data and the strict data privacy constraints inherent to the he...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03618v1
AfriVox-v2: A Domain-Verticalized Benchmark for In-the-Wild African Speech Recognition
Recent large language models (LLMs) show strong speech recognition and translation capabilities for high-resource languages. However, African languages remain dramatically underrepresented in benchmarks, limiting their practical use in low-resource settings. While early benchmarks tested African lan...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03590v1
SURE-RAG: Sufficiency and Uncertainty-Aware Evidence Verification for Selective Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) grounds answers in retrieved passages, but retrieval is not verification: a passage can be topical and still fail to justify the answer. We frame this gap as evidence sufficiency verification for selective RAG answering: given a question, a candidate answer, and ...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03534v1
Revisiting Graph-Tokenizing Large Language Models: A Systematic Evaluation of Graph Token Understanding
The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph tasks. As a widely recognized paradigm, Graph-Tokenizing LLMs (GTokenLLMs) compress complex graph data into graph tokens and treat them as prefix tokens for queryi...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03514v1
CuraView: A Multi-Agent Framework for Medical Hallucination Detection with GraphRAG-Enhanced Knowledge Verification
Discharge summaries require extracting critical information from lengthy electronic health records (EHRs), a process that is labor-intensive when performed manually. Large language models (LLMs) can improve generation efficiency; however, they are prone to producing faithfulness hallucinations, stat...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03476v1
Retrieving Floods without Floodlights: Topic Models as Binary Classifiers for Extreme Climate Events in German News
In studies of media coverage of extreme climate events, NLP methods have become indispensable for identifying relevant texts in large news databases. Still, enough annotated data to train accurate deep learning-based classifiers from scratch is often not available. Topic Models have the advantage of...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03450v1
Sentiment Analysis of Indonesian Spotify Reviews Using Machine Learning and BiLSTM
This paper benchmarks classical machine learning and deep learning approaches for three-class sentiment classification of Indonesian Spotify reviews. Using 100,000 scraped reviews and 70,155 cleaned samples, the study compares Support Vector Machine, Multinomial Naive Bayes, and Decision Tree models...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03443v1
Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis
Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems -- using formalisms such as set theory, formal logic, and quantum mechanics -- byp...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03441v1
A Comparison of Traditional Machine Learning Algorithms and LSTM-Based Deep Learning Models for Email Sentiment Analysis
The rapid growth of electronic communication has necessitated more robust systems for email classification and sentiment detection. This study presents a comparative performance analysis between traditional machine learning algorithms and deep learning architectures, specifically focusing on Support...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03440v1
Benchmarking Logistic Regression, SVM, Naive Bayes, and IndoBERT Fine-Tuning for Sentiment Analysis on Indonesian Product Reviews
The exponential growth of e-commerce platforms in Indonesia has generated a massive volume of user-generated product reviews. Analyzing the sentiment of these reviews is critical for measuring customer satisfaction and identifying product issues at scale. This paper benchmarks traditional Machine Le...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03439v1
From prompting to evidence-based translation: A RAG+prompt system for Japanese-Chinese translation and its pedagogical potential
Large language models perform well on high-resource pairs but are less reliable for Japanese-Chinese sentences containing noun-modifying clause constructions (NMCCs). This study evaluates a retrieval-augmented generation RAG+Prompt translation system that integrates linguistic analysis, embedding-ba...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03387v1
Two Calls, Two Moments, and the Vote-Accuracy Curve of Repeated LLM Inference
Repeated sampling is a standard way to spend test-time compute, but its benefit is controlled by the latent distribution of correctness across examples, not by one-call accuracy alone. We study the binary correctness layer of repeated LLM inference under conditional-i.i.d. calls. One labeled call id...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03379v1
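The claim in the entry above, that the benefit of repeated sampling depends on how correctness is distributed across examples rather than on one-call accuracy alone, can be illustrated with a small calculation. This is a sketch, not the paper's method. It assumes each example has a fixed per-call probability of a correct answer, that calls are independent given the example, and that a majority vote over an odd number of calls is correct exactly when more than half of the calls are correct. The function names and sample populations are hypothetical.

```python
from math import comb


def majority_correct_prob(p, k):
    """P(more than half of k independent calls are correct), per-call accuracy p."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number to avoid ties")
    need = k // 2 + 1  # smallest number of correct calls that wins the vote
    return sum(comb(k, m) * p**m * (1 - p) ** (k - m) for m in range(need, k + 1))


def vote_accuracy(per_example_p, k):
    """Average majority-vote accuracy over a population of examples."""
    return sum(majority_correct_prob(p, k) for p in per_example_p) / len(per_example_p)


if __name__ == "__main__":
    # Two hypothetical populations with the same one-call accuracy (0.6).
    uniform = [0.6] * 10             # every example is moderately easy
    split = [0.9] * 5 + [0.3] * 5    # half easy, half hard
    for name, population in (("uniform", uniform), ("split", split)):
        print(name, round(vote_accuracy(population, 1), 3),
              round(vote_accuracy(population, 3), 3))
```

Under these assumptions, three-call voting raises accuracy in the uniform population from 0.6 to about 0.648, while in the split population it drops slightly to about 0.594, because voting amplifies errors on examples where the per-call accuracy is below one half.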
RAG over Thinking Traces Can Improve Reasoning Tasks
Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the c...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03344v1
When to Think, When to Speak: Learning Disclosure Policies for LLM Reasoning
In single-stream autoregressive interfaces, the same tokens both update the model state and constitute an irreversible public commitment. This coupling creates a silence tax: additional deliberation postpones the first task-relevant content, while naive early streaming risks premature ...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03314v1
SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification
De-identification of clinical text remains essential for secondary use of electronic health records (EHRs), yet public benchmarks such as i2b2 2006/2014 are over a decade old and lack the semantic and demographic diversity of modern narratives. While Large Language Models (LLMs) achieve state-of-the...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03301v1
LLM-XTM: Enhancing Cross-Lingual Topic Models with Large Language Models
Cross-lingual topic modeling aims to discover shared semantic structures across languages, yet existing models depend on sparse bilingual resources and often yield incoherent or weakly aligned topics. Recent LLM-based refinements improve interpretability but are costly, document-level, and prone to ...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03299v1
Transformers with Selective Access to Early Representations
Several recent Transformer architectures expose later layers to representations computed in the earliest layers, motivated by the observation that low-level features can become harder to recover as the residual stream is repeatedly transformed through depth. The cheapest among these methods add stat...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03953v1
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers
Transformers are effective at inferring the latent task from context via two inference modes: recognizing a task seen during training, and adapting to a novel one. Recent interpretability studies have identified from middle-layer representations task-specific directions, or task vectors, that steer ...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03780v1
Rose-SQL: Role-State Evolution Guided Structured Reasoning for Multi-Turn Text-to-SQL
Recent advances in Large Reasoning Models (LRMs) trained with Long Chain-of-Thought have demonstrated remarkable capabilities in code generation and mathematical reasoning. However, their potential in multi-turn Text-to-SQL tasks remains largely underexplored. Existing approaches typically rely on u...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03720v1
Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies
Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largel...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03596v1
Detecting Stealth Sycophancy in Mental-Health Dialogue with Dynamic Emotional Signature Graphs
As conversational AI therapists are increasingly used in psychological support settings, reliable offline evaluation of therapeutic response quality remains an open problem. This paper studies multi-domain support-dialogue evaluation without relying on large language models as final judges. We use a...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03472v1
FINER-SQL: Boosting Small Language Models for Text-to-SQL
Large language models have driven major advances in Text-to-SQL generation. However, they suffer from high computational cost, long latency, and data privacy concerns, which make them impractical for many real-world applications. A natural alternative is to use small language models (SLMs), which en...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03465v1
Gecko
AI that runs equipment rental businesses
🧰 Tools | May 5, 2026 | https://www.producthunt.com/products/gecko?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
The Right Answer, the Wrong Direction: Why Transformers Fail at Counting and How to Fix It
Large language models often fail at simple counting tasks, even when the items to count are explicitly present in the prompt. We investigate whether this failure occurs because transformers do not represent counts internally, or because they cannot convert those representations into the correct outp...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03258v1
S^2tory: Story Spine Distillation for Movie Script Summarization
Movie scripts pose a fundamental challenge for automatic summarization due to their non-linear, cross-cut narrative structure, which makes surface-level saliency methods ineffective at preserving core story progression. To address this, we introduce S^2tory (Story Spine Distillation), a narratology-...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03244v1
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
Although recent LMMs have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for comp...
📄 Research | May 5, 2026 | http://arxiv.org/abs/2605.03950v1