AI News Archive: May 4, 2026 — Part 11

Sourced from 500+ daily AI sources, scored by relevance.

Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
Shortcut learning causes deep learning models to rely on non-essential features within the data. However, its formation in deep neural network training still lacks theoretical understanding. In this paper, we provide a formal definition of core and shortcut features and employ evolutionary game theo...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02658v1
Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
As artificial intelligence (AI), including machine learning (ML) models and foundation models (FMs), is increasingly deployed in high-stakes domains, ensuring their trustworthiness has become a central challenge. However, the core trustworthy AI objectives, such as fairness, robustness, privacy, and...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02640v1
SCGNN: Semantic Consistency enhanced Graph Neural Network Guided by Granular-ball Computing
Capturing semantic consistency among nodes is crucial for effective graph representation learning. Existing approaches typically rely on k-nearest neighbors (kNN) or other node-level full search algorithms (FSA) to mine semantic relationships via exhaustive pairwise similarity computation, which...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02617v1
Validation of an AI-based end-to-end model for prostate pathology using long-term archived routine samples
Artificial intelligence (AI) is becoming a clinical tool for prostate pathology, but generalization across variations in sample preparation and preservation over prolonged time periods remains poorly understood. We evaluated GleasonAI, an end-to-end attention-based multiple instance learning model, ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02614v1
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
While Large Language Models (LLMs) and Vision-Language Models (VLMs) demonstrate remarkable capabilities in high-level reasoning and semantic understanding, applying them directly to contact-rich manipulation remains a challenge due to their lack of explicit physical grounding and inability to perfo...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02600v1
IConFace: Identity-Structure Asymmetric Conditioning for Unified Reference-Aware Face Restoration
Blind face restoration is highly ill-posed under severe degradation, where identity-critical details may be missing from the degraded input. Same-identity references reduce this ambiguity, but mismatched pose, expression, illumination, age, makeup, or local facial states can lead to overuse of refer...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02814v1
Fine-Grained Graph Generation through Latent Mixture Scheduling
Structure aware graph generation aims to generate graphs that satisfy given topological properties. It has applications in domains such as drug discovery, social network modeling, and knowledge graph construction. Unlike existing methods that only provide coarse control over graph properties, we int...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02780v1
TOC-SR: Task-Optimal Compact diffusion for Image Super Resolution
Diffusion models have recently demonstrated strong performance for image restoration tasks, including super-resolution. However, their large model size and iterative sampling procedures make them computationally expensive for practical deployment. In this work, we present TOC-SR, a framework for bui...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02767v1
Mitigating Misalignment Contagion by Steering with Implicit Traits
Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior sprea...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02751v1
Triple Spectral Fusion for Sensor-based Human Activity Recognition
The field of sensor-based human activity recognition (HAR) mainly uses posture, motion and context data of Inertial Measurement Units (IMUs) to identify daily activities. Despite the advancements in learning-based methods, it is challenging to perform information fusion from the temporal perspective...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02743v1
Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medic...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02740v1
AI and Open-data Driven Scalable Solar Power Profiling
Solar photovoltaic (PV) deployment is expanding rapidly, yet detailed, up-to-date information on the spatial distribution and capacity of rooftop PV remains limited. This paper presents an open, scalable framework for detecting solar panels from open data and generating city-level solar power profil...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02738v1
Coherent Hierarchical Multi-Label Learning to Defer for Medical Imaging
Learning to Defer (L2D) enables a model to predict autonomously or defer to an expert, but prior work largely assumes flat label spaces. We study the first L2D setting with hierarchical multi-label decisions, motivated by medical-imaging workflows in which findings are organised by clinical taxonomi...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02734v1
ProPACT: A Proactive AI-Driven Adaptive Collaborative Tutor for Pair Programming
Effective pair programming depends on coordination of attention, cognitive effort, and joint regulation over time, yet most adaptive learning systems remain individual-centric and reactive. This paper introduces ProPACT, a proactive AI-driven adaptive collaborative tutor that treats collaboration it...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02703v1
Fuzzy Fingerprinting Encoder Pre-trained Language Models for Emotion Recognition in Conversations: Human Assessment and Validity Study
In Emotion Recognition in Conversations (ERC), model decisions should align with nuanced human perception and ideally provide insights on the classification process. Standard encoder pre-trained language models (PLMs) are the state-of-the-art at these tasks but offer little insight into why a certai...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02665v1
AcademiClaw: When Students Set Challenges for AI Agents
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' rea...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02661v1
ViewSAM: Learning View-aware Cross-modal Semantics for Weakly Supervised Cross-view Referring Multi-Object Tracking
Cross-view Referring Multi-Object Tracking (CRMOT) aims to track multiple objects specified by natural language across multiple camera views, with globally consistent identities. Despite recent progress, existing methods rely heavily on costly frame-level spatial annotations and cross-view identity ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02638v1
Counterfactual Reasoning in Automated Planning
Automated planning traditionally assumes that all aspects of a planning task (initial state, goals, and available actions) are fully specified in advance, an approach well-suited to domains with fixed rules and deterministic execution. However, real-world planning often requires flexibility, allowin...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02603v1
Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces
As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems t...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02801v1
FunFuzz: An LLM-Powered Evolutionary Fuzzing Framework
Modern fuzzers increasingly use Large Language Models (LLMs) to generate structured inputs, but LLM-driven fuzzing is sensitive to prompt initialization and sampling variance, which can reduce exploration efficiency and lead to redundant inputs. We present FunFuzz, a multi-island evolutionary fuzzin...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02789v1
ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety alignment and elicit harmful responses. A growing body of work shows that contextual priming, where earlier turns covertly bias later replies, constitutes a powerful attack surface, with hand-crafted multi-turn sc...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02647v1
Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL 2026 study on personal-style post-editing of LLM drafts -- and add three new ones -- with the human investigator acting only as a reviewer-i...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02620v1
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al. 2024), covering more than 30 language-culture pairs, predominantly represen...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02601v1
Benchmarking Retrieval Strategies for Biomedical Retrieval-Augmented Generation: A Controlled Empirical Study
Retrieval-Augmented Generation (RAG) offers a well-established path to grounding large language model (LLM) outputs in external knowledge, yet the question of which retrieval strategy works best in a high-stakes domain such as biomedicine has not received the controlled, multi-metric treatment it de...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02520v1
Revisiting Semantic Role Labeling: Efficient Structured Inference with Dependency-Informed Analysis
Semantic Role Labeling (SRL) provides an explicit representation of predicate-argument structure, capturing linguistically grounded relations such as who did what to whom. While recent NLP progress has been dominated by large language models (LLMs), these systems often rely on implicit semantic repr...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02505v1
A multilingual hallucination benchmark: MultiWikiQHalluA
Most hallucination evaluations focus on English, leaving it unclear whether findings transfer to lower-resource languages. We investigate faithfulness hallucinations, defined as model-generated content that is fluent and plausible but diverges from the provided input or is internally inconsistent. L...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02504v1
Semantic Risk-Aware Heuristic Planning for Robotic Navigation in Dynamic Environments: An LLM-Inspired Approach
The integration of Large Language Model (LLM) reasoning principles into classical robot path planning represents a rapidly emerging research direction. In this paper, we propose a Semantic Risk-Aware Heuristic (SRAH) planner that encodes LLM-inspired cost functions penalising geometrically cluttered...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02862v1
Salessims
AI-driven conversational role-play to practice sales calls
🧰 Tools | May 4, 2026 | https://www.producthunt.com/products/salessims?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures
This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogene...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02270v1
Reliability-Oriented Multilingual Orthopedic Diagnosis: A Domain-Adaptive Modeling and a Conceptual Validation Framework
Large Language Models (LLMs) are increasingly proposed for clinical decision support including multilingual diagnosis in low-resource settings. However, their reliability, calibration and safety characteristics remain insufficiently understood for structured, high-risk tasks. We present a system-lev...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02266v1
Zero-Shot Confidence Estimation for Small LLMs: When Supervised Baselines Aren't Worth Training
How reliably can a small language model estimate its own correctness? The answer determines whether local-to-cloud routing (escalating queries a cheap local model cannot handle) can work without supervised training data. As inference costs dominate large language model (LLM) deployment budgets, routin...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02241v1
Perturbation Dose Responses in Recursive LLM Loops: Raw Switching, Stochastic Floors, and Persistent Escape under Append, Replace, and Dialog Updates
Recursive language-model loops often settle into recognizable attractor-like patterns. The practical question is how much injected text is needed to move a settled loop somewhere else, and whether that move lasts. We study this in 30-step recursive loops by separating the model from the context-upda...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02236v1
Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction
We present a method for diagnosing interpretation in neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02234v1
CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse
In this paper, we present our system for SemEval-2026 Task 6 (CLARITY) on response clarity and evasion detection in question-answer pairs from U.S. presidential interviews, comparing fine-tuned encoders with prompt-based LLMs. Our LLM ensemble achieves 80 macro-F1 on the 3-class Task 1 (9th/41) and ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02170v1
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
Pretraining optimizers are tuned to produce the strongest possible base model, on the assumption that a stronger starting point yields a stronger model after subsequent changes like post-training and quantization. This overlooks the geometry of the base model which controls how much of the base mode...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02105v1
FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents
Text-to-SQL over large analytical databases requires navigating complex schemas, resolving ambiguous queries, and grounding decisions in actual data. Most current systems follow a fixed pipeline where schema elements are retrieved once upfront and the database is only revisited for post-hoc repair, ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02815v1
PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature
Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access ar...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02720v1
Synthetic Users, Real Differences: an Evaluation Framework for User Simulation in Multi-Turn Conversations
There is growing interest in exploring user simulation as an alternative to gathering and scoring real user-chatbot interactions for AI chatbot evaluation. For this purpose, it is important to ensure the realism of the simulation, i.e., the extent to which simulated dialogues reflect real dialogues ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02624v1
Dependency Parsing Across the Resource Spectrum: Evaluating Architectures on High and Low-Resource Languages
Transformer-based models achieve state-of-the-art dependency parsing for high-resource languages, yet their advantage over simpler architectures in low-resource settings remains poorly understood. We evaluate four parsers -- the Biaffine LSTM, Stack-Pointer Network, AfroXLMR-large, and RemBERT -- ac...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02608v1
Shadow-Loom: Causal Reasoning over Graphical World Model of Narratives
Stories hold a reader's attention because they have causes, secrets, and consequences. Shadow-Loom is an experimental open-source framework that turns a narrative into a versioned graphical world model and lets two engines act on it: a causal physics grounded in Pearl's ladder of causation and a rec...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02475v1
Accurate Legal Reasoning at Scale: Neuro-Symbolic Offloading and Structural Auditability for Robust Legal Adjudication
Legal texts often contain computational legal clauses--provisions whose understanding requires complex logic. While frontier Large Reasoning Models (LRMs) can describe such clauses, building production-ready systems is limited by reasoning errors and the high cost of inference. We propose Amortized ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02472v1
Leveraging Argument Structure to Predict Content Hatefulness
Information disorder is a challenging phenomenon that affects society at large. This phenomenon entails the diffusion of misleading, misinforming, and hateful content online. In different contexts, one aspect of the problem may prevail, but overall, this is a broad problem that requires comprehensiv...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02457v1
Measuring AI Reasoning: A Guide for Researchers
In this paper, we offer a guide for researchers on evaluating reasoning in language models, building the case that reasoning should be assessed through evidence of adaptive, multi-step search rather than final-answer accuracy alone. Under an evaluation-oriented definition, reasoning requires selecti...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02442v1
Automatic Reflection Level Classification in Hungarian Student Essays
Reflective thinking is a key competency in education, but assessing reflective writing remains a time-consuming and subjective task for education experts. While automated reflective analysis has been explored in several languages, Hungarian has not been researched extensively. In this paper, we ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02402v1
The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure
As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability -- knowing what they do not know, detecting errors, seeking clarification -- under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detect...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02398v1
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval
Novelty assessment is a critical yet complex task in the examination process for patent acceptance, requiring examiners to determine whether an invention is disclosed in a prior art document. The process involves intricate matching between specific features of a patent claim and passages in the prio...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02392v1
Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training
Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threa...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02374v1
When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models
Deployed language models must produce outputs that are both correct and format-compliant. We study this structured-output reliability gap using two mathematical benchmarks -- GSM8K and MATH -- as a controlled testbed: ground truth is unambiguous and the output contract is strict (JSON with required ...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02363v1
Compositional Multi-hop Factual Error Correction via Decomposition-and-Injection
Factual Error Correction (FEC) aims to revise inaccurate text into statements that are factually consistent with external evidence. Although recent methods perform well on single-hop correction, they often treat claims as atomic units and struggle with multi-hop cases that require compositional reas...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02277v1
WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization
Recently, video language models (VLMs) have been applied in various fields. However, the visual token sequence of the VLM is too long, which may cause intolerable inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in VLMs based on...
📄 Research | May 4, 2026 | http://arxiv.org/abs/2605.02262v1