AI News Archive: June 8, 2026 — Part 22

Sourced from 500+ daily AI sources, scored by relevance.

From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design
Recursive self-design refers to AI-assisted modification of the mechanisms by which an AI system is built, evaluated, and improved. This paper treats MetaAI not as a mature paradigm, but as a working term for a human-seeded, AI-expanded development pattern in which the design space itself becomes a ...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09663v1
Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis
We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive join...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09646v1
ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies
Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09630v1
Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes
Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09607v1
Seeing the Hivemind: A Consensus-Aware Interaction Technique for Mitigating AI Homogenization
People are increasingly using AI for creative tasks such as writing. While adoption continues to grow, this form of use risks undermining individual creativity locally and reducing the heterogeneity of creative output at scale. In response, we introduce the Semantic Repulsion Technique (SRT) and eva...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09587v1
CT-VAM: A Cerebello-Thalamic-Inspired Vision-Action Model for Efficient Visuomotor Control
Vision-language-action models have shown strong promise for robot manipulation, yet raw language is primarily needed to specify task intent rather than to be repeatedly processed during high-frequency low-level execution. Motivated by this separation, we propose a cerebello-thalamic-inspired vision-...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09572v1
AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation
AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09556v1
LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models
Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficie...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09430v1
WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
Computer-use agents (CUAs) increasingly operate in runtimes that combine visual desktop control, command-line execution, code editing, browsers, and external tools. Existing benchmarks, however, often evaluate these interfaces as separable capabilities, leaving long-horizon cross-interface orchestra...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09426v1
Harness Engineering for Physical AI: Robot Middleware Is the Harness Layer
Robot middleware faces a new role in the era of Physical AI. Learned policies, planners, and vision-language-action (VLA) models now enter deployed robots as causal participants on the control path, but the layer that integrates them with timing, scheduling, and network has not been named. Recent la...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09416v1
AI Assurance in UK Defence: Challenges in Operationalising JSP 936
This report examines practical challenges in operationalising JSP 936 Part 1 for AI assurance in UK Defence. Using a structured interpretive review of the directive's requirements, the analysis identifies eight thematic challenge areas: adequacy of evidence and argument, management of human interacti...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09414v1
Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading
Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulat...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09667v1
Beyond Accuracy: Community Perspectives on Machine Translation
Despite remarkable progress in machine translation (MT), non-AI communities have raised growing concerns about MT systems, suggesting a noticeable gap between technical advancement and the needs of real-world users. For instance, while NLP researchers focus on benchmark performance, end users care a...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09655v1
Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving
Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a mode...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09644v1
Gradient-Guided Reward Optimization for Inference-time Alignment
Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two k...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09635v1
Civil Court Simulation with Large Language Models
Court simulation bridges legal education and judicial practice, yet human-based simulations are costly and difficult to scale. Large language models (LLMs) offer a scalable alternative, but existing court-simulation research mainly focuses on criminal cases. Civil litigation is more common in practi...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09632v1
Code Is More Than Text: Uncertainty Estimation for Code Generation
Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing c...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09577v1
Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages
Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary ...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09535v1
Self-Harness: Harnesses That Improve Themselves
The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineer...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09498v1
Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism
Large language models (LLMs) have shown impressive performance on diverse reasoning tasks, yet their capacity for structural reasoning in graphs remains unclear. We investigate whether LLMs can genuinely understand graph isomorphism, a fundamental problem in graph theory. While LLMs achieve near-per...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09484v1
Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios
Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limi...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09428v1
What Should a Skill Remember? Quality-Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents
Large language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing spa...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09421v1
PriFT: Prior-Support Guided Supervised Fine-Tuning
Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09396v1
LexRubric: A Rubric-Guided Diagnostic Benchmark for Open-Ended Legal Tasks
As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation th...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09389v1
Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given promp...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09380v1
Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs
Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which b...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09366v1
Experience Makes Skillful: Enabling Generalizable Medical Agent Reasoning via Self-Evolving Skill Memory
Medical agent systems are increasingly expected to support interactive clinical decision making rather than only static question answering. In such settings, effective agents must reuse prior experience across evolving cases, yet existing memory mechanisms often retain raw historical traces that are...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09365v1
In-Context Learning for the Imputation of Public Opinion Data with Large Language Models
Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing val...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09351v1
PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment
Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is es...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09348v1
Multi-Hop Knowledge Composition is Bound by Pretraining Exposure
Large Language Models fail at implicit multi-hop reasoning: a model answers "When was $X$ born?" and "Who is $Y$'s closest friend?" correctly but fails on "When was $Y$'s closest friend born?" in a single forward pass, even when both facts are perfectly memorized and individually retrievable. We stu...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09338v1
How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?
Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, com...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09334v1
SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling
On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequen...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09304v1
When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following
Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate ...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09662v1
Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion
Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain ...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09603v1
Clinically Grounded Privacy Evaluation of Medical LMs
Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adv...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09590v1
Wearit — Try before you buy
Your AI stylist that learns and dresses you perfectly
🧰 Tools | Jun 8, 2026 | https://www.producthunt.com/products/wearit-try-before-you-buy?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
UXBench: Benchmarking User Experience in AI Assistants
As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. ...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09570v1
OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages
Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many ...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09553v1
Escaping the KL Agreement Trap in On-Policy Distillation
On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training s...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09471v1
DECSELFMASK: Leveraging Unlabeled Text via Self-Relevance-Guided Masking for Decoder-Only Classification
Classification tasks require annotated data, which can often be expensive, time-consuming, or even infeasible to collect. This is the case in the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an app...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09466v1
Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization
Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable ...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09449v1
MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models
Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format was nearly impossible due to language-specific scripts, c...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09435v1
Toward Signing Activity Projection in Sign Language Interaction
Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projecti...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09424v1
Capacity, Not Format: Rethinking Structured Reasoning Failures
Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length co...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09410v1
Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle
Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a mod...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09376v1
NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech
Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplor...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09295v1
One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems
Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complem...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09293v1
Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications
With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time p...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09569v1
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?
Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task gu...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09547v1
Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation
Conventional one-hot encodings often yield poorly calibrated models, being overconfident under attack, and letting entropy-based detection algorithms fail. Previous image classification works have demonstrated that Hadamard-coded output representations can improve adversarial robustness. However, at...
📄 Research | Jun 8, 2026 | http://arxiv.org/abs/2606.09536v1