AI News Archive: June 16, 2026 — Part 21

Sourced from 500+ daily AI sources, scored by relevance.

Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors
Contrastive Language-Image Pre-training (CLIP) models are widely reused across downstream interfaces, including feature extraction, retrieval, reranking, and selection. Existing CLIP backdoor studies, however, usually validate attacks on a small attack-native task, leaving it unclear whether the same poisoned checkpo...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17815v1
The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports
AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generat...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17791v1
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17698v1
LLMs Infer Cultural Context but Fail to Apply It When Responding
Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultur...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17688v1
Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation
While large language models (LLMs) have greatly advanced the functional correctness of automated code translation systems, the runtime efficiency of translated programs has received comparatively little attention. With the waning of Moore's law, runtime efficiency has become increasingly important f...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17683v1
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17682v1
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contai...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17680v1
Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17645v1
Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs
Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17634v1
Variable-Width Transformers
Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing dis...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18246v1
Unintended Effects of Geographic Conditioning in Large Language Models
Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate location leakage: the phenomenon where a model generates geographic references despite rec...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18124v1
Learning task-specific subspaces via interventional post-training of speech foundation models
Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations which are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on onl...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17967v1
ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions
Large language models perform increasingly well on standardized logical reasoning benchmarks, but whether this ability remains robust beyond English is unclear. We introduce ChLogic, an English-Chinese aligned benchmark that tests whether models preserve logical reasoning performance when the same ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17905v1
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
Game generation is an emerging application of coding agents, requiring models to transform natural-language specifications into playable interactive systems. Unlike traditional coding tasks, game generation takes place within a game engine, where scripts, scenes, assets, rendering, and runtime inter...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17861v1
Recursive Scaling in Masked Diffusion Models
Masked diffusion models (MDMs) have recently emerged as a promising paradigm for sequence generation. Scaling MDMs is conventionally achieved by increasing the parameter count or the number of denoising steps. We introduce Recursive Masked Diffusion Models (R-MDMs), which add recursive depth as a th...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18022v1
Perceptual compensation for tonal context in self-supervised speech models
This study examines the extent to which the wav2vec 2.0 architecture exhibits evidence of compensation for phonological context. We conducted a pseudo-replication of a perceptual compensation experiment on Mandarin Chinese tones, and compared the embedding similarities and probing classifier output...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17835v1
Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars
Training psychotherapists in evidence-based interventions such as Acceptance and Commitment Therapy (ACT) requires repeated practice with meaningful feedback, yet opportunities for safe, standardized training are limited by ethical, logistical, and resource constraints. We introduce a system designe...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17786v1
Vision-language models for chest radiography do not always need the image
Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a ca...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17710v1
SuCo: Sufficiency-guided Continuous Adaptive Reasoning
Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tie...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17687v1
MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block
Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17650v1
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation
Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experienc...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17628v1
Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressi...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18249v1
Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners
Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually c...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18198v1
Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology
Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18123v1
HLS-GPT: A Generative Pretrained Transformer (GPT) for Continental-Scale NASA Harmonized Landsat and Sentinel-2 (HLS) Reflectance Reconstruction Across All Bands on Arbitrary Dates
Recent deep learning methods for Landsat and Sentinel-2 reflectance time series reconstruction remain limited by restricted spectral coverage, limited geographic scalability, or patch-based designs with short temporal contexts. We present HLS-GPT, a large-scale generative pretrained Transformer mode...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18115v1
Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System
Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18112v1
Gaussian Light Field Splatting: A Physical Prior-Driven Vision Transformer for Unsupervised Low-Light Image Enhancement
Existing unsupervised low-light image enhancement methods often encounter local exposure imbalance and color distortion under complex non-uniform illumination. In addition, most Vision Transformers lack an explicit mechanism for modeling the physical priors of illumination degradation. To address th...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17985v1
SegDINO: Introducing Multi-Scale Structure into DINO for Efficient Medical Image Segmentation
Self-supervised DINO models provide strong transferable visual representations, yet applying them directly to image segmentation remains challenging. Existing approaches commonly rely on heavy decoders with complex upsampling, introducing substantial parameter and computational overhead. We observe ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17972v1
Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation
Semi-supervised medical image segmentation has emerged as a dominant research problem in medical image analysis, mitigating annotation scarcity by leveraging consistency regularization on unlabeled data. However, existing approaches operate predominantly via visual pattern matching, relying heavily ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17958v1
MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias
When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17953v1
Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model
Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17950v1
A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease
Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research (aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization), the relationships among these modalities remain poorly understood. A systematic analysis of...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17867v1
High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach
Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelit...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17836v1
Million-scale multimodal pollen microscopy with expert-guided foundation models
Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology and biodiversity monitoring, because scalable systems must generalise across specimen preparation, scanner settings and geographic origins while retaining palynological interpretability. To address thi...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17809v1
LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams
Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiv...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17798v1
BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics
Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative mod...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17742v1
GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening
Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose G...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17722v1
SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology
Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17702v1
See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL
Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-base...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17678v1
ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI
Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks ...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17639v1
Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion
Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morp...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18250v1
EventDrive: Event Cameras for Vision-Language Driving Intelligence
Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement t...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18242v1
EgoCS-400K: An Egocentric Gameplay Dataset for World Models
The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18180v1
Neural Tree Reconstruction for the Open Forest Observatory
The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for fore...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18153v1
PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution
Visual attribution is a fundamental tool for interpreting modern vision and vision-language models, particularly when their decisions must be inspected, diagnosed, or audited. Its goal is to explain how a model's decision depends on local regions of the visual input, typically by assigning an import...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.18008v1
Glint
Claude Code activity, right where you want it.
🧰 Tools | Jun 16, 2026 | https://www.producthunt.com/products/glint-9?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
AIGS-Net: Compact Illumination Field Modeling via 2D Gaussian Splatting for Fast Low-Light Image Enhancement
Existing low-light image enhancement methods often face a bottleneck between the representation capacity of illumination-field modeling and computational complexity. To address this issue, this paper proposes an Adaptive Illumination Gaussian Splatting Network (AIGS-Net), an ultra-lightweight archit...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17998v1
Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation
Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semanti...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17966v1
Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation
Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexibl...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17961v1
MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization
Online 3D reconstruction from monocular image sequences is a challenging and ongoing research topic. 3D Gaussian Splatting (3DGS), leveraging its high-quality real-time rendering capability, empowers online 3D reconstruction to represent dense scenes with enhanced expressiveness, and thus holds grea...
📄 Research | Jun 16, 2026 | http://arxiv.org/abs/2606.17935v1