AI News Archive: June 2, 2026 — Part 22

Sourced from 500+ daily AI sources, scored by relevance.

Consistency Training Can Entrench Misalignment
Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify unde...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03810v1
Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models
Causal tracing of factual recall has been studied predominantly in dense transformer language models, where interventions localize information flow to layers or feed-forward modules. Sparse mixture-of-experts (MoE) language models introduce a sharper question: when a factual prediction is mediated b...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03780v1
Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings
As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge often can be recovered through...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03695v1
SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents
Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly cons...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03692v1
Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models
Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03624v1
SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems
Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03544v1
Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers
Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysi...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03398v1
A Benchmark for Semi-supervised Multi-modal Crowd Counting
This paper constructs the first benchmark on semi-supervised multi-modal crowd counting. To lay the foundation for this unexplored task, we first formulate the semi-supervised multi-modal setting and a standardized protocol that specifies the labeled-unlabeled data partition across different labeled...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03646v1
VidMsg: A Benchmark for Implicit Message Inference in Short Videos
Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03635v1
TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics
Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03626v1
SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition
Skeleton-based action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03610v1
BotLearn
#1 bot university, A2A community. Agents Learn, Humans Earn
🧰 Tools Jun 2, 2026 https://www.producthunt.com/products/botlearn?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene via Rendering Fusion
Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to pro...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03581v1
Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03577v1
Learned Non-Maximum Suppression for 3D Object Detection
Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among de...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03568v1
CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation
Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely on either learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, suffering from difficult cross-mod...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03564v1
Attend to Anything: Foundation Model for Unified Human Attention Modeling
Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generali...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03540v1
Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection
Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI); yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-phase PCB defect ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03508v1
Low-Frequency Shortcuts in Texture-Driven Visual Learning
Neural networks suffer from shortcut learning, where learned features generalize well to the training set but not to in-distribution (ID) or out-of-distribution (OOD) test sets. Existing studies are all based on a few standard benchmarks, which are shape-driven. Numerous application domains, however...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03493v1
IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection
Multimodal fake news detection aims to identify the authenticity of news. Existing multimodal fake news detection methods mainly focus on cross-modal consistency, but often fail to explicitly model the semantic incongruity that characterizes deceptive multimodal content. However, misinformation ofte...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03418v1
A unified multi-task framework enables interpretable chest radiograph analysis
While multimodal deep learning has advanced medical imaging analysis, existing black-box systems may remain confined to isolated tasks, often overlooking the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transfor...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03417v1
Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network
In multi-modal image registration, the primary challenge lies in shared structural information extraction. Compared to Transformers, Structured State Space Duality (SSD) offers greater global structural feature extraction with higher efficiency during training and inference. Inspired by these advant...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03341v1
IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension
Self-supervised learning (SSL) has emerged as a powerful paradigm for learning meaningful representations from unlabeled data. However, the standard protocol for evaluating these representations, linear probing, is computationally expensive, sensitive to hyperparameters, and provides limited insight...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03338v1
TASE: Truncation-Aware Semantic Embeddings for 3D Scene Understanding and Editing
High-fidelity semantic 3D scene representations are crucial for numerous applications, including robotics, autonomous driving, and simulation. Beyond this, the ability to edit such representations enables developers to adapt these applications more easily to specific target scenarios. Current approa...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03314v1
SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series
We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03301v1
VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch
Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03273v1
Reinforcement Learning from Cross-domain Videos with Video Prediction Model
Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03201v1
PlateLens
The best and most accurate AI Calorie Counter app.
🧰 Tools Jun 2, 2026 https://www.producthunt.com/products/platelens-the-best-ai-calorie-counter?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Graph Regularized Non-negative Reduced Biquaternion Matrix Factorization for Color Image Recognition
Non-negative reduced biquaternion matrix factorization (NRBMF) uses the product of reduced biquaternion (RB) matrices to incorporate the non-negativity constraints of color image pixels into the factorization process. However, NRBMF mainly focuses on reconstruction accuracy and does not exploit the ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03654v1
PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models
Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03598v1
Diffusing in the Right Space: A Systematic Study of Latent Diffusability
Latent diffusion models leverage visual tokenizers to compress images into latent spaces for efficient generative modeling. However, better reconstruction quality of a tokenizer does not necessarily translate into better generation quality, suggesting that latent representations should be evaluated ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03578v1
When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics
Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03569v1
Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis
Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitu...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03566v1
Knowledge-Preserved Model Tuning in Null-Space for Robust Spatio-Temporal Video Grounding
Spatio-Temporal Video Grounding aims to localize object tubes based on textual queries. While recent methods have achieved remarkable success, they mainly focus on high-quality (HQ) inputs, neglecting the widespread presence of low-quality (LQ) videos in real-world scenarios. Although tuning methods l...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03539v1
EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation
Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibiti...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03509v1
Characterizing Detectability in 3DGS Poisoning: A Stage-wise Benchmark
3D Gaussian Splatting (3DGS) has rapidly emerged as a leading representation for real-time novel view synthesis, but recent work shows it is vulnerable to diverse poisoning attacks, including illusory object injection, computation cost amplification, and post hoc model watermarking. Despite this exp...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03499v1
PersistGS: Differentiable Physics for Object Permanence in 4D Gaussian Splatting
Dynamic 3D Gaussian Splatting (3DGS) methods reconstruct time-varying scenes from synchronized multi-camera video using photometric supervision. When a moving object becomes fully occluded from all training cameras, this supervision vanishes: the Gaussians representing it receive no gradient signal ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03479v1
PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization
Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03444v1
PHAF-Personalized Hand Avatars in a Flash
We present PHAF-Personalized Hand Avatars in a Flash, a personalized photo-realistic hand avatar which provides high-quality multi-view renders from just two images (dorsal and palmar views). Unlike slow optimization-based techniques, PHAF generates fast personalized textures for real-time deployment...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03420v1
Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams
Engineering diagrams pose a distinct challenge for vision-language models: unlike natural images or general documents, they encode information through dense spatial layouts, domain-specific symbols, and cross-references between visual callouts and structured parts tables. Despite their centrality to...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03410v1
SAMatcher: Co-Visibility Modeling with Segment Anything for Robust Feature Matching
Reliable correspondence estimation is a fundamental problem in image processing, underpinning applications such as Structure from Motion, visual localization, and image registration. Existing learning-based methods have significantly improved local feature representations, yet most still operate at ...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03406v1
Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation
Audio-driven human motion video generation aims to synthesize realistic and temporally coherent human animations from a single static image, with applications in talking-head synthesis, co-speech gesture generation, and dynamic presentations. Moving beyond conventional keypoint-based methods that of...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03402v1
SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation
Recent generative models can now produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: synthetic credibility. We introduce SYNCRED-Bench, a benchmark of 600 AI-generated misinformation images balanced across six credible-form categories and seven fi...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03348v1
Voice Sync AI Teleprompter
Voice-led teleprompter that follows your speech
🧰 Tools Jun 2, 2026 https://www.producthunt.com/products/voice-sync-ai-teleprompter?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Beyond Semantics: Modeling Factual and Affective Perceptual Experiences from Vision-Language Data
We present P-Topics (Perception Topics) modeling, a novel problem for understanding how images are perceived affectively and across cultures. The goal is to (1) discover and model the different perception experiences in a dataset of images and captions, where each experience is defined by an objecti...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03345v1
BA-T: An Iterative Transformer for Two-View Bundle Adjustment
Feed-forward models for 3D reconstruction have achieved strong performance using deep cross-view attention to exchange information across images. However, these approaches often depend on heavy decoder stacks and lack a structured mechanism for geometry refinement, resulting in poor multi-view consi...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03287v1
PaddleOCR-VL-1.6: Expanding the Frontier of Document Parsing with Under-Optimized Region Refinement and Progressive Post-Training
We introduce PaddleOCR-VL-1.6, an upgraded compact document parsing model built upon PaddleOCR-VL-1.5. Although PaddleOCR-VL-1.5 establishes a strong 0.9B baseline, its remaining errors concentrate in under-optimized regions where model behavior is unstable, data coverage is sparse, or supervision i...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03264v1
FreeStreamGS: Online Feed-forward 3D Gaussian Splatting from Unposed Streaming Inputs
Feed-forward 3D Gaussian Splatting (3DGS) allows efficient and high-fidelity novel view synthesis (NVS) from an offline recorded image sequence. However, achieving online NVS from streaming and unposed image inputs remains challenging. Although online feed-forward geometric estimation methods have b...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03254v1
MariData: One-Step Unpaired Image Translation for Maritime Environments
The development of robust perception systems for Maritime Autonomous Surface Ships (MASS) is heavily constrained by the scarcity of diverse training data, particularly for adverse weather and low-light conditions. Because collecting paired images in dynamic maritime environments is physically imposs...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03246v1
MemoGen: Can Past Experience Improve Future Text-to-Image Generation?
Modern text-to-image models have achieved strong visual synthesis, yet remain unreliable when prompts require implicit visual constraints, relational reasoning, or external knowledge. Existing retrieval-augmented and agentic generation methods mitigate this issue by acquiring external knowledge, ref...
📄 Research Jun 2, 2026 http://arxiv.org/abs/2606.03243v1