AI News Archive: June 3, 2026 — Part 14

Sourced from 500+ daily AI sources, scored by relevance.

Food-R1: A Unified Multi-Task Food Vision-Language Model with Reinforcement Learning
Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To addr...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04986v1
Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance
We envision a proactive multi-modal assistant system which gives users real-time step-by-step guidance on a procedural task, autonomously deciding when to interrupt, and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect r...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04970v1
Toward Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling
Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We address these challen...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04920v1
BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine
Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis and treatment planning, where each stage involves distinct imaging modalities, ta...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04911v1
CDPM-Align: Multi-Scale Guidance-Aligned Diffusion Pretraining for Robust Few-Shot Anatomical Landmark Detection
Anatomical landmark detection is a fundamental task in medical image analysis supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, requiring reliability and ro...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04898v1
Recent Advances and Trends in Learning-based 3D Representations
The selection of an appropriate 3D representation is a fundamental design decision that dictates the efficiency, quality, and capabilities of modern computer vision and graphics pipelines for tasks such as 3D reconstruction, novel-view synthesis and rendering, shape and motion analysis, recognition,...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04871v1
IRIS-GAN: Staged Specialist Detection of Deepfake Faces
We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train th...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04863v1
MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU
Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaC...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04847v1
NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning
LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue bot...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04806v1
A Pathology Foundation Model for Gastric Cancer with Real-World Validation
Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone ...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04792v1
Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives
Visual localization -- estimating a camera pose within a pre-existing map -- is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridg...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04788v1
Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms
The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04767v1
StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT
Ischemic stroke is a major global disease. Treatment decisions are highly time-sensitive, as eligibility for reperfusion therapies relies on the interval between stroke onset and intervention. However, the true onset time is often uncertain in clinical practice, necessitating imaging-based assessmen...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04722v1
Data Efficient Complex Feature Fusion Network For Hyperspectral Image Classification
This work presents a data-efficient variant of the Attention-Based Dual-Branch Complex Feature Fusion Network (CFFN) for hyperspectral image classification. The proposed model, termed DE-CFFN, retains the original two-stream structure: the Real-Valued Neural Network (RVNN) processes standard hypersp...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04710v1
Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation
Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point pr...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04705v1
Controllable Dynamic 3D Shape Generation via 3D Trajectories and Text
We introduce T2Mo, a feed-forward framework for controllable dynamic 3D shape generation conditioned on 3D trajectories and text. Due to the inherent ambiguity of language, generating precisely intended motions using text alone remains challenging. To address this, we adopt 3D trajectories as contro...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05162v1
GeM-NR: Geometry-Aware Multi-View Editing for Nonrigid Scene Changes
Recent developments in multi-view image editing with generative models have brought us a step closer toward general 3D content generation and customization. Most existing works focus on rigid or appearance-only edits by utilizing the geometry of the unedited scene. This naturally limits these method...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05142v1
Geometry Gaussians: Decoupling Appearance and Geometry in Gaussian Splatting
After the success of 3D Gaussian Splatting (3DGS) for novel view synthesis, many works have explored how to also use it for geometric surface representation. However, extracting accurate geometric information directly from 3DGS remains challenging and can often reduce the appearance rendering qualit...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05124v1
ZipSplat: Fewer Gaussians, Better Splats
Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured ob...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05102v1
InstantRetouch: Efficient and High-Fidelity Instruction-Guided Image Retouching with Bilateral Space
Language-guided photo retouching aims to adjust color and tone while preserving geometry and texture. Recently, diffusion-based retouching has shown superior visual quality, but it often struggles with both fidelity, due to its generative nature, and efficiency, because of its iterative sampling proc...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05071v1
Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping
Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming v...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05035v1
CIPER: A Unified Framework for Cross-view Image-retrieval and Pose-estimation
Cross-view geo-localization estimates the geographic location of a ground image by matching it against an aerial image database. Existing methods tackle this through either large-scale retrieval or precise pose estimation, but not both: retrieval-based methods enable wide-area search at the cost of ...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05011v1
Multi-Camera AR Guidance System for Surgical Instrument Handling and Assembly: Investigating Workload and Efficiency
The handling and assembly of instruments during surgery imposes high cognitive demands on scrub nurses, particularly when instruments are unfamiliar. We present a supporting guidance system for surgical instrumentation that combines multi-camera 6D pose estimation with augmented reality in-situ visu...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04992v1
Scene-Centric Unsupervised Video Panoptic Segmentation
Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focuse...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04925v1
Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models
Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as ...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04922v1
Hierarchical Space Partition for Surface Reconstruction
Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g. range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To addr...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04891v1
HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition Scenarios
Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intellig...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04888v1
Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification
Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04844v1
3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks
Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis m...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04836v1
OA-CutMix: Correcting the Label Bias of CutMix
CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: The area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label cred...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04820v1
Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?
Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? W...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04811v1
Fast Cubical Persistent Homology on 2D and 3D Images via Union-Find, Pruning, and Lookup Tables
We present Flash Cubical, a highly efficient computation of cubical persistence on a V-filtration for 2D and 3D images over $\mathbb{F}_2$. The implementation is built around three core ideas. First, cubical complexes satisfy properties that allow for the computation of persistence of the highest di...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04801v1
Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization
Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04797v1
Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control
Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V st...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04775v1
NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models
Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to di...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04773v1
Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction
Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human vi...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04772v1
Do Foundation Models See Biology? Evaluating Attention Coherence with Spatial Transcriptomics in Glioblastoma
Whether attention maps from pathology foundation models capture genuine biology remains unknown, yet this question is critical for clinical trust and regulatory approval. We propose a spatial transcriptomics-based framework for orthogonal, hypothesis-free evaluation of attention and apply it to five...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04764v1
Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment
Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern r...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04737v1
ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection
AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts, ...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04706v1
Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms
GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.04701v1
STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations
Training Data Attribution (TDA) seeks to trace a model's predictions back to its training data. The gold standard for TDA relies on causal interventions, observing how a model changes when data is added or removed, but repeated retraining is computationally challenging for Large Language Models (LLM...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05165v1
Reinforcement Learning from Rich Feedback with Distributional DAgger
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, includin...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05152v1
BuilderStudio
Agentic coding IDE for Mac
🧰 Tools | Jun 3, 2026 | https://www.producthunt.com/products/builderstudio?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
BBOmix: A Tabular Benchmark for Hyperparameter Optimization of Unsupervised Biological Representation Learning
The rapid advancement of high-throughput sequencing has led to large, high-dimensional omics datasets. Deep unsupervised learning architectures, particularly Autoencoders (AEs), are increasingly used for dimensionality reduction and representation learning in this domain. However, AEs are highly sen...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05139v1
Activation-Based Active Learning for In-Context Learning: Challenges and Insights
Deep active learning has previously been explored for LLM in-context sample selection, but not with methods that utilise recent advances in understanding of transformer activations. In this paper, we test the hypothesis that model activations could provide a fine-grained signal to optimise the selec...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05134v1
Towards Efficient and Evidence-grounded Mobility Prediction with LLM-Driven Agent
Individual-level mobility prediction is central to urban simulation, transportation planning, and policy analysis. Supervised sequence models achieve strong accuracy but require task-specific training and offer limited decision-level transparency. Recent LLM-based methods improve interpretability, y...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05130v1
Graph Set Transformer
We introduce the Graph Set Transformer (GST), a neural network architecture for learning on sets of graphs, designed for tasks in which per-element predictions depend on set-wide context as well as local structure. Existing architectures, including DeepSets and SetTransformer, require pre-encoded gr...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05116v1
RePercENT: Scaling Disentangled Representation Learning Beyond Two Modalities
To leverage the full potential of multimodal data, we need representations that go beyond the state-of-the-art alignment and fusion approaches and exploit all cross-modal interactions without sacrificing modality-specific information. Learning disentangled representations is a principled way to iden...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05109v1
FoeGlass: Simple In-Context Learning Is Enough for Red Teaming Audio Deepfake Detectors
Audio deepfake detection (ADD) models are critical for countering the malicious use of text-to-speech (TTS) models. Evaluating and strengthening ADD models requires developing datasets that span the space of generated audio and highlight high-error regions. Existing dataset development strategies fa...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05101v1
Fast & Faithful Function Vectors
Function vectors (FVs) are task representations elicited during in-context learning that can be used to steer Large Language Models (LLMs). However, design choices in their formulation remain underexplored. In this work, we study the impact of varying FV definitions for instructions along two degree...
📄 Research | Jun 3, 2026 | http://arxiv.org/abs/2606.05079v1