AI News Archive: June 8, 2026 — Part 23

Sourced from 500+ daily AI sources, scored by relevance.

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization
Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable ...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09449v1
MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models
Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format was nearly impossible due to language-specific scripts, c...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09435v1
Toward Signing Activity Projection in Sign Language Interaction
Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projecti...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09424v1
Capacity, Not Format: Rethinking Structured Reasoning Failures
Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length co...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09410v1
Precision Is Not Faithfulness: Coverage-Aware Evaluation of Grounded Generation with a Complete Oracle
Reference-free faithfulness metrics verify each atomic claim a model makes against ground truth, and are increasingly used to evaluate grounded generation. We show they share a blind spot: they measure only precision -- are the stated claims supported? -- and therefore reward abstention, since a mod...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09376v1
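Editor's sketch, not taken from the paper: the blind spot described above can be shown with a toy example. The function names, the set-based claim representation, and the sample facts are all assumptions made for illustration; the paper's actual metric and oracle are not reproduced here.

```python
# Toy illustration (hypothetical names and data): precision alone rewards a
# model that says very little, while a coverage score exposes the omission.

def precision(claims: set[str], oracle: set[str]) -> float:
    """Fraction of stated claims that the oracle supports."""
    return len(claims & oracle) / len(claims)

def coverage(claims: set[str], oracle: set[str]) -> float:
    """Fraction of oracle facts that the stated claims recover."""
    return len(claims & oracle) / len(oracle)

# Assumed ground truth: four facts the answer should contain.
oracle = {"fact_a", "fact_b", "fact_c", "fact_d"}

# A cautious model states one safe claim; a thorough one states four, one wrong.
cautious = {"fact_a"}
thorough = {"fact_a", "fact_b", "fact_c", "fact_x"}

for name, claims in [("cautious", cautious), ("thorough", thorough)]:
    print(name, precision(claims, oracle), coverage(claims, oracle))
```

Under these assumptions the cautious model wins on precision yet recovers only a quarter of the facts, which is the abstention effect the abstract points to.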
NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech
Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplor...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09295v1
One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems
Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complem...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09293v1
Efficient Minimal Solvers for Relative Pose Estimation in Autonomous Driving Applications
With the advancement of visual sensing systems, computer vision is playing an increasingly important role in autonomous driving and robot navigation. Relative pose estimation in multi-camera systems is essential for accurate vehicle localization and environment perception, demanding high real-time p...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09569v1
Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?
Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task gu...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09547v1
Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation
Conventional one-hot encodings often yield poorly calibrated models that are overconfident under attack, causing entropy-based detection algorithms to fail. Previous image classification works have demonstrated that Hadamard-coded output representations can improve adversarial robustness. However, at...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09536v1
Securing Self-supervised Data Curation for Foundation Models Robustness
Self-supervised data curation provides a pathway to scaling and improving the generalization capabilities of machine learning models. By leveraging self-supervised learning (SSL) for data curation, the demand for massive training datasets required by foundation models can be effectively met. SSL gre...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09511v1
Decentralize AI Hackathon
A global hackathon for the open AI future
🧰 ToolsJun 8, 2026https://www.producthunt.com/products/hacker-noon?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
ContextShift: A Controlled Benchmark for Context Dependence in Object Detection
Modern object detectors achieve strong performance on standard benchmarks, yet their robustness to contextual variation remains insufficiently understood. Prior evaluations largely rely on aggregate metrics such as AP on uncontrolled distribution shifts, which can obscure how performance degrades un...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09495v1
Efficient Minimal Solvers for Visual-Inertial Relative Pose Estimation in Multi-Camera Systems
Estimating the relative poses of multi-camera systems is a fundamental problem in computer vision, with critical applications in autonomous vehicles, mobile devices, and unmanned aerial vehicles (UAVs). However, existing solutions often suffer from high computational complexity or rely on an excessi...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09477v1
GD-MIL: Grade-Disentangled Multiple Instance Learning for Multimodal Biochemical Recurrence Prediction in Prostate Cancer
Biochemical recurrence (BCR) after radical prostatectomy is a critical endpoint in prostate cancer, yet risk stratification relies almost entirely on variables dominated by Gleason grade. Whether H&E whole slide images (WSIs) carry prognostic signal beyond grade, and whether multiple instance learni...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09453v1
CapRL++: Unified Reinforcement Learning with Verifiable Rewards for Dense Image and Video Captioning
Image and video captioning are fundamental tasks that bridge the visual and linguistic domains, playing a critical role in pre-training Large Vision-Language Models (LVLMs). Current state-of-the-art captioning models are typically trained with Supervised Fine-Tuning (SFT), a paradigm that relies on ...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09393v1
Echo-DM: Ultrasound Marker Removal via Conditional Latent Diffusion and Region-Aware Fusion
Clinical ultrasound images often contain artificial markers, such as measurement calipers and text, to assist diagnostic interpretation and comparison. However, these markers can introduce shortcut bias in downstream automated analysis, encouraging deep learning models to rely on marker-related cues...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09378v1
RT-SDGOD: Real-Time Single-Domain Generalized Object Detection
In real-world deployment under strict real-time constraints, weather and imaging variations induce significant distribution shifts, severely degrading detectors. Single-Domain Generalized Object Detection aims to mitigate this issue, yet existing methods rarely investigate -- at the level of problem fo...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09367v1
Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study
Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimod...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09362v1
Beyond Humans: Multispecies Animal Face Recognition Using Transfer Learning
Individual animal recognition can be useful in the search for lost or stolen pets, the tracking of individuals of endangered species, and the recognition of animals in crowded farms. Present recognition techniques mostly use physical devices, e.g., microchips, often impractical and difficult to appl...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09353v1
Taming Perception Jitter: Uncertainty-Aware LiDAR Object Detection for Reliable Motion Classification
Reliable motion classification is critical for autonomous driving, as false dynamic predictions of static objects can cascade into unnecessary planner interventions. Unstable bounding box predictions can lead to spurious velocity estimates in tracking and falsely predicted trajectories. We present a...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09350v1
IB-HFN: Information Bottleneck-Driven SAR-Optical Fusion Network for High-Fidelity Cloud Removal
Synthetic aperture radar (SAR)-assisted optical cloud removal aims to recover surface information obscured by clouds in optical remote sensing images by exploiting complementary SAR observations. Existing multimodal fusion methods typically rely on direct spatial concatenation and pixel-wise supervi...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09347v1
Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning
The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often ...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09303v1
Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning
Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is insta...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09290v1
EditSSC: Toward Editable Semantic Occupancy Scenes with Unconditional Diffusion Models
3D semantic scene generation is crucial for autonomous driving applications, yet most methods rely on complex 3D-specific architectures such as triplane encoders and adapted diffusion networks, limiting both their simplicity and their editing capabilities. We propose EditSSC, an editing-ready method...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09273v1
Self-supervised Learning Matters: A Simple Ensemble Solution for Micro-Gesture Recognition
In this paper, we present XInsight Lab's solution to the micro-gesture classification track of the 4th MiGA Challenge at IJCAI 2026, in which our solution ranked first and achieved a new state-of-the-art result. We propose a multimodal ensemble framework that integrates a self-supervised RGB-based m...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09261v1
MAGIS: Evidence-Based Multi-Agent Reasoning for Interpretable Strabismus Clinical Decision-Making
Strabismus is a common ocular disorder that requires fine-grained subtype diagnosis for individualized treatment planning. However, existing deep learning methods mainly provide diagnostic predictions without transparent reasoning, while recent large vision-language models (LVLMs), although promisin...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09249v1
FounderOS
AI Command Center for Solo Founders
🧰 ToolsJun 8, 2026https://www.producthunt.com/products/founderos-2?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Temporal-Aware Reasoning Optimization for Video Temporal Grounding
Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limi...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09248v1
Proposal Refinement for Few-Shot Object Detection
Few-shot object detection has gained wide attention in recent years. Some excellent algorithms have been proposed to handle this task. However, most of these algorithms rely on the performance of few-shot classification. Unlike previous attempts, our work focuses on the problem of unbalanced distr...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09245v1
MAVIS: Multi-Agent Video Retrieval via Structured Video Understanding
The dominant paradigm in video retrieval relies on embedding-based full-corpus scanning, which suffers from inherent computational inefficiency and the semantic asymmetry between information-dense videos and sparse textual queries. To bridge this gap, we introduce MAVIS, a novel multi-agent...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09641v1
CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation
The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training dat...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09639v1
DexPIE: Stable Dexterous Policy Improvement from Real-World Experience
Dexterous manipulation presents substantial challenges for imitation learning due to its high-dimensional action space and complex contact-rich dynamics. Policies trained purely from demonstrations often suffer from compounding errors during deployment and require large amounts of expert data to ach...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09615v1
TUDSR: Twice Upsampling-Diffusion for Higher Super-Resolution
Diffusion-based generative models have achieved remarkable success in real-world image super-resolution (SR). With tiled diffusion techniques, these models can produce high-resolution images that exceed their native-supported resolution. However, the quality of such high-resolution (e.g., $2048^2$) ou...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09608v1
A VideoMAE-v2 Approach to Zero-Shot Traffic Accident Anticipation
Traffic accident anticipation -- predicting the likelihood of an imminent collision at every frame of a dashcam video -- is safety-critical yet difficult to scale, because collecting in-domain annotated accident footage for every deployment scenario is prohibitively expensive. We study this task und...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09542v1
SwiftVR: Real-Time One-Step Generative Video Restoration
Real-time video restoration (VR) for live streams requires high-resolution outputs under strict per-frame latency constraints. Existing one-step diffusion-based VR models remain difficult to deploy on consumer-grade GPUs due to two main bottlenecks: quadratic spatial attention at high resolutions an...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09516v1
Prisma-World: Camera-Controllable Multi-Agent Video World Model
Video world models have made rapid progress in generating controllable visual experiences, but most of them still simulate the world from a single observer. Extending such models to multiple agents raises a central challenge: if each agent's future state is generated independently, overlapping views...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09507v1
Training-Free Generalized Few-Shot Segmentation through Open-Vocabulary Semantic Arbitration
Generalized Few-Shot Semantic Segmentation (GFSS) has traditionally been approached as a representation-learning problem, requiring task-specific adaptation to incorporate novel classes from limited support examples. Recent foundation models, however, already exhibit strong open-vocabulary recogniti...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09474v1
ExDet: Open-Domain Open-Vocabulary Detection with Cross-modal Extrapolation and Rectification
Open-domain open-vocabulary detection (ODOVD) requires detectors to generalize to both novel categories and unseen domains, making it more challenging than open-vocabulary detection. Existing methods typically train open-vocabulary detectors together with domain generalization modules from scratch, ...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09360v1
See More, Match Better: Multi-Source Feature Fusion for Two-View Correspondence Learning
Two-view correspondence learning aims to distinguish true correspondences (inliers) from false ones (outliers) in image pairs by leveraging their underlying differences. Existing methods mainly rely on coordinate-based geometric consistency. However, they often struggle with pseudo-consistent outlie...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09262v1
LiteVSR: Lightweight Adaptation of Frozen Diffusion Transformers for Video Super-Resolution
Adapting large-scale pre-trained video generators for Video Super-Resolution (VSR) in novel domains remains computationally prohibitive. Methods that reformulate generation as direct Low-Quality to High-Quality mappings deviate from the original generative formulation, demanding extensive fine-tunin...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09250v1
EgoTactile: Learning Grasp Pressure for Everyday Objects from Egocentric Video
Estimating full-hand grasp pressure from egocentric video is critical for immersive VR and robotic manipulation, yet dense tactile sensing often relies on intrusive hardware. Existing vision-based methods predominantly rely on planar surfaces or fingertip contacts, failing to generalize to complex 3...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09243v1
Semi-supervised Source Detection in Astronomical Images: New Benchmark and Strong Baseline
Source detection in modern observational astronomy is a cornerstone for localizing and identifying stellar sources accurately. It is crucial for studies such as stellar population synthesis and cosmological parameter estimation. However, the characteristics of astronomical images, including high den...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09219v1
Minimal Solvers for Full-DoF Motion Estimation from Asynchronous Differential SfM
As bio-inspired intelligent sensors, event cameras have introduced a new paradigm in the intelligent perception of spatiotemporal information and visual motion estimation, characterized by their high temporal resolution, low latency, and minimal power consumption. However, their asynchronous data s...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09218v1
In-Context Learning for Latent Space Bayesian Optimization
Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art regression performanc...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09664v1
On Choosing the $μ$ Parameter in Gaussian Differential Privacy
Recent work argues for using Gaussian differential privacy (GDP) to report the privacy guarantees in privacy-preserving machine learning. We provide principled mappings from pure-DP $\varepsilon$ to GDP $μ$ by matching the worst-case success of a strong-adversary membership inference attack in terms...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09582v1
Integrating gene regulatory priors into Transformer attention with scTransformer for interpretable scRNA-seq analysis
Motivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowl...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09558v1
Automating the Expert Eye: A System-Agnostic Deep Learning Framework for Rare Event Discovery in Imbalanced Force Spectroscopy
Single-Molecule Force Spectroscopy (SMFS) provides unprecedented insights into biomolecular mechanics, yet the high-throughput generation of force-extension trajectories creates a severe data curation bottleneck. Identifying rare molecular unbinding events within thousands of noise-dominated curves ...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09541v1
Efficient Traffic Prediction at Scale: A Systematic Study of STGCN Architectural Depth
Spatio-temporal graph neural networks (STGNNs) have become the dominant approach for traffic prediction, yet their computational requirements pose challenges for practical deployment in intelligent transportation systems (ITS). While recent work has proposed efficient alternatives to STGNNs, a funda...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09539v1
BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference
Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, fail...
📄 ResearchJun 8, 2026http://arxiv.org/abs/2606.09514v1