AI News Archive: June 11, 2026 — Part 16

Sourced from 500+ daily AI sources, scored by relevance.

PolyAlign: Conditional Human-Distribution Alignment
Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13227v1
When Similar Means Different: Evaluating LLMs on Arabic-Hebrew Cognates
Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce Se...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13218v1
Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization
Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN m...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13216v1
Understanding helpfulness and harmless tension in reward models
Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13209v1
SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection
Prompt-based LLMs are increasingly used for stance detection, but harder examples are not always repaired by clearer instructions, reasoning prompts, retrieval, or debate. We introduce SICI (Stance Inference Complexity Index), a seven-dimensional diagnostic measure of the semantic-pragmatic burden i...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13189v1
LAUKIN: A Multi-jurisdictional Common Law Contract Dataset
Multinational companies increasingly require cross-jurisdictional contract review, yet existing legal NLP datasets are largely restricted to a single jurisdiction. We introduce LAUKIN (Legal equivalence dataset of Australia, UK, and INdia), a dataset of clause pairs (AU-UK, UK-IN, IN-AU) labelled fo...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13184v1
MemRefine: LLM-Guided Compression for Long-Term Agent Memory
Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13177v1
Juno
AI Health Companion for Chronic Illness
🧰 Tools Jun 11, 2026 https://www.producthunt.com/products/juno-13?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents
Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized rea...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13174v1
NTS-CoT: Mitigating Hallucinations in LLM-based News Timeline Summarization with Chain-of-Thought Reasoning
The rapid updates of online news make tracking event developments challenging, highlighting the need for timeline summarization (TLS). Hallucinations, where LLM-generated content deviates from source news, still remain a critical issue in LLM-based TLS and are not well studied in existing works. To ...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13171v1
From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation
The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13630v1
Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models
Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native struc...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13558v1
When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval
While mixed-language querying is ubiquitous in multilingual communities, the sensitivity of dense retrievers to such queries remains poorly understood. We present a ratio-controlled study on mMARCO that systematically evaluates retrieval performance by varying the mixing proportion of parallel query...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13537v1
RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue
The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not w...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13310v1
HyPE: Category-Aware Hypergraph Encoding with Persistent Edge Embeddings for Persona-Grounded Dialogue
Persona-grounded dialogue systems aim to produce responses consistent with a speaker's persona, yet existing methods treat personas as a flat set of sentences and fail to model the high-order relations among persona attributes, e.g., that several persona sentences share a topical category. We propose...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13142v1
Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction
We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normal...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13655v1
Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background
Rapid urban expansion and population growth are causing an immense increase in waste production, which demands efficient, automated waste management. In this scenario, automated waste recycling (AWR) using deep learning methods can assist humans in optimal waste management. R...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13587v1
Reinforcement Learning for Neural Model Editing
Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where ag...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13461v1
VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models
Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13460v1
Person Identification from Contextual Motion
We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by the surveillance and authenticati...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13410v1
Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis
We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challen...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13341v1
HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers
Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is dri...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13289v1
An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors
Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-bas...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13136v1
PixelForge
Turn photos into game assets
🧰 Tools Jun 11, 2026 https://www.producthunt.com/products/pixelforge-put-anyone-into-your-game?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation
Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, Conv...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13135v1
InterleaveThinker: Reinforcing Agentic Interleaved Generation
Recent image generators have demonstrated impressive photorealism and instruction-following capabilities in single-image generation and editing. However, constrained by their architectures, they cannot achieve interleaved generation (text-image sequence), which has crucial applications in visual nar...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13679v1
Modality Forcing for Scalable Spatial Generation
Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and invol...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13676v1
RepWAM: World Action Modeling with Representation Visual-Action Tokenizers
This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel r...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13674v1
World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible
Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometr...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13652v1
Surflo: Consistent 3D Surface Flow Model with Global State
Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent method...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13644v1
MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models
World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding an...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13515v1
SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale
This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as ...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13497v1
NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation
Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13494v1
VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits
Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai,...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13427v1
MoVerse: Real-Time Video World Modeling with Panoramic Gaussian Scaffold
We present MoVerse, a real-time video world model that creates an interactively navigable scene from a single narrow-field-of-view image. This setting is challenging because the input observes only a small fraction of the environment, while interactive roaming requires a complete surrounding world, ...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13376v1
Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization
The rate-distortion-perception (RDP) trade-off extends classical rate-distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13366v1
VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, d...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13364v1
JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space
Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction durin...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13345v1
OR-Action: Multi-Role Video Understanding with Fine-Grained Actions
Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to model this environment is scene graphs as an interpretable representation of OR interactions. Convertin...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13332v1
Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI
Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remain...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13315v1
MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification
Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing mic...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13312v1
ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance
Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurat...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13304v1
DuET: Dual Expert Trajectories for Diffusion Image Editing
Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantiall...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13303v1
Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing
This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is tra...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13275v1
Towards More General Control of Diffusion Models Using Jeffrey Guidance
A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13240v1
Distributional Loss for Robust Classification
This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation i...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13223v1
Visual Place Recognition in Forests with Depth-Aware Distillation
Visual place recognition in natural forest environments remains challenging due to repetitive vegetation, weak structural cues, and significant appearance variation across traversals. To address this limitation, this paper proposes a lightweight depth-aware distillation framework that injects geomet...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13206v1
Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework
Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually c...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13188v1
Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback
Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13156v1
Understanding Truncated Positional Encodings for Graph Neural Networks
Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs - spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix) - are theoretically equivalent...
📄 Research Jun 11, 2026 http://arxiv.org/abs/2606.13671v1