AI News Archive: June 2, 2026 — Part 22
Sourced from 500+ daily AI sources, scored by relevance.
- Does Language Shift Break Medical Vision-Language Models? Indonesian Radiology Visual Question Answering Case Study
Medical Vision-Language Models (VLMs) are typically evaluated on English radiology visual question answering benchmarks, leaving their robustness under non-English clinical language largely unexplored. We introduce IndoRad-VQA, an Indonesian adaptation of VQA-RAD, to assess whether medical VLMs reta...
- CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks
Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists, and standard public benchmarks cannot be trusted, their items having likely leaked into pretraining, so scores reflect memorization rather than fitness. We present CoEval, an open-sou...
- Vintapp
Snap your clothing, AI lists it on Vinted in 1 click
- Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability
Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuni...
- Building Reliable Long-Form Generation via Hallucination Rejection Sampling
Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon ...
- Beyond the Literal: Decomposing Pragmatic Intent in Multimodal Meme Understanding
When asked what a meme or sarcastic post means, Large Vision Language Models (LVLMs) tend to describe what the image shows rather than what the author is trying to communicate. Standard instruction tuning entangles a post's literal content with its pragmatic meaning, letting surface-level details co...
- World Models Meet Language Models: On the Complementarity of Concrete and Abstract Reasoning
World models and multimodal large language models (MLLMs) provide complementary capabilities for predicting future outcomes from static visual observations. World models can generate concrete visual rollouts of possible futures, while MLLMs can reason abstractly over questions, goals, and rules. How...
- CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery
Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of doma...
- AutoTail-BSFGM: Class-Balance-Aware Fine-Tuning for Chinese Scholarly Text Classification
Scholarly text classification supports literature organization, subject indexing, and research intelligence, but Chinese scholarly corpora often contain imbalanced and semantically adjacent disciplinary labels. We propose AutoTail-BSFGM, a class-balance-aware fine-tuning method that combines an auto...
- Can LLM Rerankers Predict Their Own Ranking Performance?
Retrieval effectiveness varies substantially across queries, making it important to estimate ranking quality before relevance judgments are available. Query performance prediction (QPP) addresses this need, but most existing methods rely on external predictors after retrieval or reranking. In this p...
- BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language
We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice reco...
- DMF: A Deterministic Memory Framework for Conversational AI Agents
Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opa...
- Large Language Models Are Overconfident in Their Own Responses
Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms...
- Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models
While large language models (LLMs) are increasingly used for clinical applications, many existing pipelines require sending raw sensitive health information to remote servers for processing, which heightens the risk of privacy leakage. A natural approach to mitigate this risk is to encrypt the data ...
- When Model Merging Breaks Routing: Training-Free Calibration for MoE
Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We ident...
- P\textsuperscript{2}-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization
Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm ha...
- Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics
Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary...
- Efficient ASR Training with Conversations that Never Happened
Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assemb...
- ZeroSev — Your Oncall copilot
By the time you are paged, the root cause is ready!
- A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026
We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to IWSLT 2026 Simultaneous Speech Translation Shared task for Czech to English and English to German and Italian. The strength...
- Value-Aware Stochastic KV Cache Eviction for Reasoning Models
Reasoning models improve accuracy through extended chains of thought, but their long outputs create a memory and compute bottleneck. KV cache eviction methods reduce this cost by evicting unimportant key-value pairs from the cache, yet they often yield worse accuracy than selection-based sparse atte...
- RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions
Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, ...
- Rethinking the Idiomaticity Decomposability Hypothesis: Evidence from Distributional Learning
Idioms can be analysed in terms of their decomposability, the extent to which constituent meanings contribute to the figurative whole. Decomposability is thought to predict syntactic flexibility. Usage-based accounts instead attribute idiom behaviour to distributional experience, such as speaker fam...
- Consistency Training Can Entrench Misalignment
Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify unde...
- Expert-Aware Causal Tracing of Factual Recall in Sparse MoE Language Models
Causal tracing of factual recall has been studied predominantly in dense transformer language models, where interventions localize information flow to layers or feed-forward modules. Sparse mixture-of-experts (MoE) language models introduce a sharper question: when a factual prediction is mediated b...
- Don't Forget Your Embeddings: Robust Knowledge Erasure via Precise Editing of Embeddings
As language models are increasingly deployed in real-world applications, the ability to erase specific knowledge from them becomes critical for safety and compliance. Prominent methods seek persistent removal by updating the model's parameters, yet the target knowledge often can be recovered through...
- SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents
Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly cons...
- Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models
Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. We formalize this challenge as...
- SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems
Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when ...
- Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers
Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysi...
- A Benchmark for Semi-supervised Multi-modal Crowd Counting
This paper constructs the first benchmark on semi-supervised multi-modal crowd counting. To lay the foundation for this unexplored task, we first formulate the semi-supervised multi-modal setting and a standardized protocol that specifies the labeled-unlabeled data partition across different labeled...
- VidMsg: A Benchmark for Implicit Message Inference in Short Videos
Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains...
- TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics
Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors...
- SkelHCC: A Hyperbolic CLIP-Driven Cache Adaptation Framework for Skeleton-based One-Shot Action Recognition
Skeleton-based action recognition aims to understand human behaviors from body joint sequences and is especially challenging in the one-shot setting, where only a single labeled exemplar is available for each novel action. A key challenge is learning representations that capture the hierarchical and...
- BotLearn
#1 bot university, A2A community. Agents Learn, Humans Earn
- UnsOcc: 3D Semantic Occupancy Prediction in Unstructured Scene via Rendering Fusion
Unstructured scenes present unique challenges for autonomous driving, as irregular obstacles and sparse scene layouts undermine the effectiveness of traditional perception methods such as 3D object detection. 3D semantic occupancy prediction has emerged as a prominent focus due to its ability to pro...
- Eliciting Complex Spatial Reasoning in MLLMs through Wide-Baseline Matching
Wide-baseline matching (WBM) requires integrating geometric understanding, viewpoint changes, fine-grained perception, and occlusion reasoning, making it a challenging testbed for spatial reasoning in multimodal large language models (MLLMs) deployed in physical environments. However, current MLLMs ...
- Learned Non-Maximum Suppression for 3D Object Detection
Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among de...
- \textsc{CR-Seg}: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation
Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely on either learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, suffering from difficult cross-mod...
- Attend to Anything: Foundation Model for Unified Human Attention Modeling
Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generali...
- Structure-Guided Mixed Masked Pretraining and Spatial Continuity Regularization for Printed Circuit Board Defect Detection
Printed circuit board (PCB) defect detection is an essential part of automated optical inspection (AOI); yet it remains challenging in practice because many defects are tiny, low-contrast, and embedded in dense circuit backgrounds. To address these issues, this paper presents a two-phase PCB defect ...
- Low-Frequency Shortcuts in Texture-Driven Visual Learning
Neural networks suffer from shortcut learning, where learned features generalize well to the training set but not to in-distribution (ID) or out-of-distribution (OOD) test sets. Existing studies are all based on a few standard benchmarks, which are shape-driven. Numerous application domains, however...
- IDO: Incongruity-aware Distribution Optimization for Multimodal Fake News Detection
Multimodal fake news detection aims to identify the authenticity of news. Existing multimodal fake news detection methods mainly focus on cross-modal consistency, but often fail to explicitly model the semantic incongruity that characterizes deceptive multimodal content. However, misinformation ofte...
- A unified multi-task framework enables interpretable chest radiograph analysis
While multimodal deep learning has advanced medical imaging analysis, existing black-box systems \textcolor{black}{may remain confined to isolated tasks, often overlooking} the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transfor...
- Cross-Modality Feature Fusion Based on Structured State Space Duality for Multimodal Image Registration Network
In multi-modal image registration, the primary challenge lies in shared structural information extraction. Compared to Transformers, Structured State Space Duality (SSD) offers greater global structural feature extraction with higher efficiency during training and inference. Inspired by these advant...
- IdEst: Assessing Self-Supervised Learning Representations via Intrinsic Dimension
Self-supervised learning (SSL) has emerged as a powerful paradigm for learning meaningful representations from unlabeled data. However, the standard protocol for evaluating these representations, linear probing, is computationally expensive, sensitive to hyperparameters, and provides limited insight...
- TASE: Truncation-Aware Semantic Embeddings for 3D Scene Understanding and Editing
High-fidelity semantic 3D scene representations are crucial for numerous applications, including robotics, autonomous driving, and simulation. Beyond this, the ability to edit such representations enables developers to adapt these applications more easily to specific target scenarios. Current approa...
- SagaQA: A Multi-hop Reasoning Benchmark for Long-form Narrative Understanding in TV Series
We introduce SagaQA, a long-form video benchmark for multi-hop reasoning over full-length TV series. Existing video reasoning benchmarks often emphasize local understanding of adjacent frames or clips. SagaQA addresses this gap by requiring high-level comprehension of extended multimodal narratives ...
- VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch
Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly ...
- Reinforcement Learning from Cross-domain Videos with Video Prediction Model
Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different ...