AI News Archive: June 1, 2026 — Part 20

Sourced from 500+ daily AI sources, scored by relevance.

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection
Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01896v1
Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations
Accurate Remaining Useful Life prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging due to the irregular nature of sensor observations, characterized by asynchronous sampling, burst missingness, and temporal jitter. Compounding this issue, purel...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01894v1
EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors
Practical non-invasive Brain-Computer Interface (BCI) systems require EEG decoders with strong cross-subject generalization and minimal calibration. However, inter-subject variability and signal non-stationarity often entangle motor semantics with subject-specific noise, limiting subject-independent...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01884v1
Suppressing Forgery-Specific Shortcuts for Generalizable Deepfake Detection
Deepfake detection suffers from poor generalization across forgery methods, as existing models tend to rely on spurious method-specific shortcuts that fail to transfer to unseen manipulations. While recent approaches attempt to improve generalization, they lack an explicit mechanism to identify and ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01843v1
Physics-Guided Attention in a Lightweight TCN for Efficient WiFi CSI-Based Human Activity Recognition
Human Activity Recognition (HAR) using WiFi Channel State Information (CSI) has gained increasing attention due to its non-contact, low-cost, and privacy-preserving nature. However, existing learning-based approaches largely rely on deep, computationally intensive architectures to implicitly capture m...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01834v1
Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation
Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and tend to revisit known states rather than reach rare ones under long-horizon extrapolation. Inspired by classical enhanced sampling...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01833v1
DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding
Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal kn...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.02091v1
Machine Learning for Coding Retail Product Names to Consumer-Price Categories: A Rule-plus-Bag-of-Words Pipeline with Reliability-Weighted Human-in-the-Loop Labeling
Consumer-price measurement increasingly draws on alternative data sources -- scanner, web-scraped, and transaction/receipt data. A recurring obstacle is that product descriptions in such sources are short, noisy, and abbreviated, with no standard product code, so each item must first be mapped to a ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.02004v1
CARTE: A Benchmark for Mapping Language Model Knowledge Across France
We introduce CARTE (Culturally Anchored Regional-Territorial Evaluation), a multiple-choice benchmark for evaluating the ability of large language models (LLMs) to perform fine-grained reasoning over geographically grounded and regionally differentiated knowledge within France. While prior benchmar...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01995v1
Training Prompt Matters: State-Adaptive Optimization for Robust Fine-Tuning
While prompt engineering is instrumental in maximizing the capabilities of Large Language Models (LLMs) during inference, the role of prompts during training remains critically underexplored. Prevailing fine-tuning paradigms typically treat training prompts as mere surface forms, assuming that seman...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01967v1
What to Format and How: A Benchmark and Workflow Approach for Document Formatting
Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01936v1
Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time
Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01923v1
Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning
Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer o...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01914v1
CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs
Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning....
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01879v1
TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech
Fine-grained morphosyntactic error annotation is important in clinical and developmental language research, yet it is labour-intensive, expert-dependent, and difficult to scale. We present TalkTag, an LLM-based lightweight tool fine-tuned to automate CHAT-style error annotation in spoken-language tr...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01820v1
VaultSeek
Local AI search for your messy files
🧰 Tools · Jun 1, 2026 · https://www.producthunt.com/products/vaultseek?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
CRAB-Bench: Evaluating LLM Agents under Complex Task Dependencies and Human-aligned User Simulation
Evaluating LLM agents in realistic service scenarios requires complex task dependencies, imperfect user behavior, and an evaluation that accommodates multiple valid solutions. We introduce CRAB-Bench (Constraint-based Realistic Agent Benchmark) and RUSE (Realistic User Simulation Engine) to address ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01815v1
Cost-Aware Diffusion Draft Trees for Speculative Decoding
Speculative decoding accelerates inference by having a lightweight drafter propose tokens verified in parallel by the target language model. Block diffusion drafters such as DFlash generate an entire draft block in one pass, yielding per-position marginals; DDTree uses these to build a candidate tre...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01813v1
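Editor's sketch for the speculative decoding item above: a minimal Python illustration of the draft-then-verify loop, assuming the simple greedy acceptance rule (keep the longest draft prefix that matches the target's own greedy choices, then take one token from the target). This is not the method of the listed paper. The names `verify_draft` and `toy_target` are hypothetical, and the toy target model stands in for a real language model, which would score all draft positions in one parallel pass.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens committed after checking one draft block.

    Greedy acceptance: walk the draft left to right and keep tokens while they
    equal the target's greedy next token. On the first mismatch, emit the
    target's token instead and stop. If every draft token matches, the target
    contributes one extra token for free.
    """
    context = list(prefix)
    committed: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            committed.append(expected)  # correction from the target model
            return committed
        committed.append(token)
        context.append(token)
    committed.append(target_next(context))  # bonus token after full acceptance
    return committed


def toy_target(context: List[int]) -> int:
    """Hypothetical target model: always predicts (last token + 1) mod 10."""
    return (context[-1] + 1) % 10 if context else 0
```

With `prefix=[1, 2]` and `draft=[3, 4, 9, 6]`, the first two draft tokens are accepted, the third is replaced by the target's choice, and the rest of the block is discarded, so three tokens are committed for one verification round.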
ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference
Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01806v1
Multilinguality of Large Language Models From a Structural Perspective
Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01800v1
HarnessForge: Joint Harness and Policy Evolution for Adaptive Agent Systems
LLM agents are increasingly expected to operate across heterogeneous task regimes that require distinct execution paradigms. This challenges fixed agent systems and motivates system-level meta-adaptation beyond isolated component updates. While existing works have adapted external harness or trained...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01779v1
An Algebraic View of the Expressivity of Recurrent Language Models
What formal languages can a recurrent neural language model recognize? Formal results in the literature conflict: some authors report Turing-completeness, while others show equivalence to regular languages. The reason for this discrepancy is that the underlying arithmetic model differs. The paper de...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01765v1
TriAlign: Towards Universal Truth Consistency in Personalized LLM Alignment
Personalized large language models adapt responses to users' preferences and social attributes, but can introduce substantial universal truth inconsistencies across social groups, where some groups systematically receive less accurate responses on objective tasks. Existing alignment methods either i...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01755v1
Construction of Historical Knowledge Graphs Based on BERT and Graph Neural Networks
Through digital humanities research and large-scale historical data analysis, a significant amount of traditional historical text is converted into structured knowledge graphs. This paper provides a high-level architecture that combines Bidirectional Encoder Representations from Transformers (BERT) and g...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01747v1
THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models
Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining -- often degrading model utility -- or apply single-turn analysis independently at each turn, fa...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01738v1
Argument Collapse: LLMs Flatten Long-Form Public Debate
As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge to a smaller set of main arguments, sub-argument...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01736v1
RCEM: Embedder Equipped with Query Rewriting Skill for Robust Conversational Search in Distributional Shift
Conversational search has become increasingly important in retrieval-augmented generation (RAG) systems, where users interact with AI assistants through multi-turn conversations containing context-dependent queries. We propose RCEM, a conversational dense retrieval model that distills the query refo...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01697v1
Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM-guided search avoids this by scoring candidate continuations during generation, but requir...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01682v1
Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification
Multimodal LLMs are increasingly used to assist scientific peer review, where a core requirement is verifying whether claims in a paper are supported by its evidence. Prior work has shown that models perform substantially better at this task when the evidence is a table than when it is a chart of th...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01679v1
Why Do Self-Harm Prediction Models Struggle to Generalise? Lexical and Semantic Variations in Emergency Department Triage Notes
Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown robust performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we comp...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01678v1
MobEvolve: An Agentic Self-Evolving Heuristic System for Interpretable Human Mobility Generation
Human mobility generation aims to synthesize realistic trip chains for target populations based on individual features. Existing paradigms, including deep generative models, LLM-based methods, and traditional heuristics, struggle to satisfy the complex demands of this task while simultaneously maint...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01640v1
SiteBlob
Spot AI vibecoding fingerprints before launch
🧰 Tools · Jun 1, 2026 · https://www.producthunt.com/products/siteblob?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity
Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, bu...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01637v1
AlphaToken: Decoupling Adaptation and Stability for Path-Aware Response Token Valuation in LLM Post-Training
Token selection is pivotal for effective LLM post-training. However, existing methods mostly rely on local heuristics and rarely formulate token selection as a principled valuation of individual response tokens. We introduce AlphaToken, a response token valuation framework that decouples ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01635v1
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01629v1
Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents
Conversational tutoring agents have been shown to improve learning engagement and student outcomes, and large language models (LLMs) are increasingly used in these systems to provide scalable, personalized feedback. However, LLMs may perpetuate or amplify stereotypical social biases, posing particul...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01584v1
Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents
Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this introduces a new attack surface for agents to manage. We study two complementary directions for this threat. First, we evaluate guardian-based defenses: an interme...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01567v1
SentGuard: Sentence-Level Streaming Guardrails for Large Language Models
Large language models increasingly stream long, reasoning-intensive responses in real time, making when to moderate as critical as whether to moderate. Existing guardrails fall into two unsatisfactory extremes: response-level methods delay intervention until the full output is generated, whereas tok...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.02041v1
Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning
This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.02020v1
Automated Essay Scoring and Language Certification: Assessing Generalizability, Agreement and Validity for French
In Automated Essay Scoring (AES), benchmarking practices have fostered minimalist evaluation practices, in contrast with the broader-view recommendations of evaluation frameworks, such as the argument-based validation framework (ABV), which argued in favor of a multidimensional assessment of systems...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.02009v1
Scaling Agentic Capabilities via Grounded Interaction Synthesis
General agentic intelligence hinges on the ability to interact with diverse real-world tools to complete complex tasks, a capability fundamentally tied to the quality of interaction data. To bypass the prohibitive costs of human annotation, prevailing paradigms depend entirely on Large Language Mode...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.02001v1
Eyettention II: A Dual-Sequence Architecture for Modeling Fixation Location, Within-Word Landing Position, and Fixation Duration in Reading
The way our eyes move while reading provides valuable insights into both the reader's cognitive processes and the properties of the text. In particular, eye-tracking-while-reading data has been shown to be highly beneficial in various technological applications, such as enhancing and interpreting languag...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01964v1
HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression
Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipeline...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01934v1
Mitigating Bias in Locally Constrained Decoding via Tractable Proposals
Generations from large language models often fail to conform to desired constraints such as a JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses seq...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01926v1
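Editor's sketch for the constrained decoding item above: a small Python example of why masking next tokens one step at a time can bias sampling. It compares the distribution produced by local masking with the model's exact distribution conditioned on the constraint, using a two-token toy model. All probabilities, the vocabulary, and the allowed set are hypothetical numbers chosen for illustration, and the code does not reproduce the paper's proposed method.

```python
from typing import Dict

# Hypothetical toy model over two-token sequences from the vocabulary {a, b}.
VOCAB = ("a", "b")
FIRST = {"a": 0.5, "b": 0.5}                      # p(x1)
SECOND = {"a": {"a": 0.9, "b": 0.1},              # p(x2 | x1)
          "b": {"a": 0.5, "b": 0.5}}
ALLOWED = {"ab", "ba", "bb"}                      # constraint: "aa" is forbidden


def model_prob(seq: str) -> float:
    """Unconstrained probability of a two-token sequence."""
    return FIRST[seq[0]] * SECOND[seq[0]][seq[1]]


def exact_conditional() -> Dict[str, float]:
    """p(seq | seq is allowed): renormalize over whole valid sequences."""
    mass = {seq: model_prob(seq) for seq in ALLOWED}
    total = sum(mass.values())
    return {seq: p / total for seq, p in mass.items()}


def locally_masked() -> Dict[str, float]:
    """Distribution induced by masking invalid next tokens at each step."""
    out: Dict[str, float] = {}
    # Step 1: keep first tokens that can still be completed validly.
    ok_first = [t for t in VOCAB if any(t + u in ALLOWED for u in VOCAB)]
    z1 = sum(FIRST[t] for t in ok_first)
    for t in ok_first:
        # Step 2: keep second tokens that complete a valid sequence.
        ok_second = [u for u in VOCAB if t + u in ALLOWED]
        z2 = sum(SECOND[t][u] for u in ok_second)
        for u in ok_second:
            out[t + u] = (FIRST[t] / z1) * (SECOND[t][u] / z2)
    return out
```

In this toy setup, local masking assigns "ab" probability 0.5, while the exact conditional gives it 0.05 / 0.55, roughly 0.09. The mask cannot take back the mass already spent on the first token "a", even though the model strongly preferred the forbidden continuation "aa".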
ContinuousBench: Can Differentially Private Synthetic Text Improve Capabilities?
Differentially private (DP) text synthesis promises to unlock sensitive corpora for model training, but it remains unclear whether DP synthetic data transmits genuinely new knowledge and capabilities present only in those corpora. This is because existing evaluations rely on tasks that are nearly so...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01849v1
When Meaning Travels: A Granular Lens on Hybrid-MoE's Role in Idiomatic Understanding for Language Models
In the contemporary epoch of multilingual education, learning idioms provides a fascinating gateway towards creativity, cultural values, historical context, and diverse perspectives inherent to various linguistic traditions. This paper showcases the navigation of retaining figurative and cultural se...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01671v1
EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision
Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iterative...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01617v1
My Period AI: Cycle Tracker
AI-powered period tracking for real cycles
🧰 Tools · Jun 1, 2026 · https://www.producthunt.com/products/my-period-ai-cycle-tracker?utm_campaign=producthunt-api&utm_medium=api-v2&utm_source=Application%3A+the500feed+%28ID%3A+283491%29
RoboTrustBench: Benchmarking the Trustworthiness of Video World Models for Robotic Manipulation
Video world models are increasingly used in robotic manipulation, yet existing benchmarks mostly evaluate them under valid, feasible, and safe instructions. We introduce RoboTrustBench, a benchmark for evaluating the trustworthiness of video world models under four scenarios: Normal, Constraint-Sens...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.01600v1
Beyond $\ell_2$-norm and $\ell_\infty$-norm: A Curvature-Inspired $\ell_p$-Norm Scheme for Deep Neural Networks
The existing optimizers for deep neural networks (DNNs) typically rely on either the $\ell_2$ norm or the $\ell_\infty$ norm, resulting in optimizers that do not adapt well to substantial changes in curvature across parameter dimensions. Generally, the training process of DNNs often exhibits strong ...
📄 Research · Jun 1, 2026 · http://arxiv.org/abs/2606.02078v1