AI News Archive: May 26, 2026 — Part 13
Sourced from 500+ daily AI sources, scored by relevance.
- Thumbs.ai
- machineko
- Vmaker AI
- friend2chat
- UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-po...
- Constitutional Arms Races in the Public Goods Game: Co-Evolving LLM Constitutions Under Cooperation-Defection Pressure
Frontier LLM agents engage in blackmail, sabotage, and document leaks under goal conflicts in agentic settings, exposing limitations of alignment methods built around single-agent or cooperative assumptions. Recent work shows LLM-guided evolutionary search can discover effective cooperative constitu...
- Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study
Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, del...
- Atari Games Challenge: A Pilot Study on Multimodal Player Experience Assessment
We present a pilot study on the collection and synchronisation of multimodal data for player experience investigation. We collected game telemetry, self-reported surveys, biometrics, and cued-retrospective think-aloud (C-RTA) data from 19 participants playing three Atari 2600 games. The study then u...
- Granuscore: A Reference-Free Measure of Granularity for Text Analysis and Question Answering
Natural language conveys information at varying levels of granularity, from fine-grained references to broad descriptions. While granularity is fundamental to human communication, existing measures mostly capture surface detail or sentence specificity. We introduce Granuscore, a reference-free measu...
- Rethinking AI Psychosis: Misnomers, Conceptual Limits, and Existential Drift
There has been a proliferation of media reports about so-called AI psychosis in the last year. Not surprisingly, this has prompted growing academic work on the ways in which AI chatbots such as ChatGPT, Claude, and Replika might aggravate or even induce psychosis, typically understood in terms of us...
- Landseer: Exploring the Machine Learning Defense Landscape
Machine learning systems face diverse threats that undermine robustness, privacy, and fairness. Although many defenses have been proposed, each typically addresses a single risk in isolation. Real-world deployments, however, require these defenses to be composed to meet multiple guarantees simultane...
- Practical Anonymous Two-Party Gradient Boosting Decision Tree
Structured data is well handled by gradient-boosted decision trees (GBDT), which are usually trained on vertically partitioned features across mutually distrustful parties. High speed and interpretability make GBDTs popular in finance and healthcare, where neural networks may fall short. Enabling se...
- Privacy-Preserving Screening for Record Linkage
In an era dominated by big data and machine learning, establishing valuable data collaboration has never been more critical. However, such collaborations must operate under regulatory and legal constraints. Two-party Privacy-Preserving Record Linkage (PPRL) emerges to assess the potential collaborat...
- Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control
Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning where adversarially optimized documents manipulate generated outputs. Existing defenses assume that detecting poisoned evidence prevents harm. We show this assumpt...
- Cordyceps: Covert Control Attacks on LLMs via Data Poisoning
Large language models (LLMs) are often fine-tuned on uncurated text datasets that adversaries can poison. Existing poisoning attacks primarily rely on fixed trigger phrases that defenses such as outlier detection, clean-data regularization, or online monitoring can neutralize. In this paper, we prop...
- GradSentry: Gradient Spectral Entropy for Backdoor Sample Filtering in Large Language Model Fine-Tuning
Fine-tuning Large Language Models with untrusted data exposes models to backdoor attacks, where poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient ...
- SEC-bench Pro: Can Language Models Solve Long-Horizon Software Security Tasks?
Large language models (LLMs) now support automated software security tasks, including vulnerability discovery and proof-of-concept (PoC) generation. Existing benchmarks do not faithfully evaluate LLMs in real-world bug hunting scenarios because they rely on fuzzing harnesses, target-specific descrip...
- Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks
Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model. Yet, pretrained LLMs already encode...
- Aligning Provenance with Authorization: A Dual-Graph Defense for LLM Agents
LLM-based agents are increasingly deployed in high-stakes scenarios such as email management, financial transactions, and code execution, where they interact with the external world through tool calling. During execution, these agents must read external data sources (emails, webpages, files) that at...
- ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation
Tool-using agents increasingly operate in open-ended deployment environments, where they compose file systems, web APIs, code interpreters, and enterprise services at runtime. This creates a safety gap in tool composition: an agent can satisfy every per-tool permission check and still produce an uns...
- Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models
Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In this paper, we formalize the behavioral geometry of a population of mo...
- EviACT: An Evidence-to-Action Framework for Agentic Program Repair
LLM-based agents have moved automated program repair (APR) from fixed-context patch generation to interactive repository-level repair. However, existing agentic APR systems still struggle to use execution evidence to guide localization, patch generation, and validation. We propose EviACT (Evidence-t...
- Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
LLMs deployed in high-stakes domains face fundamental reliability challenges: hallucinations, inconsistencies, and privacy vulnerabilities introduce unacceptable risks where errors carry legal, financial, or safety consequences. This paper presents a hybrid verification architecture combining formal...
- Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton
Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principl...
- LLM-based Mockless Unit Test Generation for Java
Large language models (LLMs) have shown strong potential for automated test generation, yet most approaches to generating Java unit tests still rely on mocking frameworks to handle dependencies. Mockless test generation could exercise more real low-level code, but it faces challenges such as invalid...
- HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HT...
- TrajAudit: Automated Failure Diagnosis for Agentic Coding Systems
Agentic systems have been widely studied to automate software engineering jobs such as bug fixing. As these systems increasingly tackle complex tasks, understanding where and why they fail becomes essential for iterative refinement and operational reliability. Existing automated failure diagnosis ap...
- Testing Agentic Workflows with Structural Coverage Criteria
Multi-agent systems increasingly expose explicit workflow structure: agents, tools, tool-access rules, restrictions, and delegation paths. Existing evaluations rely largely on end-to-end task success, benchmark scores, final-response quality, or prompt-level checks, which provide limited evidence th...
- Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization
AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal ...
- CFMDCTCodec: A Low-Bitrate Neural Speech Codec with Noise-Prior-aware Conditional Flow Matching for MDCT-Spectral Enhancement
High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neural speech codec tha...
- Rethinking Agentic RAG: Toward LLM-Driven Logical Retrieval Beyond Embeddings
Recent advances in RAG have shifted toward an agentic paradigm, where LLMs interact with retrieval systems over multiple turns and iteratively refine queries based on intermediate results. At the same time, LLMs have demonstrated a strong ability to construct structured queries that precisely expres...
- Lost in the Evidence? Reproducing Document Position and Context Size Effects in RAG
Retrieval-Augmented Generation (RAG) systems rely on retrieved documents being concatenated into a model's input context, making both document ordering and context size critical yet controversial design choices. Prior work reports position-based effects such as lost in the middle and related long-co...
- MuChator: Enabling Active Music Discovery via Conversational Music LLMs in Douyin Music
Douyin Music, a large-scale platform with millions of daily users, adopts an immersive, feed-based discovery paradigm, where users passively explore music through continuous recommendations. While effective for passive music discovery, this paradigm restricts users to recommendation results and prov...
- The 2nd EReL@MIR Workshop on Efficient Representation Learning for Multimodal Information Retrieval
Multimodal representation learning has attracted increasing attention in AI, driven by the strong performance of large, pretrained multimodal foundation models such as Qwen, LLaVA, and CLIP. These models deliver impressive performance on a range of multimodal information retrieval (MIR) tasks, inclu...
- L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation
Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specific preferences while effectively leveraging both behavioral and semantic signals. Existing approaches typically integrate these signals at either the input level...
- FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing
Retrieval-Augmented Generation (RAG) has become critical for knowledge-intensive applications, yet evaluating its performance in vertical domains remains difficult due to domain complexity, diverse context scales, and heavy reliance on expert assessments that are costly, inconsistent, and non-scalab...
- Plans for Evaluating Structured Generative Search Summaries
We propose a framework for evaluating structured generative search summaries that are placed atop organic web search results. A structured summary, generated by a large language model, typically consists of an overview, several sections with section titles, and a list of source documents that are ci...
- ICICLE: Expanding Retrieval with In-Context Documents
Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge. However, this design makes corpus expansion costly: adding new documents requires updating model parameters to encode new document-docid associations, which incurs repeated training and catastrophic...
- RAGEAR: Retrieval-Augmented Graph-Enhanced Academic Recommender
We present RAGEAR (Retrieval-Augmented Graph-Enhanced Academic Recommender), a neurosymbolic recommender system for academic course recommendation. RAGEAR combines dense retrieval over full lecture transcripts with a symbolic Knowledge Graph modelling courses, lessons, transcript chunks, credits, st...
- Is Position Bias in Dense Retrievers Built In or Learned from Data?
Dense retrievers exhibit positional bias, favoring documents whose query-relevant information appears near the beginning and degrading retrieval performance when the information appears later. While prior work on positional bias in dense retrievers has largely focused on architectural explanations, ...
- Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation
With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, scor...
- Beyond the Billion: Dose-Response Immunophenotyping and Machine Learning Classification of Live versus Heat-Inactivated Gram-Positive Probiotic Strains in Human Peripheral Blood Mononuclear Cells
Probiotic research is constrained by three pervasive yet insufficiently challenged assumptions: the requirement for a minimum of one billion colony-forming units for efficacy, the necessity for gut colonization, and the inherent superiority of live over inactivated preparations. This study addresses these gaps through a fully factorial experimental design evaluating ten Gram-positive probiotic strains in both viable (Active Fluorescent Units, AFU) and heat-inactivated (Total Fluorescent Units, TFU) forms across three flow cytometry-verified concentrations (10^5, 10^6, 10^7 cells/well per ISO 19344:2015) in primary human peripheral blood mononuclear cells (PBMC) from a single healthy male Caucasian donor (58 years), with simultaneous quantification of 17 cytokines by BioPlex suspension array. Viable preparations induced profoundly greater absolute cytokine responses than heat-inactivated preparations across 14 of 17 analytes, but heat-inactivated preparations demonstrated stronger dose-response correlations (mean within-strain Spearman ρ up to 1.00) for 13 of 17 cytokines, a finding we attribute to the uncontrolled proliferation of live bacteria during 24-hour co-culture compressing the effective concentration range. Six of ten viable strains exhibited monotonically increasing profiles; two strains displayed non-monotonic bell-shaped kinetics with peak activity at 10^6 AFU/well and significant attenuation at 10^7, directly falsifying the assumption that dose escalation uniformly increases immunological activity. MCP-1 was the sole cytokine showing no significant difference between viability states (p = 0.61, fold-change 1.1), providing an internal methodological control. In this single-donor model, unsupervised hierarchical clustering identified three immunological phenotype clusters, requiring multi-donor validation before these groupings can be treated as generalizable biological phenotypes, with Random Forest classification achieving 86.7% internal partition-recovery consistency (clusters derived from the same data; not an estimate of generalization to novel strains) versus 33.3% chance. In this single-donor experiment, IL-13, IL-12p70, and IFN-γ, not IL-6 or IL-1β, were the primary discriminators of strain identity; generalizability of this ranking requires multi-donor validation. Heat-inactivated preparations achieved ≥70% functional equivalence relative to viable preparations at 10^7 TFU/well for the majority of responsive strains (Functional Equivalence Dose, FED70), while one strain remained immunologically inert in heat-inactivated form across all concentrations, a finding subject to the caveat that no positive control stimulus was included to formally verify PBMC functional competence on the experimental day. These findings establish a methodological framework integrating flow cytometric standardization, multiplex immunophenotyping, and machine learning for evidence-based dose characterization, postbiotic functional equivalence assessment, and data-driven strain classification in probiotic research (all p-values are descriptive within a single-donor experimental context).
- Longitudinal Awake Mouse Brain Imaging Using Functional Ultrasound and Functional Ultrasound Localization Microscopy
Ultrafast ultrasound offers a unique route to cross-scale neurovascular phenotyping by integrating functional ultrasound (fUS), ultrasound localization microscopy (ULM), and functional ULM (fULM). Yet the baseline variability, longitudinal stability, and biological safety of such multimodal imaging in awake animals remain insufficiently defined, limiting its use for detecting subtle disease-associated neurovascular changes. Here, an awake longitudinal fUS-ULM-fULM framework is established and validated in mice over five months. Structural vascularity, microvascular flow velocity, mesoscale hemodynamic responses, and microvascular functional responses are repeatedly quantified in the same animals during monthly imaging sessions. Across all metrics, no significant longitudinal drift is detected (p > 0.60). Structural and flow-derived measures are markedly more reproducible than functional readouts, with within-subject coefficients of variation of 5.1% for mean flow velocity and 7.3% for vascularity, compared with 25.0% for fUS-derived cerebral blood volume responses and 53.2% for fULM-derived microvascular functional responses. Mean flow velocity shows the strongest longitudinal consistency (ICC = 0.70) and the lowest detection threshold. Behavioral testing and GFAP/Iba1 staining further reveal no memory impairment or chronic neuroinflammation. This study defines quantitative baselines, reproducibility limits, and safety evidence for awake cross-scale ultrasound imaging, providing a reference framework for longitudinal neurovascular phenotyping in preclinical disease models.
- SCALLOPS: a scalable, integrated computational framework for Optical Pooled Screens
Optical pooled screens (OPS) link pooled genetic perturbations to high-dimensional image-based phenotypes at scale, but their widespread adoption is hindered by computational bottlenecks in processing terabyte-scale, multimodal image data. We present SCALLOPS, a unified, modular, and cloud-native computational framework that overcomes these bottlenecks. SCALLOPS implements a "well-centric" processing strategy that integrates robust stitching with a non-linear two-stage registration strategy, enabling accurate alignment of multi-magnification images, reliable single-cell genotype-phenotype linkage, and efficient morphological feature extraction. Benchmarking with public and newly-generated datasets demonstrated SCALLOPS' superior performance over existing solutions. Crucially, SCALLOPS uniquely enables robust processing of 4x magnification in situ sequencing data, accelerating image acquisition by around six-fold. We applied SCALLOPS to an optical pooled screen investigating the estrogen receptor (ER) degrader vepdegestrant in a breast cancer cell line, successfully recovering its known mechanism of action, highlighting the value of OPS in translational research. SCALLOPS provides a scalable end-to-end solution, making large-scale OPS routine.
- Comparative Study on Image Quality of Deep Learning and Adaptive Statistical Iterative Reconstruction-V in Thin Layer CT of Liver Lesions
Objective: This study aims to compare the advantages and disadvantages of deep learning image reconstruction (DLIR) and adaptive statistical iterative reconstruction-V (ASIR-V) in thin-slice (2.5 mm) CT images of hepatic lesions characterized by high and low contrast. Additionally, the study seeks to determine the optimal DLIR strength for the evaluation of liver lesions. Methods: A retrospective analysis was performed on 90 patients who underwent abdominal contrast-enhanced CT scans. Group A comprised 48 patients with low-contrast lesions, while Group B included 42 patients with high-contrast lesions. The acquired images were reconstructed using post-processing DLIR at low (DLIR-L), medium (DLIR-M), and high (DLIR-H) strengths, all with a slice thickness of 2.5 mm (subgroups A1-A3, B1-B3). Furthermore, images were reconstructed with ASIR-V at 50% strength at slice thicknesses of 2.5 mm and 5 mm (subgroups A4/B4 and A5/B5, respectively). CT values and standard deviations (SD) of the liver and lesions were measured, and the corresponding signal-to-noise ratio (SNR) and contrast-to-noise ratio (CNR) were calculated. The edge rise slope (ERS) was determined using ImageJ software by measuring CT values along a line from the liver parenchyma to the lesion. Objective metrics were compared using one-way ANOVA, with independent samples t-tests applied for inter-group differences. Subjective scoring, which encompassed noise level, diagnostic confidence, and lesion margin delineation, was conducted by two radiologists, with differences analyzed using the Kappa test. Results: Objective evaluation revealed a progressive decrease in lesion SD and a progressive increase in SNR and CNR from subgroups A1/B1 to A3/B3. The SD of subgroup A2 decreased by 57.4% compared to A4, while the SNR and CNR of A2 increased by 19.3% and 24.6%, respectively, compared to A4. Although subgroup B2 had a lower SNR than B5, the difference was not statistically significant. SNR and CNR in B2 increased by 24.1% and 11.9%, respectively, compared to B4. ERS gradually decreased from A1/B1 to A3/B3. ERS values in A2 and B2 increased by 27.0% and 39.4%, respectively, relative to A5 and B5. Although A3 had a lower ERS than A1 and A2, all DLIR subgroups exhibited higher ERS than A5; similar trends were observed in Group B. Subjective evaluation indicated good inter-reader agreement (Kappa > 0.61, p < 0.05). As DLIR strength increased, noise scores rose progressively in both groups. However, noise in A2 and B2 was lower than in A4/A5 and B4/B5. Diagnostic confidence and lesion margin delineation scores were highest in A2 and B2, while all subjective scores were lowest in A5 and B5. Discussion: Most prior studies evaluated the liver or vessels, or confirmed that image quality can be guaranteed at low doses. However, there are few studies on specific individual lesions. Therefore, this study aims to investigate specific individual lesions. The details and detection rate were analyzed separately to confirm the clinical acceptability of 2.5-mm DLIR images in lesions of different contrast. Conclusion: For both high- and low-contrast hepatic lesions, DLIR provides superior image quality compared to ASIR-V, with the 2.5 mm DLIR-M setting being optimal. DLIR-M reduces image noise, improves spatial resolution, and produces images more suitable for diagnostic purposes.
- Randomised Trial of a Multilingual Conversational AI for Preoperative Education
Background: Informed consent depends on patients' understanding of anaesthesia risk, yet comprehension remains poor despite routine preoperative consultation. Conversational artificial intelligence (AI) could establish patient-reported understanding before clinician contact, but whether such systems can achieve patient-reported understanding comparable to clinician-delivered education remains unknown. Methods: We conducted a randomised equivalence trial (n = 130) of PEAR (Preoperative Education of Anaesthesia Risks), a multilingual retrieval-augmented conversational AI grounded in institutional consent materials, versus standard preoperative consultation in adults undergoing elective surgery. Results: A total of 130 adults (mean age 52.4 ± 14.5 years) were enrolled. Post-consultation understanding scores in the PEAR group met the pre-specified equivalence criterion compared with standard consultation across all three primary measures. Patients who interacted with PEAR before clinician contact achieved understanding scores comparable to those receiving standard face-to-face consultation alone. PEAR reduced documentation and consultation time, corresponding to a projected annual net benefit of approximately SGD 0.99 million (USD 0.78 million) at a single tertiary centre. Conclusions: A retrieval-augmented conversational AI achieved patient-reported understanding of anaesthesia risk equivalent to standard preoperative consultation while substantially improving workflow efficiency. These findings support supervised deployment of conversational AI within perioperative care pathways while preserving clinician oversight for verification and patient-specific decision-making.
- A Consensus-Driven Stacking Ensemble Framework for Interpretable Cardiovascular Risk Prediction and Clinical Deployment
Machine learning (ML) is being considered to help diagnose cardiovascular diseases (CVD). Still, challenges like inconsistent and limited datasets, limited infrastructure, and global inequalities lead to the need for a reliable and practicable ML solution. This paper presents an ML-driven framework for predicting CVD risk scores and classifying status. Several data preprocessing techniques, including multiple imputation by chained equations (MICE) and outlier removal, are applied. In addition, hyperparameter tuning is performed with the GridSearchCV tuning technique. Moreover, a consensus-driven five-feature selection method is applied to identify optimal predictors. The dataset used in this study contains healthcare records related to future CVD risk scores, comprising 1,529 patient records with 22 features. The optimized stacked ensemble model is applied to the dataset and achieves a cross-validated coefficient of determination value of 98.13% for CVD risk score regression. Comparative evaluation with other ML models confirmed improved accuracy, efficiency, and interpretability. The explainable AI technique SHAP is applied to interpret predictions and highlight key risk factors. Moreover, a deployment-ready web platform with multi-role access has been developed that demonstrates clinical applicability. The proposed framework offers a reliable and interpretable tool for early detection of CVD and personalized risk assessment. In the future, this work can be extended to integrate longitudinal data, medical imaging, and deep learning to improve generalizability and strengthen real-world impact.
- Automated quantification of cerebral microbleeds for ARIA-H monitoring in Aging and Alzheimer's Disease: A multicenter deep learning validation
We trained a self-configuring nnU-Net model for CMB segmentation in a heterogeneous multicenter sample (n=264), including 1.5T and 3T field strengths, SWI and T2*-GRE sequences, and community and clinical cohorts. Model performance was evaluated using 5-fold cross-validation with a focus on object-level detection metrics. Real-world performance was evaluated on scans from an unseen dataset of people with cerebrovascular disease (n=20). The model achieved 0.82 cluster Dice, 0.88 precision, and 0.77 sensitivity on hold-out test data. Notably, the model demonstrated a low false-positive rate, averaging 0.58 false positives (FPs) per scan, an improvement on existing publicly available models. The model achieved high performance in a dataset of people with Alzheimer's disease and mild cognitive impairment (0.89 cluster Dice, 0.94 sensitivity), supporting its utility in clinical settings where ARIA-H monitoring is critical. In external validation, the model maintained high robustness with 0.79 sensitivity and 0.95 FPs per scan. By leveraging a heterogeneous training strategy and a self-adapting architecture, we demonstrate that deep learning can achieve high-precision CMB detection that is robust to domain shifts. The low FP rate suggests this publicly available pipeline is suitable for automated screening and lesion counting in heterogeneous large-scale clinical trials, reducing the burden of manual quantification.
- Frontier Large Language Models for Comprehensive Medication Review in CKD Patients with Polypharmacy: A Trap-Embedded Synthetic Benchmark
Background: Patients with CKD and polypharmacy face high rates of drug-related problems, yet comprehensive medication review remains time-intensive and inconsistently performed. Large language models (LLMs) may augment this process, but existing benchmarks use multiple-choice formats that do not reflect open-ended, nephrology-specific review. We developed a trap-embedded synthetic CKD benchmark and evaluated five current-generation LLMs (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 4.1 Fast, DeepSeek R1; tested April-May 2026) for open-ended medication review. Methods: Fifty synthetic CKD cases across three complexity groups (G3a-G3b [n=20], G4 [n=15], G5/G5D/transplant [n=15]) with 8-12 medications and ≥2 embedded clinical traps each were scored against nephrologist-adjudicated gold standards. Each model produced three independent responses per case (temperature 0; 750 total outputs). Primary endpoint was per-case macro F1; secondary endpoints were safety-critical omission rate, PI-adjudicated hallucination rate, and intra-model consistency. Blinded inter-rater reliability for gold-standard item detection was assessed on a 30% sample. Results: Consensus-level macro F1 ranged from 0.41 (Claude Sonnet 4.6) to 0.49 (Grok 4.1 Fast) (Friedman P < 0.001). Phosphate binder timing (11%) and hyperkalemia combinations (33%) were poorly detected across all models. Safety-critical omission rate ranged from 22% to 48% (P < 0.001); PI-adjudicated hallucination ranged from 0% (GPT-5.4) to 54% (DeepSeek R1), including fabricated dose caps and non-existent guideline citations. Blinded reliability for gold-standard item detection was high (kappa = 0.934, n = 92). Conclusions: This nephrology-specific benchmark exposes clinically important LLM blind spots that generic multiple-choice evaluations would not detect. Heterogeneous hallucination and omission rates indicate that model selection and domain-specific guardrails should precede any clinical deployment of LLM-assisted CKD medication review. Prospective validation with real patient data and human comparators is required before deployment recommendations can be made.
- Micron closes in on $1 trillion market value as UBS triples share price target
Reuters