AI News Archive: June 10, 2026 — Part 19
Sourced from 500+ daily AI sources, scored by relevance.
- CCKS: Consensus-based Communication and Knowledge Sharing
In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teach...
- Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework
Automating compliance check for geometry-intensive regulations remains a significant technical bottleneck in Building Information Modeling (BIM), primarily due to the semantic disparity between high-level regulatory logic and structured IFC data. Existing methods, often reliant on static rule templa...
- Sovereign Assurance Boundary: Certificate-Bound Admission for Agentic Infrastructure
Agentic infrastructure introduces a critical control-plane authorization problem: non-deterministic reasoning systems can propose high-stakes mutations to production resources, yet existing security mechanisms -- such as identity and access management (IAM), policy engines, consensus protocols, and ...
- Identifying cybersickness causes in virtual reality games using symbolic machine learning algorithms
Virtual reality (VR) and head-mounted displays are constantly gaining popularity in various fields such as education, military, entertainment, and health. Although such technologies provide a high sense of immersion, they can also trigger symptoms of discomfort. This condition is called cybersicknes...
- Understanding and Supporting Online Discussion with Opinionated Chatbots
Opinionated chatbots are increasingly present on online platforms and have the potential to shape public discourse by influencing individuals' viewpoints before they engage in discussions. Despite their growing presence, the impact of interacting with opinionated chatbots on subsequent online intera...
- Learning by Chatting? Investigating the Impact of Generative AI on Information Seeking and Learning
Generative AI (GenAI) tools offer increasing opportunities for augmenting human cognitive tasks. Among these tasks, information seeking is being rapidly reshaped by GenAI tools, with potentially profound implications for learning and knowledge acquisition. To investigate these implications, we condu...
- OCELOT: Inference-Leakage Budgets for Privacy-Preserving LLM Agents
Large language model (LLM) agents increasingly act on a user's behalf -- reading personal files, calling tools, transacting with external services -- possibly leaking personally identifiable information (PII) across trust boundaries at every step. Privacy here is a property not of a single output bu...
- Selection Integrity for LLM Graph Memory: An Accumulability Criterion for Information-Flow-Blind Retrieval
Agent memory is moving to graphs, and the provenance defenses now being built for it all check one thing: the provenance of the records an agent retrieves. We show that this entire class of defense is blind by construction. A long-term graph memory runs a global selection step over writable graph st...
- Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization
Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by ...
- Mind your key: An Empirical Study of LLM API Credential Leakage in iOS Apps
The rapid integration of large language models (LLMs) into mobile applications has introduced a new class of credential security risk: leaked credentials that grant unauthorized access to LLM inference services, causing financial damage to developers. Prior work on credential leakage has focused pri...
- Systematic Cybersecurity Risk Analysis of European Rail Traffic Management System
European Rail Traffic Management System (ERTMS) is a widely adopted standard unifying train management in the EU. While the standard allows for use cases like fully autonomous driving, cybersecurity has been an afterthought. Risk analysis enables the systematic assessment and prioritization of threa...
- Can Open-Source LLM Agents Replace Static Application Security Testing Tools? An Empirical Assessment
This paper explores the value of agentic AI tools for cybersecurity purposes. We evaluate the efficacy of a general-purpose GenAI Large Language Model- (GenAI-) based agent when powered by three different Ollama-hosted general-purpose open source models. We assess each agent's performance using prec...
- Runtime Skill Audit: Targeted Runtime Probing for Agent Skill Security
Agent skills let LLM agents reuse instructions, resources, tools, and workflows, but they also create a new place for malicious behavior to hide. A skill may look benign in its documentation or code while becoming harmful only when it is invoked with particular user requests, local assets, persisten...
- Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks
The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating targ...
- Optimisation of steatotic liver disease screening algorithm for resource-poor settings using machine learning
Background The European Association for the Study of the Liver (EASL) - Steatotic Liver Disease (SLD) screening algorithm involves two steps; initial screening with FIB-4 followed by referral for vibration-controlled transient elastography (VCTE) in patients likely to have significant fibrosis (SF). However, VCTE is not widely available in resource-limited settings. Aim To optimise the EASL SLD screening algorithm for resource-poor settings using machine learning (ML). Methods We analysed data from 964 adults aged ≥35 years who underwent VCTE at a tertiary referral centre in Sri Lanka between November 2024 and 2025. Multiple ML models using different methods and variable combinations were trained on 80% of the dataset and tested on the remaining 20%. Best models were selected based on performance and externally validated using data from 430 patients who underwent VCTE before November 2024. Model performance was compared with the FIB-4 using confusion matrices. Results A Random Forest model incorporating age, AST, ALT, and platelet count separately, rather than using FIB-4, outperformed. The all-variable ML model showed the best predictive performance for SF, with accuracy of 77.2%, recall of 0.762, precision of 0.778, and AUC-ROC of 0.818. The variables used in the model, in descending order of feature importance, were AST, platelet count, BMI, ALT, age, diabetes mellitus, hypertension, dyslipidaemia, sex, family history, hypothyroidism, diabetes complication and smoking. External validation demonstrated 75.1% accuracy and an AUC of 0.779. When used as the first step of the SLD screening algorithm, the all-variable ML model identified 37 (17.1%) additional true positives and reduced false-negative diagnoses by 50% compared with FIB-4. Conclusions ML-based models were more effective than the FIB-4 score as the first-line screening tool for VCTE referral, substantially improving the identification of patients with significant fibrosis in this South Asian cohort.
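For context, FIB-4 is a fixed formula over the same four inputs that the Random Forest model above consumes separately. The sketch below uses the standard FIB-4 definition; the function name and the patient values are hypothetical and chosen only for illustration, and nothing here reproduces the study's models.

```python
import math

def fib4(age_years: float, ast_u_per_l: float,
         alt_u_per_l: float, platelets_1e9_per_l: float) -> float:
    """Standard FIB-4 index: (age x AST) / (platelets x sqrt(ALT))."""
    return (age_years * ast_u_per_l) / (platelets_1e9_per_l * math.sqrt(alt_u_per_l))

# Hypothetical patient (illustrative values, not taken from the study).
score = fib4(age_years=50, ast_u_per_l=40, alt_u_per_l=25, platelets_1e9_per_l=200)
print(round(score, 2))  # 2.0
```

Because the formula hard-codes how the four variables interact, a model given the raw variables is free to learn a different weighting, which is one plausible reading of why the separate-variable model did better.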
- Defense Against Prompt Inversion Attacks: An Information-Theoretic Approach for LLM Collaborative Inference
Collaborative edge-cloud inference enables resource-constrained devices to leverage large language models (LLMs) by offloading partial computation to cloud servers. However, transmitting intermediate activations exposes sensitive user prompts to prompt inversion attacks, where an adversary reconstru...
- A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents
Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access control, data-loss prevention, perimeter inspection -- governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context...
- Bridging the Smart City Cybersecurity Data Gap Through AI-Driven Synthetic Dataset Generation
Smart cities rely on interconnected cyber-physical systems that integrate sensors, IoT devices, cloud platforms, and AI-driven services and decision-making. While these systems enhance city services, they also introduce complex cybersecurity challenges due to their large attack surfaces, heterogeneo...
- Jaguar: Fast Private CNN Inference with Power-of-Two Homomorphic Arithmetic
Hybrid HE/2PC private CNN inference remains bottlenecked by prime-modulus homomorphic arithmetic in convolution and by a precision flow that runs ReLU at doubled bitwidth before invoking a separate truncation protocol. We present Jaguar, a system built on a single design choice--a power-of-two ciphe...
- SwarmSense-DNN: A Trustworthy and Decentralized Neural Framework for Proactive Anomaly Defense in Consumer IoT
The rapid growth of consumer IoT devices has introduced unprecedented challenges in trustworthy anomaly detection against AI-enabled cyber threats, requiring real-time, privacy-preserving, and scalable defense mechanisms. Traditional centralized strategies face critical limitations, including commun...
- T2S: A Rehearsal-Based Approach for Extraction-Resistant Model Watermarking
Model watermarking safeguards AI model intellectual property by embedding distinctive knowledge that induces unique behavioral signatures. The primary technical challenge lies in ensuring watermark robustness against various post-processing attacks on the watermarked model. Model extraction attacks ...
- Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices
Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and de...
- Rule Taxonomy and Evolution in AI IDEs: A Mining and Survey Study
The adoption of AI-powered Integrated Development Environments (AI IDEs) has introduced "Rules" as a novel software artifact, allowing developers to persistently inject project-specific constraints and architectural guidelines into the context of Large Language Models (LLMs). Despite their role in a...
- Enhancing LLM-Based Code Translation with Verified Multi-Semantic Representations
Large language models (LLMs) have shown great promise for automated code translation, yet existing approaches often rely on token-level statistical patterns rather than sufficient understanding of program semantics. As a result, translated programs may still contain logical and semantic errors. Alth...
- Acoda: Adversarial Code Obfuscation for Defending against LLM-based Analysis
With the widespread adoption of Large Language Models (LLMs) in software engineering (SE) tasks such as code understanding, debugging, and vulnerability detection, their powerful semantic reasoning ability has also introduced new security and privacy risks. LLMs can analyze, reconstruct, or even rev...
- SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior
Agent Skills augment large language model (LLM) agents with procedural knowledge at inference time, but current benchmarks rarely distinguish what a Skill says from how it is organized. We study this distinction through Progressive Disclosure, where a concise root file points agents to supporting re...
- Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation
Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics...
- Tight Boundary Prediction in Speaker Diarization Using Causal-Anticausal Consistency
Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to int...
- Benchmarking Neural Speech Compression from a Rate-Distortion Perspective
Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability mode...
- HALO: Half-Frame-Rate Adaptive Learnable Operator for Lightweight STFT-Based Speech Enhancement
STFT-based speech enhancement typically adopts overlapping analysis frames. While overlap is essential for stable STFT processing, it makes adjacent frames highly correlated, causing redundant computation in lightweight models. We propose Half-frame-rate Adaptive Learnable Operator (HALO), a causal ...
- Transcriptomic Architecture of Type 2 Diabetes in Human Pancreatic Islets: An Integrative Meta-Analysis and Machine Learning Framework for Biomarker Discovery
Background. Type 2 diabetes mellitus (T2D) is defined by progressive pancreatic β-cell dysfunction whose molecular underpinnings remain incompletely understood. Single-cohort transcriptomic analyses of donor islets have yielded heterogeneous gene lists of limited cross-study reproducibility, constraining both mechanistic interpretation and biomarker development. Methods. We combined two complementary analytical strategies applied to four public human islet transcriptomic cohorts (GSE25724, GSE20966, GSE38642, and GSE164416; n = 7-57 donors per contrast). For the integrative arm, three microarray datasets and one bulk RNA-seq dataset were processed independently and unified through gene-level random-effects meta-analysis, hallmark pathway scoring (GSVA/MSigDB), and iterative module refinement, yielding a two-axis disease framework. For the diagnostic arm, a consensus multi-method machine learning pipeline, combining LASSO penalized logistic regression, Support Vector Machine Recursive Feature Elimination (SVM-RFE), and Random Forest importance scoring, was applied to 184 differentially expressed genes from the RNA-seq cohort, with all normalization steps performed within leave-one-out cross-validation (LOOCV) folds to prevent data leakage. Machine learning classification of the RNA-seq cohort was additionally subjected to external transportability testing in the independent bulk human islet RNA-seq cohort GSE50244 using an overlap-restricted reduced score and a threshold fixed in the discovery cohort. Results. Meta-analysis across all four cohorts identified 337 high-confidence T2D-associated genes (96.1% directional concordance in beta-cell-enriched tissue).
These were distilled into two refined 14-gene modules: ImmuneStress (MICB, HLA-DRA, HLA-DPA1, IL1R2, and others) and BetaCellIdentitySecretion (RASGRP1, PPP1R1A, SLC2A2, and others), whose composite IsletDysfunctionScore provided the most stable cross-platform separation of non-diabetic from T2D islets (Hedges' g = 1.80, p = 9.83 × $10^{-17}$, $I^2$ = 0%). Consistent with progressive disease, IsletDysfunctionScore increased monotonically from non-diabetic to impaired glucose tolerance to T2D. Separately, the machine learning pipeline derived a 10-gene diagnostic panel: GABRA2, SLC2A2, ARG2, DKK3, PRIMA1, TAFA4, HHATL, PARVG, RNU1-70P, and the novel lncRNA ENSG00000284653, that achieved perfect discrimination in LOOCV (AUC = 1.000, sensitivity = 1.000, specificity = 1.000, zero misclassifications across all 57 donors). A leakage-verification experiment confirmed that this performance reflected genuine biological signal: global quantile normalization prior to cross-validation collapsed AUC to 0.380. External testing showed that 8 of the 10 panel genes were measurable in GSE50244. The frozen 8-gene reduced score retained strong discrimination (external AUC = 0.907), with 6 of 8 genes preserving directional concordance, but the discovery-derived threshold did not transfer because the external score distribution was shifted upward and compressed, yielding complete sensitivity but zero specificity at the frozen cutoff. Conclusions. Integrating pathway-level meta-analysis with machine learning classification, we present a coherent two-axis model: immune/stress activation and loss of beta-cell identity/secretory competence, together with a compact, biologically interpretable 10-gene diagnostic signature. Panel genes converge on GABA signaling, glucose transport, arginine metabolism, WNT pathway inhibition, and a novel lncRNA, providing both mechanistic hypotheses and high-priority targets for external validation.
These findings offer a reproducible transcriptomic scaffold for future mechanistic, biomarker, and clinical translation studies of human islet dysfunction. They also support external transportability of the core biological signal, while indicating that absolute operating thresholds are cohort-dependent and would require recalibration before deployment in independent datasets.
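The abstract's point about performing normalization inside each LOOCV fold can be illustrated with a small sketch. This is not the authors' pipeline: it swaps in z-score standardization and a nearest-centroid classifier for their quantile normalization and LASSO/SVM-RFE/Random Forest consensus, and the toy data and all function names are hypothetical. What it shows is only the fold discipline: scaling parameters are estimated from the training fold and then applied to the held-out sample.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Estimate per-feature mean and standard deviation from training rows only."""
    cols = list(zip(*rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]

def apply_scaler(row, mus, sds):
    return [(x - m) / s for x, m, s in zip(row, mus, sds)]

def nearest_centroid(train_X, train_y, x):
    """Predict the label whose class centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [mean(c) for c in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def loocv_accuracy(X, y):
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        mus, sds = fit_scaler(train_X)            # fitted without the held-out sample
        train_Z = [apply_scaler(r, mus, sds) for r in train_X]
        test_z = apply_scaler(X[i], mus, sds)     # held-out sample only transformed
        correct += nearest_centroid(train_Z, train_y, test_z) == y[i]
    return correct / len(X)

# Toy, well-separated data (hypothetical; two features, three samples per class).
X = [[1.0, 10.0], [1.2, 11.0], [0.9, 9.5], [3.0, 20.0], [3.2, 21.0], [2.9, 19.0]]
y = [0, 0, 0, 1, 1, 1]
acc = loocv_accuracy(X, y)
print(acc)
```

Fitting the scaler on all samples before the loop would let the held-out sample influence its own preprocessing, which is the kind of leakage the fold-internal design is meant to rule out.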
- Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation
Adaptive context selection is critical for retrieval-augmented generation (RAG) systems, as fixed Top-K retrieval fails under query-dependent and heavy-tailed similarity distributions. While Extreme Value Theory (EVT) offers a principled framework for adaptive truncation, existing approaches apply E...
- FAST-MEL: A Fast, Accurate, and Storage Efficient Solution for Multimodal Entity Linking
Multimodal entity linking (MEL) is the task that consists of matching textual and visual mentions of entities in unstructured data to their corresponding entities in a knowledge base (KB). To be effective in large-scale practical settings, MEL systems must meet three objectives: high linking accurac...
- CompRank: Efficient LLM Reranking via Token-Level Compression and Decoding-Free Scoring
Large language model (LLM) rerankers have become an important component of modern retrieval and retrieval-augmented generation pipelines, but their high computational cost limits their applicability to long candidate lists. In this paper, we propose CompRank, a token-efficient reranking fra...
- DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors
High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of ...
- CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding
Code retrieval is becoming central to coding agents, but agentic coding requires more than matching a natural-language query to an isolated snippet. Given a user request, a coding agent needs to navigate a concrete repository state, locate relevant files and functions, gather supporting context, and...
- What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study
We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is realizable as a result of top-$k$ retrieval by some query vector. Recent work shows that $d = O(k)$ suffices for such embeddings to exist in $\mathbb{R}^d$, inde...
- POISE: Spectral Inference of Parent-of-Origin Effects in Unlabeled Genomic Data
Motivation: Parent of Origin Effects (POEs), where the effect of an allele on a phenotype differs based on maternal or paternal inheritance, are implicated in growth, metabolism, and neurodevelopment. Traditional tests for POEs require family data to determine parental origins of transmitted alleles. Given that such studies are expensive and time consuming compared to genome-wide association studies (GWAS), tests that function absent inheritance information are highly desirable. We develop a method, based on community detection from machine learning, that infers POEs via a spectral decomposition, obtains confidence intervals via a non-parametric bootstrap, and safeguards against confounding by non-POE sources of variation. We refer to our method as Parent of Origin Inference via Spectral Estimation (POISE). Results: We demonstrate that POISE is well-calibrated under both Gaussian and heavy-tailed noise in simulation studies, with improved robustness to true POEs compared to existing covariance-based tests. POISE provides per-trait effect estimates with bias-corrected bootstrap confidence intervals and incorporates an information-theoretic minimum detectable effect size that filters unreliable estimates, conferring robustness to covariance-deflating variance QTL. We then apply POISE to GWAS data from the UK Biobank using BMI, LDL cholesterol, and HDL cholesterol. POISE recovers established POE loci and identifies 134 additional variants at genes implicated in lipid metabolism, immune regulation, and growth. Availability and implementation: The code for this method in Python is available at https://github.com/bystrogenomics/POISE.
- Wnt/β-catenin signaling regulates fibrotic atrophy of intra-articular adipose tissue in post-traumatic osteoarthritis
Synovial joints like the knee are home to adipose tissue depots whose anatomy and functions are closely intertwined with those of other intra-articular soft tissues such as synovium, underscoring the growing understanding that joints are multi-tissue organs. Traumatic joint injury and the onset of osteoarthritis (OA) dramatically remodel the intra-articular adipose niche, marked by infiltration of fibrotic tissue postulated to underpin OA-associated joint stiffness and pain, yet we know very little about the disease-associated dynamics of joint adipose remodeling or the mechanisms driving these phenomena. Here, we employed 2D histomorphometry and spatial transcriptomics, alongside 3D osmium tetroxide-enhanced micro-computed tomography, to comprehensively define the spatiotemporal, structural, and transcriptional rewiring of joint adipose tissue in a non-invasive mouse model of post-traumatic osteoarthritis (PTOA). These revealed marked loss of intra-articular adiposity accompanied by expansion of fibroblast-rich, collagen-dense tissue with pro-fibrotic hallmarks and Wnt/β-catenin-enriched gene programs. Joint adipose exhibited a distinct transcriptional signature compared to subcutaneous white adipose tissue, pointing to unique, depot-specific functions. Stromal cells isolated from PTOA joints had heightened baseline expression of fibrotic and Wnt pathway genes and exhibited impaired de novo adipogenesis, in contrast to cells derived from healthy joints. In accordance with the destabilized biomechanics of PTOA joints, in vitro modeling demonstrated that prolonged, injurious loading and perturbed Wnt/β-catenin signaling were convergent anti-adipogenic cues that suppressed lipid droplet formation and adipogenic gene induction, while promoting markers of fibrosis in joint-derived stromal cells.
Complementary gain-of-function studies using ex vivo joint adipose explants and in vivo joint injections demonstrated that chronic Wnt/β-catenin activation, as seen in OA joints, is sufficient to diminish the intra-articular adipogenic program and shift adipose to a more fibrotic phenotype, independent of joint injury. Collectively, these findings establish a multi-modal framework for quantifying joint adipose atrophy and implicate aberrant Wnt/β-catenin signaling and pathological mechanical loading as key factors impairing de novo adipogenesis and driving fibrotic remodeling of intra-articular adipose tissue in PTOA.
- Exploratory Assessment of Pulsed-Wave Doppler Representations of Lung Sounds Using Deep Learning: An In-Vitro Phantom Study
The increasing availability of portable ultrasound systems motivates exploration of novel approaches to respiratory signal assessment. In this in-vitro study, we investigate whether pulsed-wave (PW) Doppler ultrasound can capture structured spectral patterns from replayed lung sound recordings. Digitized respiratory sounds were replayed through a tissue-mimicking ultrasound phantom, generating 1,478 PW Doppler spectral images from recordings associated with healthy subjects and several externally labeled disease categories. Exploratory classification experiments using a ResNet-18 architecture demonstrated that these Doppler representations contain learnable differences under controlled conditions. These findings motivate further investigation into PW Doppler as a potential representation of respiratory acoustics.
- A Heterogeneous Graph Neural Network Framework for Multi-Horizon Stroke Mortality Prediction
Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons. Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph. Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) survived their stroke after one year. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction.
StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity. Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.
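The "cases identified at ≥75% specificity" comparison above amounts to reading sensitivity off the ROC curve under a specificity floor. A minimal sketch of that computation follows; it is an assumption about how such a figure could be derived, not the authors' code, and the function name, scores, and labels are hypothetical.

```python
def sensitivity_at_specificity(scores, labels, min_specificity=0.75):
    """Highest sensitivity over thresholds whose specificity meets the floor.

    A sample is predicted positive when its score is >= the threshold.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        specificity = sum(s < t for s in neg) / len(neg)
        sensitivity = sum(s >= t for s in pos) / len(pos)
        if specificity >= min_specificity:
            best = max(best, sensitivity)
    return best

# Hypothetical risk scores; label 1 marks a death within the horizon.
scores = [0.1, 0.2, 0.3, 0.4, 0.35, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1, 0]
result = sensitivity_at_specificity(scores, labels)
print(round(result, 3))  # 0.667
```

Comparing two models with this function on the same test set gives the difference in percentage points of mortality cases captured at a matched false-alarm budget.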
- A Three-Tier Operational Benchmark for Evaluating Large Language Models on Hospital Medication Safety
Objective. To introduce PsiBench, a clinically validated medication-safety benchmark for evaluating large language models (LLMs) against the standards used to certify hospital computerized provider order entry (CPOE) and electronic health record (EHR) systems, and a non-overlapping three-tier evaluation framework separating highest-stakes discrimination, the operational CDS regime, and category-correct alerting. Materials and Methods. PsiBench comprises 492 medication-safety scenarios across 11 safety categories, created by clinical pharmacology experts whose work underpins an annualized testing procedure used by more than 2,000 U.S. hospitals. The three-tier framework partitions the scenarios non-overlappingly: Discrimination (98 scenarios, 50 fatal vs 48 deception, near-balanced 51%/49%); Operational (394 scenarios, 261 serious unsafe plus 133 safe including 41 Excessive Alerts reclassified as operational negatives); and Attribution (311 alert-required scenarios). We evaluated 40 frontier LLMs from 10 providers over 3 runs per scenario at temperature 0.2 (or the provider default where temperature is not configurable), yielding 59,040 evaluations conducted April 21-23, 2026. Results. Headline binary performance on the full benchmark spans a wide range across the 40 models: F1 78.5%-92.3%, accuracy 65.4%-89.8%, sensitivity 81.4%-100.0%, specificity 6.1%-81.8%. Leading models by F1 (o4-mini 92.3%; o3 92.2%) pair high sensitivity with meaningful specificity; three models saturate sensitivity at 100% but fall below 25% specificity, indistinguishable from a naive always-alert classifier. The wide spread on a single headline metric motivates tier-specific analyses, developed in a separate clinical paper. Discussion and Conclusion. PsiBench and the three-tier framework operationalize a rigorous evaluation rubric for LLM medication safety, grounded in two decades of national hospital audit experience. 
The framework generalizes to any binary medication-safety classifier (rule-based, conventional ML, or LLM-driven), supporting tier-aware model selection and post-deployment surveillance.
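The counts in the abstract can be cross-checked, and the "naive always-alert classifier" comparison made concrete, with a short sketch. The tier sizes and evaluation count come from the abstract; treating the 311 alert-required scenarios as the positive class and the remaining 181 as negatives is an assumption made here for illustration, as is the helper name `binary_metrics`.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy, and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "accuracy": accuracy, "f1": f1}

# Tier sizes as reported in the abstract.
discrimination = 50 + 48      # fatal + deception
operational = 261 + 133       # serious unsafe + safe
total_scenarios = discrimination + operational
evaluations = total_scenarios * 40 * 3   # 40 models, 3 runs per scenario

# Assumption: positives are the 311 alert-required scenarios (50 fatal + 261 serious unsafe).
positives = 50 + 261
negatives = total_scenarios - positives

# A classifier that alerts on every scenario.
always_alert = binary_metrics(tp=positives, fp=negatives, tn=0, fn=0)
print(total_scenarios, evaluations, {k: round(v, 3) for k, v in always_alert.items()})
```

Under that assumed class split, always alerting yields perfect sensitivity and zero specificity, so a model that saturates sensitivity while keeping specificity low adds little beyond this baseline.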
- Assessment of the accuracy of lung lesions diagnosis in adolescents with osteosarcoma using artificial intelligence
Background. Lung metastases in osteosarcoma (OS) are the main cause of death. Accurate diagnosis of nodules on computed tomography (CT) of the lungs is critically important for determining the disseminated stage of the disease and planning surgical treatment. The use of artificial intelligence (AI) in the search for lung nodules increases the accuracy of diagnosis and reduces the chance of missing metastases. Objective: to evaluate the accuracy of lung nodule diagnosis in adolescents with OS using AI. Methods. A retrospective assessment of CT scans of adolescents with OS was performed. A pathological nodule with an average size of ≥4 mm was considered a target finding. The diagnostic accuracy of an AI algorithm previously trained on an adult dataset was evaluated, and the number of false positives (FP) and false negatives (FN) was determined. Sensitivity, specificity, accuracy, area under the ROC curve (AUC), positive predictive value, negative predictive value, and F1-measure were calculated. Based on the obtained results, the effectiveness of the algorithm was assessed. Results. 248 CT scans of adolescents with OS were evaluated. The following results were obtained: in 5 cases, the AI algorithm showed a FP result (2.02%), in 34 cases, it showed a FN result (13.71%), and in 209 cases, a correct result (both true positive and true negative) (84.27%). The diagnostic accuracy of the algorithm was 0.843 (95% CI 0.794-0.887). Applying the AI algorithm in a radiologist's practice for this specific clinical task would increase sensitivity from 0.805 to 0.891, with an absolute decrease in the number of FN results of 8.59% and a relative decrease of 44%. Conclusion. The obtained results confirm the practical value of the application of the AI algorithm and justify the implementation of AI-assisted systems in the diagnostic protocols for lung metastases in adolescents with OS.
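The reported percentages follow directly from the counts in the Results, which a few lines of arithmetic can confirm. The variable names below are illustrative; the sensitivity comparison uses the rounded values 0.805 and 0.891 from the abstract, so the absolute reduction comes out near, not exactly at, the reported 8.59%.

```python
total_scans = 248
false_positives = 5
false_negatives = 34
correct = 209

# The three outcome counts should account for every scan.
assert false_positives + false_negatives + correct == total_scans

accuracy = correct / total_scans
fp_pct = 100 * false_positives / total_scans
fn_pct = 100 * false_negatives / total_scans

# Miss rate is 1 - sensitivity; compare without and with AI assistance.
miss_before = 1 - 0.805
miss_after = 1 - 0.891
absolute_reduction = miss_before - miss_after
relative_reduction = absolute_reduction / miss_before

print(round(accuracy, 3), round(fp_pct, 2), round(fn_pct, 2))
print(round(absolute_reduction, 3), round(relative_reduction, 2))
```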
- General-purpose large language models can achieve physician-level accuracy in complex medical data extraction
Background: Unstructured data represent about 80% of total electronic health records (EHR) data. Structuring this free text is essential for advancing clinical research, including cohort selection for trials, retrospective studies, and the development of disease registries. While manual chart review (MCR) remains the gold standard for extracting this clinical data, the process is inherently slow, resource-intensive, and susceptible to errors from human fatigue. We evaluated the extraction accuracy, safety, and efficiency of the HeLIX (Hepatology Logic-Integrated Extraction) framework, a Large Language Model (LLM) protocol using Google Gemini 3 Pro, compared to a gold-standard Manual Chart Review (MCR). Methods: A prospective validation study was conducted using 50 high-complexity, simulated hepatology discharge summaries designed to replicate the real-world heterogeneity of EHRs. The HeLIX framework employed a Zero-Shot, Structured Chain-of-Thought (CoT) prompting strategy enforced by a three-layer architecture: Clinical Reasoning Trace, Schema Enforcement, and Evidence Verification. The model extracted 45 distinct clinical variables. Performance was benchmarked against a consensus MCR. Results: Across 2,250 evaluated data points, the model achieved an overall Extraction Accuracy of 99.24% (95% CI: 98.8%-99.5%), with perfect concordance in 35/45 (77.8%) variables. For binary diagnostic variables, the model demonstrated an overall F1-score of 0.98, Recall of 0.99 and substantial inter-rater reliability (Cohen's κ = 0.97). Hallucinations were exceptionally rare (2/2250; 0.08%). Critical errors affecting clinical management occurred in only 2 instances (<0.1% of total data), both involving etiological misattribution in complex multifactorial diagnoses. The AI workflow was 13.4-fold faster and 95.1% more cost-effective than manual extraction.
Conclusion: The HeLIX framework demonstrates physician-level accuracy and reliability in extracting complex hepatology data. It offers a scalable, efficient, and economical alternative to manual chart review. Such frameworks could accelerate clinical research, enabling healthcare systems globally to build comprehensive patient registries for a fraction of the traditional cost.
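The headline accuracy figures can be sanity-checked from the counts in the abstract. The sketch below assumes a Wilson score interval, which the abstract does not name as the method used, and infers the number of correct data points from the reported 99.24%; `wilson_interval` is an illustrative helper, not part of the HeLIX framework.

```python
from math import sqrt
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

n_points = 50 * 45                      # 50 summaries x 45 variables
n_correct = round(0.9924 * n_points)    # inferred from the reported accuracy
low, high = wilson_interval(n_correct, n_points)

print(n_points, n_correct)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

Under these assumptions the interval rounds to the 98.8%-99.5% range quoted in the abstract, and 50 x 45 gives the 2,250 data points reported.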
- Towards the Virtual Amyotrophic Lateral Sclerosis Patient: Inferring Cortical Excitability through Whole-Brain Dynamical Modeling
Amyotrophic lateral sclerosis (ALS) is increasingly recognized as a multisystem neurodegenerative disorder in which motor-neuron degeneration is accompanied by widespread alterations in cortical dynamics. Among its most reproducible neurophysiological signatures is cortical hyperexcitability, yet how this local excitability imbalance shapes distributed whole-brain activity remains poorly understood. Here, we combined source-reconstructed resting-state MEG data, tractography-informed whole-brain modeling, and simulation-based inference to investigate whether ALS-related alterations in large-scale brain dynamics can be mechanistically explained by changes in cortical excitability. First, we characterized empirical brain dynamics using complementary features spanning regional activity amplitude and variability, functional connectivity, and avalanche-based metrics. These analyses revealed significant alterations in ALS patients relative to healthy controls, as well as associations with clinical impairment and disease staging. To mechanistically interpret these changes, we employed a reduced Wong-Wang whole-brain model in which local recurrent excitation modulates emergent large-scale neural dynamics. Simulations showed that increasing excitability systematically reproduced the empirical dynamical signatures observed in ALS. We then applied a simulation-based inference framework to estimate latent excitability parameters directly from empirical observations. Whole-brain model inversion revealed increased excitability in ALS patients compared with controls. The recovered excitability parameter was associated with disease staging, supporting its clinical relevance as a model-derived descriptor of ALS progression. Finally, by extending the model to estimate frontal and non-frontal excitability separately, we found that ALS-related alterations were predominantly associated with increased frontal excitability, whereas non-frontal regions appeared comparatively less affected. 
The recovered parameters were also related to disease staging. Together, these findings provide a mechanistic framework linking altered large-scale brain dynamics in ALS to selective cortical hyperexcitability, explaining how local excitability changes can give rise to global network reorganization. More broadly, they show how computational model inversion can recover latent multiscale pathophysiological processes from empirical neural recordings, offering a non-perturbative alternative to complex experimental paradigms typically required to causally probe local-to-global mechanisms.
- Development of a Novel Blood-Based Assay for Brain-Derived Tau and Its Validation in Traumatic Brain Injury
Brain-derived tau (BD-tau) is an emerging blood-based biomarker for neurodegeneration, yet few well-validated BD-tau assays are currently available for research and clinical use. To enhance access to this vital biomarker for neurological disorders including traumatic brain injury (TBI), we developed a novel blood-based immunoassay for BD-tau on the ultra-sensitive Quanterix HD-X platform using Single Molecule Array technology. Analytical validation assessed dilution linearity, specificity, precision, detection limits, and spike recovery, each yielding robust metrics in agreement with international expert recommendations. The assay achieved between-run stability of 95% when analyzing aliquots from six independent plasma and serum samples across five analytical runs. It also showed strong dilution linearity when diluted four-fold and achieved over 90% recovery when spiked with cerebrospinal fluid. Next, we evaluated the clinical utility of the assay in cohorts of individuals with TBI, where strong performance was recorded with both the 2-step and 3-step assay formats (ρ = 0.94; p < 0.0001). Furthermore, plasma BD-tau distinguished samples from TBI patients based on time from injury and severity (AUC=0.93). Plasma BD-tau differentiated between favorable and unfavorable functional outcomes in the acute-severe group. Our findings underscore the significant potential of the BD-tau assay as a biomarker for TBI in the severe phase.
- Trajectories of brain structure and function in young adult carriers of genetic frontotemporal dementia variants
Background and Objectives: Converging evidence hints at neurodevelopmental effects in genetic frontotemporal degeneration (FTD). In cross-sectional studies, for some genes, young adult FTD variant carriers show differences in brain volumes and cognition compared to familial non-carriers. However, longitudinal trajectories may more sensitively capture FTD-related neurodevelopmental vs. neurodegenerative changes than cross-sectional approaches. This study examined longitudinal trajectories of brain volumes, executive function, and plasma biomarkers in young adult carriers compared to familial non-carriers, as measures of neurodevelopmental and neurodegenerative outcomes of FTD-causing variants. Methods: This longitudinal cohort study comprised participants, aged 18-30 years, from the FTD Prevention Initiative across Europe, Canada, and the USA. Genetic groups included C9orf72 (47%), MAPT (30%), and GRN (23%). Linear mixed-effects models were computed to assess longitudinal outcomes across age between groups, controlling for sex, scanner (for brain volumes), and education (for executive function); random effects accounted for between-subject variability nested within family membership. Results: Variant carriers (n=147) and familial non-carriers (n=113) did not differ in age (mean ± SD, 25.9 ± 3.2 years), sex (53% female), or number of visits (2.1 ± 1.7). Young adult C9orf72 repeat expansion carriers exhibited smaller thalamic volumes than non-carriers at the reference age of 26 years (b=-982.8 mm³, SE=317.0, p=0.0046, f²=0.32), with relatively stable trajectories across ages 18-30 (i.e., no change over time). Trajectories of rostral anterior cingulate volumes differed in C9orf72 carriers and non-carriers across age, where carriers showed relatively stable trajectories and non-carriers showed age-appropriate declines (b=64.4 mm³, SE=29.9, p=0.035, f²=0.07).
For MAPT and GRN, there were little to no differences in total brain, cortical, or subcortical volumes between groups and over time. No longitudinal differences were observed between carriers and non-carriers in executive function, or plasma NfL or GFAP for any genetic group. Discussion: C9orf72 repeat expansions were linked to smaller average thalamic volumes and stable trajectories between ages 18 to 30, supporting potential neurodevelopmental origins. The modest evidence supporting an absence of difference in neurodegenerative biomarkers and executive function suggests minimal early neurodegeneration and functional preservation in young adulthood.
- DocLang aims to make documents readable by AI, not humans
AIs struggle to understand documents designed for humans; the DocLang working group seeks to flip that imbalance with its specification for machine-readable business documents “built from the ground up for LLM tokenizers.” The working group, founded by IBM, Nvidia, and Red Hat and hosted by the Linux Foundation’s LF AI & Data project, aims to create an open, universal, AI-native document format designed to improve how enterprises prepare, exchange, and govern document data for AI systems. ABBYY and Human Signal will also be involved in its development, and other contributors are welcome. “Enterprises today work across a fragmented landscape of document formats, including PDFs, JPEGs, and other file types built primarily for human consumption rather than AI interpretation,” the group said in its launch announcement. “This disconnect can introduce complexity, raise costs, and reduce reliability when extracting meaning from business documents,” as organizations increasingly rely on generative AI and agentic systems, it said. Mark Collier, executive director of LF AI & Data, said the goal of the DocLang Specification Working Group is to “develop a vendor-neutral, interoperable standard that helps organizations prepare document data for AI more reliably, transparently, and at scale.” DocLang defines a structured, machine-readable format for documents of any type, like JSON for data, that any tool can implement and any pipeline can consume. It builds on DocLing, a document processing toolkit hosted by LF AI & Data that can transform human-readable PDFs, word processor documents or spreadsheets into structured data.
Standards must evolve for AI
Something like DocLang is needed, said independent technology analyst Carmi Levy.
“Existing document standards have done an admirable job allowing global stakeholders to confidently collaborate for decades, but it’s becoming increasingly clear that they are in desperate need of an update as AI reshapes the rules around how work gets done,” he explained. Largely static document types, he said, “can be somewhat limiting when AI is redefining the very word, ‘document.’ In many ways, AI-age documents are far more iterative and dynamic than what they once were, and the definitions need to evolve with the times. The documents we currently live with simply weren’t designed for the AI age.” Within that context, Levy said, “DocLang represents an early, best hope of achieving some kind of foundational baseline for document standards, one that will hopefully allow more intelligent, more efficient, lower-risk workflows than is currently the case.” Taking an open-source, vendor-agnostic approach to the process ensures the collective will take precedence over the needs of specific vendors, he said, adding, “earlier standards-setting efforts around networking, documentation, the web, and the cloud powered the free-flowing digital landscape that defines modern life.” An AI-centric documentation standard will carry that reality into the next generation of technology, said Levy.
A question of governance
The entire concept of LLMs, Jason Andersen, principal analyst at Moor Insights & Strategy, said, “involves using natural human languages. The computer is supposed to understand us without us changing our syntax or language. Forcing a syntax on users is exactly what we have today with SEO and more advanced programming languages.” With something like DocLang, where the standard can be applied to content ingestion, he said, “I would be OK with that being automated, which seems to be the intent.
The use case I envision is that when I upload a document to an agent, a skill can be run to preprocess the document into the DocLang standard format, saving tokens.” That makes sense, he said, adding that he thinks it’s good “if it can help generate outputs, like a visualization, that can be shared outside an AI tool. On that front, that is also why I am liking Web MCP, since you are just adding some code to the page, like CSS or JavaScript, and the consumer, in this case, an AI browser or skill, is better equipped to handle the site.” The point, he said, is, “these standards need to preserve the fact that humans can still do what they want, and do not need to know any coding to be proficient. In terms of governance, I am not sure if it matters.” But one analyst did foresee governance problems arising from DocLang’s use. Yaz Palanichamy, senior research analyst at Info-Tech Research Group, said DocLang adoption will require organizations to implement and review controls in order to scale its use accountably and securely. This article originally appeared on CIO.com.
- VibeSafe
Security audit + AI guardrails for vibe-coded apps
- Cortical activity during narrative discourse production in individuals with post-stroke aphasia and controls measured via functional near-infrared spectroscopy
Introduction: Aphasia is an acquired language disorder with a significant negative functional impact. Much of the research on aphasia has focused on word-level language comprehension and production. Further evaluation of discourse-level tasks, both at behavioral and neural levels, will allow for an ecologically valid understanding of the functional implications of language impairment in this population. Method: This study evaluated bilateral frontal, temporal, and parietal cortical activity during computer-based narrative production in 14 young neurotypical individuals, 17 individuals with post-stroke aphasia, and 15 age-matched neurotypical participants using functional near-infrared spectroscopy (fNIRS). Oxygenated hemoglobin (HbO) was measured during narrative production following short video clips and compared to HbO during counting aloud. In addition, behavioral measures quantifying in-task performance were correlated with averaged HbO values. Results: Young neurotypical individuals showed greater cortical activity in bilateral language regions for narrative production compared to counting aloud. In contrast, people with aphasia showed positive condition-related effects in the right frontal ROI and the age-matched group showed positive condition-related effects in the left frontal and right precentral ROIs. Each group showed different patterns in relationships between cortical activity and discourse performance measures. Conclusion: Overall, young participants showed more consistent condition-related effects for narrative discourse production than individuals with aphasia and age-matched controls. This study shows the potential for fNIRS to evaluate cortical activity for ecologically valid language tasks in individuals with post-stroke aphasia.