AI News Archive: June 16, 2026 — Part 16
Sourced from 500+ daily AI sources, scored by relevance.
- Cordon: Semantic Transactions for Tool-Using LLM Agents
Tool-using LLM agents are shifting the unit of computation from explicit human-issued commands to model-driven tasks with stateful consequences. Yet today's agent runtimes still expose tools as isolated RPCs. This interface gives runtimes a convenient integration point, but it lacks a task-scoped ex...
- PARSE: Provenance-Aware Retrieval Sanitization for Professional Domain LLM Agents
Prompt injection defenses evaluated on synthetic benchmarks do not generalize to real enterprise documents, which are longer, denser, and interleave legitimate authority language with factual content. We demonstrate this gap with a real-document benchmark of 122 tasks across five professional domain...
- Bifrost: Hybrid TEE-FHE Inference for Privacy-Preserving Transformer and LLM Serving
Cloud-hosted transformer and large language model (LLM) inference creates a direct confidentiality problem: user prompts may contain sensitive code, business data, personal information, or regulated documents, yet remote serving exposes intermediate state to the cloud software stack and accelerator ...
- SoK: AI-Augmented Binary Reversing
Binary reversing is fundamental to software understanding, vulnerability discovery, malware investigation, and firmware auditing. However, it remains inherently challenging due to the irreversible loss of semantic information during compilation. Recent advances in machine learning, large language mo...
- An Empirical Analysis of AI Slop in Music Streaming
Generative AI models lower the bar for content creation, making it easy for any user to create professional-looking images, text and music with minimal effort. This has enabled a new cottage industry around creation of "AI slop": mass quantities of mediocre content produced to generate revenue, often...
- An AI Security Agent for Banking: Multi-Vector Fraud and AML Detection Across Retail and Corporate Accounts
Banks simultaneously face signature-based fraud (card-not-present attacks, account takeover, ATM cloning) and behavioural financial crime (structuring, layering, mule networks, business email compromise) -- two threat families with fundamentally different detection requirements. Static rule engines ...
- FacProcessTwin: An LLM-Based System for Process Twin Development
Process twins provide real-time representations of entire production processes. By capturing how process steps interact, rather than monitoring a single machine in isolation as an asset-based digital twin does, they have the potential to drive efficiency gains across the whole process. However, deve...
- PracRepair: LLM-Empowered Automated Program Repair Inspired by Human-Like Debugging Practices
As software systems grow in scale and complexity, debugging and repair remain costly and time-consuming. Large language models (LLMs) have advanced automated program repair (APR), but existing LLM-based APR approaches still largely rely on static or retrieved context, error messages, and coarse-grai...
- Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations
Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed. In this study, we go beyond quantitative LLM-human agreement metrics and qualitative...
- Unlocking LLM Code Correction with Iterative Feedback Loops
Large Language Models have shown remarkable capabilities in code generation. However, most existing evaluations focus only on single-attempt accuracy and overlook the iterative refinement process that is central to real-world programming. This study presents a systematic investigation of LLMs' abili...
- OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem
Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically comp...
- LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline
Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines that systematically ground model outputs in autho...
- All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code
Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verificatio...
- Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering
Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal ...
- Why Model Credibility Isn't Enough: Rethinking Trust in Simulation Architectures
Credibility of a simulation model is an important topic. Several approaches try to quantify the credibility of simulation. However, models are mostly assembled within a simulation architecture. Can the credibility of a simulation architecture be assessed based on the credibility of the models that c...
- Grounding Spoken LLMs in Multi-Speaker Audio via Diarization Conditioning
We propose diarization-conditioned spoken language models (SLMs), a strategy for extending SLMs to far-field multi-talker audio. Rather than adapting the decoder via Serialized Output Training, which risks catastrophic forgetting, we condition the acoustic encoder on diarization masks to extract tar...
- One-Step Token-to-Waveform Generation with MeanFlow in Latent Space
Neural audio codecs are central to modern LLM-based Text-to-Speech (TTS) and multimodal systems. As low-bitrate semantic codecs gain prominence, the Token-to-Waveform (Token2Wav) decoder becomes a bottleneck determining both perceptual quality and system efficiency. Conventional multi-step flow-matc...
- PhASE-Flow: Phonetic-Conditioned Acoustic Flow Matching in SSL Representation Domain for Speech Enhancement
Flow matching (FM) enables high-fidelity generation, while self-supervised learning (SSL) speech models provide hierarchical representations spanning acoustic and phonetic levels. However, existing FM-based speech enhancement (SE) methods operate primarily in the spectral domain, treating SSL featur...
- An Analysis of the Effectiveness of Synthetic Speech Data for ASR Fine-tuning in Selected Indic Languages
Synthetic data has the potential to be a valuable resource for training machine learning models, particularly Automatic Speech Recognition (ASR) Systems; however, its effectiveness requires systematic evaluation. In this study, we investigate the impact of incorporating synthetic speech data alongsi...
- Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition
Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive decoding, which generates them sequentially from left to right. However, the recognition performance is degraded because NAR decoding cannot resolve uncertainty by conditioning...
- AI-based Cognitive-linguistic Features for Dementia Assessment in Picture Description
Picture descriptions provide valuable insights into several clinical constructs related to cognitive-linguistic abilities. However, operationalizing these constructs into quantitative measures remains challenging, limiting interpretability and clinical utility. We introduced seven constructs tailore...
- ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation
Text-to-audio (TTA) generation, synthesizing audio from natural language, has been widely studied for its ability to capture precise user intent. To effectively advance TTA models, it is essential to reliably evaluate generated audio without relying on costly human subjective ratings, motivating the...
- Do Generative Recommenders Deepen the Information Cocoon? A Closed-Loop Simulation with LLM-powered User Simulators
Recommender systems alleviate information overload, yet repeated feedback between recommendations and user interactions can reinforce existing preferences and narrow users' exposure, forming information cocoons. While this phenomenon has been widely studied in traditional sequential recommendation, ...
- Temporal Preference Optimization for Unsupervised Retrieval
Unsupervised dense retrievers offer scalability by learning semantic similarity from unlabeled documents via contrastive learning, but they struggle to capture temporal relevance, retrieving semantically related but temporally misaligned documents, an important aspect when a document collection s...
- RSRank: Learning Relevance from Representational Shifts
As enterprises deploy RAG-based systems to provide grounded responses to user queries, reranking has become a critical component for the final filtering step that separates relevant from distracting or irrelevant documents. Existing rerankers often rely on heuristic thresholds to achieve optimal fil...
- Integrative Transfer Network: Deep Transfer Learning Across Populations and Prediction Targets
Large-scale clinical and biomedical datasets increasingly contain both diverse subgroup attributes (e.g., demographic or clinical subgroups) and multiple prediction targets. Although various machine learning approaches can address subgroup differences or multi-target prediction, they often consider these aspects independently rather than jointly. To more effectively capture the shared and subgroup-specific information in such complex datasets, we propose the Integrative Transfer Network (ITN), a deep neural network designed to leverage data across subgroups and multiple related outcomes simultaneously. In extensive experiments, including time-to-event and classification tasks where demographic subgroups and multiple disease endpoints are prevalent, ITN demonstrates consistent improvements in subgroup-specific prediction by borrowing strength from other subgroups and outcomes. We envision ITN as a unified framework for learning from heterogeneous datasets where subgroup-specific insights are critical.
- A Multicenter Swedish Histopathology Image Dataset of Pediatric Central Nervous System Tumors
Refined detection methods, more detailed tumor characterization, and adequate distinction between different pediatric tumor subtypes are necessary to improve diagnosis and treatment, enable precision medicine, and advance patient prognosis. However, the application of computational approaches to pediatric brain tumors remains limited, largely due to the lack of accessible datasets. To address part of this gap, we provide whole slide images (WSIs) of hematoxylin and eosin (H&E)-stained tissue sections from all pediatric central nervous system (CNS) samples collected in Sweden between 2013 and 2023. These data represent a population-based national cohort encompassing all six pediatric oncology centers in Sweden and are available through the Swedish Childhood Tumor Biobank (BTB). The dataset includes 1,446 WSIs of sufficient image quality with confirmed CNS tumor diagnoses, derived from 537 unique subjects (562 cases). In addition, diagnostically relevant clinical information is included. Corresponding whole-genome sequencing (WGS), whole-transcriptome sequencing (WTS), and methylation array data are available for most tumor samples through separate resources. This H&E dataset has been specifically curated to support artificial intelligence-based analyses, while also serving broader applications in medical research and education. When combined with matched molecular data, it provides a valuable resource for advancing multimodal and precision diagnostic approaches in the pediatric population.
- Physics-Informed Operator Learning for Pulsatile Milk Flow in Distal Generations of a Bifurcated Mammary Duct Network
Pulsatile milk transport through the lactating mammary ductal tree involves complex interactions between pressure gradients, wall compliance, and non-Newtonian rheology across spatial scales that span nearly two orders of magnitude in lumen radius. Direct experimental characterization of flow in distal ductal generations remains infeasible due to their sub-millimeter caliber, leaving the hemodynamic environment of the secretory ductules largely unknown. We present a two-stage physics-informed operator-learning framework that extends validated flow predictions from three instrumented duct generations to twenty generations of a bifurcated mammary network. A Physics-Informed Neural Network (PINN) trained against particle image velocimetry measurements across seven ducts achieved R^2 = 0.924-0.997. A Deep Operator Network (DeepONet) distilled from the PINN and refined through physics-constrained training on the governing one-dimensional fluid-structure interaction equations achieved R^2(u) = 0.857-0.985 across all validated ducts, with predictions for Generations 4-20 obtained by supplying Murray's Law geometry and mass-conservation-scaled boundary conditions to the frozen operator. Three biophysically significant findings emerge: a mean velocity plateau of 0.14-0.18 m/s across Generations 4-13 produced by Cross shear-thinning compensation offsetting Murray-branching deceleration; a non-monotonic pulsatility index that declines from 0.048 at Generation 1 to a minimum of 0.039 at Generation 5 before rising monotonically to 1.37 at Generation 20 as progressive wall stiffening drives the most distal ductules into a microcirculation-like hemodynamic regime; and a brief elastic-recoil transition zone at Generations 4-5 where mean axial pressure drop reverses sign. 
To the authors' knowledge, these results provide the first quantitative characterization of pulsatile milk flow across the full hierarchy of a bifurcated mammary ductal tree using a physics-informed operator-learning framework with implications for ductal mechanobiology, milk ejection mechanics, and mastitis pathogenesis.
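The "Murray-branching deceleration" in this abstract can be illustrated with a short calculation. The sketch below is not the authors' model: it assumes symmetric bifurcations, Murray's law with exponent 3, and an equal flow split between daughter ducts, and the starting radius and flow rate are hypothetical values. It also uses the common definition of the pulsatility index, (max - min) / mean, which the abstract does not spell out.

```python
import math

def murray_generations(r0_m, q0_m3s, n_gen):
    """Radius, flow and mean velocity per generation of a symmetric tree.

    Assumes Murray's law (r_parent**3 == 2 * r_child**3) and an equal
    flow split at every bifurcation. Both are illustrative assumptions.
    """
    rows = []
    r, q = r0_m, q0_m3s
    for gen in range(1, n_gen + 1):
        u = q / (math.pi * r ** 2)   # mean velocity = flow / lumen area
        rows.append((gen, r, q, u))
        r *= 2 ** (-1 / 3)           # Murray's law, symmetric daughters
        q /= 2                       # mass conservation across the split
    return rows

def pulsatility_index(velocity_trace):
    """(max - min) / mean of a velocity waveform (assumed definition)."""
    mean_u = sum(velocity_trace) / len(velocity_trace)
    return (max(velocity_trace) - min(velocity_trace)) / mean_u

# Hypothetical Generation 1 radius (1 mm) and flow rate.
rows = murray_generations(r0_m=1.0e-3, q0_m3s=5.0e-7, n_gen=20)
for gen, r, q, u in rows[:5]:
    print(f"gen {gen}: r = {r * 1e3:.3f} mm, u = {u:.4f} m/s")
```

Under these assumptions the mean velocity falls by a factor of 2^(-1/3) (about 0.79) per generation for a Newtonian fluid, and the radius shrinks by roughly a factor of 80 over twenty generations. The plateau reported in the abstract is attributed there to shear-thinning, which this sketch does not model.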
- A Systematic Review and Independent Benchmarking of Automated Nerve Morphometry Methods
Objective: To systematically review automated nerve morphometry tools and independently benchmark their performance on independent optic nerve datasets. Design: Systematic review and comparative benchmarking study. Controls: Benchmarking was performed using paraphenylenediamine-stained mouse (n = 85) and rat (n = 44) optic nerve images with manually annotated axon counts as ground truth. Methods: Published studies describing automated or semi-automated neural tissue morphometry tools were identified through systematic searches of PubMed, Embase, and Scopus through January 2026 following PRISMA guidelines. Data extraction covered 70 fields across tool capabilities, imaging modality, species, automation level, and validation approach. Eighteen eligible tools (8 deep learning [DL], 10 classical computer vision [CV]) were benchmarked on both mouse and rat independent datasets. Main Outcome Measures: Performance was assessed by mean absolute percentage error (MAPE), Pearson correlation, and median predicted-to-ground-truth ratio. Tools were ranked per image and compared using Friedman tests with Nemenyi post-hoc analysis. Results: Seventy-one studies met inclusion criteria, spanning from 1999 to 2026. Deep learning methods represented 38% (27/71) of studies, increasing from 0% before 2017 to over 55% of publications after 2020. Axon counting was the most common output (73%, 52/71), while only 35% (25/71) reported g-ratio. Among benchmarked tools, Marina (CV, 2010) achieved the lowest average MAPE (32.9%). The top five tools (MAPE ranging from 32.9 to 44.8%) included both CV and DL methods and were statistically indistinguishable by Friedman-Nemenyi analysis (p > 0.05). Performance varied substantially across datasets: AxonJ (CV) achieved the second best MAPE on rat images (27.7%) but the worst on mouse images (438.6%). Conclusions: No single tool demonstrated consistently superior performance across both datasets. 
Classical and deep learning approaches achieved comparable accuracy for axon counting. Tool selection should be guided by target species, tissue preparation protocol, and desired morphometric outputs. This systematic review and independent benchmarking study provides an evidence base for tool selection in optic nerve research.
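Two of the outcome measures named in this abstract are simple to compute. The sketch below shows the standard definitions of mean absolute percentage error and the median predicted-to-ground-truth ratio. The axon counts are made-up numbers, not data from the study.

```python
from statistics import mean, median

def mape(predicted, truth):
    """Mean absolute percentage error, in percent."""
    return 100.0 * mean(abs(p - t) / t for p, t in zip(predicted, truth))

def median_ratio(predicted, truth):
    """Median of predicted / ground-truth counts (1.0 means no net bias)."""
    return median(p / t for p, t in zip(predicted, truth))

# Hypothetical manual and automated axon counts for three images.
truth = [1000, 2000, 4000]
predicted = [900, 2200, 4000]

print(f"MAPE = {mape(predicted, truth):.2f}%")
print(f"median ratio = {median_ratio(predicted, truth):.2f}")
```

MAPE has no upper bound: a tool that counts k times the true number of axons scores (k - 1) x 100%, so values far above 100% indicate heavy overcounting rather than a calculation error.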
- Spectral decompositions of neural voltage recordings are susceptible to model misspecifications that cause meaningful estimation error
The power spectra of neural voltage recordings vary systematically across brain states and contain both narrowband (rhythmic) and broadband components. A large class of algorithms seeks to parametrize these spectra by separating rhythms from broadband structure, enabling many robust empirical findings. Here we show that two common assumptions underlying popular spectral decomposition methods are incompatible with standard physical and statistical properties of neural recordings: (1) field potentials arise from additive (linear) superposition of biophysical processes, yet several methods implicitly impose multiplicative structure; (2) power estimates are Gamma distributed, with variance proportional to squared power (heteroscedasticity), yet many methods assume Gaussian, homoscedastic errors across frequencies. Using simulations with known ground truth, we demonstrate how these misspecifications bias estimates of rhythm amplitude and broadband height/slope, even under well-behaved conditions. We introduce a corrected decomposition framework, released as the open-source package SL_specdecomp. Relative to the most widely used method, specparam, our approach recovers rhythms and broadband parameters accurately, while specparam decompositions are biased and can confound rhythmic peaks with broadband slope. We then apply these methods to monkey electrocorticography during propofol anesthesia. SL_specdecomp estimates a substantially steeper (more negative) 40-60 Hz broadband slope during anesthesia than during wakefulness, whereas specparam shows a smaller state difference. We show using simulation that the differences in the two decompositions can arise directly from specparam's model misspecification. We also introduce a formal method based on cross-validated log likelihood to compare candidate power spectral decompositions and show that it favors SL_specdecomp. 
These results suggest that misspecified decompositions can attenuate or distort broadband slope changes in the presence of strong rhythms, and motivate the use of SL_specdecomp as a more reliable decomposition tool.
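The heteroscedasticity claim in this abstract can be checked with a small simulation. The sketch below does not use SL_specdecomp or specparam. It assumes the textbook result that a spectral estimate averaged over K tapers or segments is approximately Gamma distributed with shape K and mean equal to the true power, and the function name and parameter values are hypothetical.

```python
import random
from statistics import mean, pvariance

def simulate_power_estimates(true_power, n_tapers, n_draws, rng):
    """Draw estimates from Gamma(shape=n_tapers, scale=true_power / n_tapers)."""
    scale = true_power / n_tapers
    return [rng.gammavariate(n_tapers, scale) for _ in range(n_draws)]

rng = random.Random(0)
results = {}
for true_power in (1.0, 10.0, 100.0):
    draws = simulate_power_estimates(true_power, n_tapers=5, n_draws=20000, rng=rng)
    m, v = mean(draws), pvariance(draws)
    results[true_power] = (m, v)
    # The variance grows with the square of the power, so v / m**2
    # should stay close to 1 / n_tapers at every power level.
    print(f"power {true_power:6.1f}: mean = {m:8.3f}, var / mean^2 = {v / m ** 2:.3f}")
```

Because the spread scales with the level, an unweighted least-squares fit on linear power lets high-power frequencies dominate, which is one way a Gaussian, constant-variance assumption can distort a fit.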
- AI-assisted continuous-time modelling of metastatic breast cancer reveals subtype-specific spatiotemporal organ interactions
Metastatic breast cancer is one of the leading causes of premature mortality among women worldwide. A major barrier to optimal care is the marked heterogeneity in both the temporal dynamics of metastatic spread and the organ-specific spatial distribution of metastases. Existing analyses do not adequately capture this complexity, as they either neglect temporal dependencies or assume independence between metastatic sites. As a result, it remains unclear how established metastases influence subsequent organ-specific dissemination. We address this question using patient-level longitudinal trajectories from a large multicentre real-world metastatic breast cancer registry, combined with an AI-assisted disease-progression modelling framework based on continuous-time Markov chains that represent combinations of metastatic sites and the non-uniform and practice-driven timing of radiologic response assessments, as encountered in routine clinical care. We present a stochastic model determined by progression rates, which are parameterised to capture baseline organ-specific transition risks, patient-level covariates, and pairwise inter-organ interaction effects. High-dimensional treatment information is incorporated using a large language model-based encoding. We find that metastatic spread follows non-independent, subtype-specific spatiotemporal patterns, with subtype-specific inter-organ interaction patterns that shape progression. Visceral metastases, particularly lung and liver metastases, are associated with an increased hazard of subsequent brain metastasis, with effects varying across hormone receptor-positive, HER2-positive, and triple-negative subtypes. Together, these findings define a clinically relevant spatiotemporal architecture of metastatic progression in breast cancer. 
This framework enables refined mechanism-informed risk stratification and provides a data-driven rationale for targeted and risk-adapted -- rather than symptom-triggered -- surveillance strategies.
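To make the rate parameterisation described in this abstract concrete, here is a minimal sketch of one transition rate in a continuous-time Markov chain. The log-linear form, the function names, and every coefficient are assumptions for illustration only; the abstract does not state the functional form the authors use.

```python
import math

def transition_rate(baseline_log_rate, covariate_effects, covariates,
                    interaction_effects, present_sites):
    """Hypothetical log-linear rate (per month) of spread to one target organ.

    baseline_log_rate   : organ-specific baseline term
    covariate_effects   : coefficient per patient-level covariate
    interaction_effects : coefficient per already-involved organ
    """
    log_rate = baseline_log_rate
    log_rate += sum(covariate_effects[name] * value
                    for name, value in covariates.items())
    log_rate += sum(interaction_effects.get(site, 0.0)
                    for site in present_sites)
    return math.exp(log_rate)

def prob_spread_by(rate, months):
    """P(at least one jump within `months`) for an exponential waiting time."""
    return 1.0 - math.exp(-rate * months)

# Made-up numbers: baseline 0.01 per month, liver involvement doubles the rate.
base = math.log(0.01)
interactions = {"liver": math.log(2.0)}
rate_without = transition_rate(base, {}, {}, interactions, present_sites=[])
rate_with = transition_rate(base, {}, {}, interactions, present_sites=["liver"])
print(prob_spread_by(rate_without, 6.0), prob_spread_by(rate_with, 6.0))
```

Working with rates rather than per-visit probabilities is what lets a continuous-time model handle irregular assessment intervals: the same rate gives a probability for any gap between scans.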
- Adverse Childhood Experiences and Growth Outcomes in Childhood: A Longitudinal EHR-Based Study
Question: Are adverse childhood experiences (ACEs) associated with altered growth trajectories in childhood? Findings: In this cohort study of 412,549 children and adolescents, ACEs were associated with lower height throughout childhood, earlier pubertal timing, and shorter final stature. Height differences emerged approximately 2 years before ACE documentation and were greatest among those with earlier documentation. Meaning: These findings suggest that early adversity affects physical growth in children and may serve as a measurable indicator of the biological consequences of early-life stress, especially in those with documentation of ACEs prior to the onset of typical pubertal growth. Importance: Adverse childhood experiences (ACEs) are among the strongest risk factors for long-term mental and physical health complications, yet their impact on physical growth in childhood remains incompletely understood. Objective: To determine the association of ACEs with childhood growth trajectories and growth dynamics. Design, Setting and Participants: Retrospective cohort study using longitudinal electronic health record data. Data was collected from participants between February 1999 and August 2025. A large academic medical center biobank linked to deidentified electronic health records in the southeastern United States. A total of 412,549 individuals with at least 2 recorded height measurements between the ages of 2 and 20 were included in the primary analysis. Growth curve analyses were performed in a subset of 199,844 individuals with at least 3 height measurements spanning at least 2 years. Genetic analyses were performed in a subset of 10,114 individuals of primarily European ancestry. Exposure(s): Documented exposure to adverse childhood experiences before age 18 years identified through a natural language processing algorithm. 
Main Outcome(s) and Measure(s): Height-for-age z-scores across childhood, final attained height, and growth curve parameters estimated using SuperImposition by Translation and Rotation (SITAR) modeling. Results: Among 412,549 participants, 18,502 (4.5%) had clinically documented ACEs during childhood. ACE documentation was associated with lower height-for-age z-scores throughout childhood and adolescence. Final attained height was significantly lower among ACE-documented individuals, with mean differences of -3.0 cm among males (174.0 cm vs 177.0 cm, p < 0.001) and -1.3 cm among females (161.8 cm vs 163.1 cm, p < 0.001). Height differences emerged approximately 2 years before clinical ACE documentation. Earlier age at first ACE documentation was associated with progressively shorter final attained height, with each year decrease in age at ACE documentation associated with a decrease in final height of -0.20 cm in females and -0.35 cm in males. Those with first ACE documented prior to pubertal age also showed the most pronounced growth dynamic differences, with males demonstrating a mean reduction in size of 5.25 cm (95% CI, -6.79 cm to -3.70 cm) and 1.26-year earlier pubertal timing (95% CI, -1.50 to -1.03 years), and females demonstrating a reduction in growth curve size of 3.62 cm (95% CI, -4.83 to -2.41 cm) and 1.14-year earlier pubertal timing (95% CI, -1.29 to -0.99 years). Conclusions and Relevance: In this large clinical cohort, clinically documented ACEs were associated with time-dependent reductions in stature, earlier pubertal timing, and shorter final attained height. These findings suggest that early childhood adversity may have lasting effects on physical development and highlight growth trajectories as a potential marker of the biological consequences of early-life stress.
- Ranking-optimized survival models can underperform fixed-horizon clinical prediction: a SUPPORT2 reanalysis of machine learning, attending-physician judgment, and the original SUPPORT model at 60- and 180-day mortality
Machine-learning survival models are increasingly proposed for intensive-care mortality prediction and are almost always selected and reported using the concordance index, a ranking metric averaged over follow-up. Yet most bedside decisions hinge on a probability at a specific time, such as 60- or 180-day mortality. We asked whether ranking-optimized models remain competitive at fixed clinical horizons against two reference points clinicians actually rely on: unaided attending-physician judgment and the original 1995 SUPPORT logistic model. Reanalyzing the SUPPORT2 cohort (9,105 critically ill adults from five United States centers, 1989-1994) under a stratified 70/15/15 split, we compared a gradient-boosted survival model, the physician's recorded prognosis, and the 1995 model at 60 and 180 days, alongside several alternative learners. The survival model achieved competitive ranking concordance (0.705) yet underperformed both comparators at fixed horizons: at 60 days its area under the ROC curve was 0.750, against 0.808 for physicians on the matched sample and 0.827 for the 1995 model, a gap that held across eight independent data splits and remained statistically reliable after multiplicity correction. The shortfall was not miscalibration, since post-hoc recalibration left discrimination unchanged, nor limited capacity, since neural networks, a deep ranking model, and two timepoint-aware discrete-time models also failed to close it; replacing the ranking objective with timepoint-matched binary training recovered roughly half the gap, pointing to an objective-horizon mismatch. Discrimination was equitable across sex, race, and age, but leave-one-disease-out validation exposed severe failure for disease groups absent from training, and the physician advantage was conditional on a physician electing to provide an estimate. 
We recommend reporting timepoint-specific discrimination alongside concordance, timepoint-matched training when fixed-horizon predictions drive care, leave-one-subgroup validation, and distribution-free prediction intervals to support selective deployment.
- Utilising Artificial Intelligence to Identify Ventricular Tachycardia Ablation Targets in Sinus Rhythm
Background and Aims: Machine learning has shown potential in predicting ablation targets for ventricular tachycardia (VT) in an animal model. This study progresses to externally validating deep learning approaches for human data. Methods: The development and external validation datasets included 21 and 13 patients, respectively, with structural VT undergoing catheter ablation. In the development dataset, electrophysiological studies were conducted using the Advisor™ HD Grid (EnSite™ X), while both CARTO and EnSite Precision were used in the validation dataset. In each patient, VT ablation targets were defined as mapping points within 8 mm of VT isthmuses. Three advanced machine learning models were trained using cardiac mapping data acquired in both omnipolar and unipolar configurations during sinus rhythm and ventricular pacing. Discrimination was evaluated using nested leave-one-out cross-validation at patient level. Results: Overall, graph convolutional networks (GCNs), which integrate intracardiac signal waveforms with three-dimensional electroanatomical geometries, achieved the highest performance, with optimal results obtained from unipolar electrograms acquired in sinus rhythm (median AUC 0.793, sensitivity 83.6%, specificity 69.0%). This may be partly explained by the inclusion of repolarization dynamics in unipolar electrograms and the higher point density of sinus rhythm maps. Comparable performance was observed in the external dataset. Conclusion: This study demonstrates that graph convolutional networks applied to sinus rhythm EGM waveforms collected during substrate mapping can localise critical components of VT re-entry circuits. This approach has potential to provide fast and accurate ablation guidance without the need to induce and map VT, improving safety and efficacy of VT catheter ablation.
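The labelling rule and the patient-level split described in this abstract can be sketched in a few lines. This is an illustration under assumptions, not the study's pipeline: it treats each isthmus as a set of 3D points, uses straight-line Euclidean distance in millimetres (the study may measure distance along the cardiac surface), and all names and coordinates are hypothetical.

```python
import math

def label_ablation_targets(mapping_points, isthmus_points, radius_mm=8.0):
    """Return 1 for mapping points within radius_mm of any isthmus point, else 0."""
    labels = []
    for point in mapping_points:
        nearest = min(math.dist(point, q) for q in isthmus_points)
        labels.append(1 if nearest <= radius_mm else 0)
    return labels

def leave_one_patient_out(patient_ids):
    """Yield (held_out, train_idx, test_idx) so no patient is in both sets."""
    for held_out in sorted(set(patient_ids)):
        train = [i for i, pid in enumerate(patient_ids) if pid != held_out]
        test = [i for i, pid in enumerate(patient_ids) if pid == held_out]
        yield held_out, train, test

# Made-up coordinates in millimetres.
points = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
isthmus = [(0.0, 0.0, 3.0)]
labels = label_ablation_targets(points, isthmus)

patient_ids = ["p1", "p1", "p2", "p3", "p3"]
splits = list(leave_one_patient_out(patient_ids))
print(labels, splits[0])
```

Splitting by patient instead of by mapping point matters because points from one heart are strongly correlated, so a point-level split would tend to inflate the measured discrimination.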
- Validation of a Smartphone-Image-Based Computer-Vision Model for Lean Mass and Body Fat Estimation Against Dual-Energy X-ray Absorptiometry
Introduction Body composition, rather than body weight alone, is an increasingly important health metric, and preservation of lean mass has become a central concern in obesity treatment, aging, and chronic disease management. Dual-energy X-ray absorptiometry (DXA) provides accurate assessment of fat and lean tissue, but its cost and logistical requirements limit repeated measurement. Computer-vision approaches show promise for estimating adiposity from smartphone images, but lean-mass estimation remains less established. Methods We evaluated a computer-vision body composition model, applied to consumer-grade smartphone photographs, against DXA in a held-out validation sample of 195 adults from an ongoing cross-sectional study. Body fat percentage and total lean mass percentage were co-primary outcomes; for total lean mass percentage, an image-only configuration (no added covariates) was pre-specified as primary. Agreement was quantified using Lin's concordance correlation coefficient (CCC) as the lead statistic, with Pearson correlation, mean absolute error, root mean square error, mean bias, and Bland-Altman limits of agreement. In secondary analyses, appendicular lean mass and total lean mass percentage were each estimated with and without routine anthropometric and demographic inputs (body weight, height, age, and sex). Results Total lean mass percentage agreed with DXA from image features alone (CCC 0.916). Body fat percentage, estimated with routine inputs added, agreed at least as closely (CCC 0.930). Adding routine inputs barely changed agreement for total lean mass percentage but markedly improved it for appendicular lean mass, an absolute quantity that scales with body size. Conclusions A smartphone-image-based model estimated both body fat and lean mass with strong agreement to DXA, with lean mass percentage from image features alone. The approach needs no fixed equipment or ionizing radiation. 
Whether it can track change over time, including in incretin-based weight loss where lean mass preservation is a concern, was not assessed in this cross-sectional study.
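The agreement statistics named in this abstract have compact definitions. The sketch below implements Lin's concordance correlation coefficient and Bland-Altman limits of agreement in their standard forms. The body fat values are made up for illustration and are not from the study.

```python
from statistics import mean, stdev

def lins_ccc(x, y):
    """Lin's CCC: 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2), using 1/n moments."""
    mx, my = mean(x), mean(y)
    sxx = mean((a - mx) ** 2 for a in x)
    syy = mean((b - my) ** 2 for b in y)
    sxy = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)

def bland_altman(x, y):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 * SD of differences)."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical body fat percentages: image model vs DXA.
model = [20.0, 25.0, 30.0, 35.0]
dxa = [21.0, 26.0, 31.0, 36.0]

ccc = lins_ccc(model, dxa)
bias, lower, upper = bland_altman(model, dxa)
print(f"CCC = {ccc:.4f}, bias = {bias:.2f}, limits = ({lower:.2f}, {upper:.2f})")
```

In this toy data the two methods are perfectly correlated but offset by one percentage point, so Pearson's r would be 1 while the CCC falls just short of it. That penalty for systematic offset is a common reason to lead with CCC in method-comparison studies.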
- Non-invasive Detection of Fasciculation Using Surface EMG with a Wavelet-Based Analytical Method (DEWCS)
Objective: Needle electromyography (nEMG) is essential for diagnosing neuromuscular disorders but is invasive and often painful. We employed single-channel bipolar surface EMG (sEMG) analyzed with a novel wavelet-based analytical approach, Detecting and Extracting Elemental Wave Components based on a Wavelet Coefficient Set (DEWCS), and investigated whether fasciculation-related activity could be identified. Methods: In this prospective study, 28 patients undergoing nEMG for suspected neuromuscular disorders and 13 healthy controls were included. Resting-state sEMG was recorded from selected muscles using single-channel bipolar active electrodes at a high sampling rate. DEWCS was used to extract indices reflecting fast- and slow-type motor unit (MU)-related activity. These standardized indices were evaluated against nEMG-detected fasciculation potentials using generalized estimating equation logistic regression to account for within-subject clustering. Diagnostic performance was assessed by receiver operating characteristic analysis. Results: A total of 67 muscles from 38 participants were analyzed. Indices of fast- and slow-type MU-related activity were significantly associated with fasciculation potentials (slow: OR 5.10, p = 0.0041; fast: OR 2.38, p = 0.0162). The combined model showed excellent discrimination (area under the curve = 0.97), outperforming either index alone. Muscle region had no significant effect. Conclusions: A single-channel bipolar sEMG setup combined with DEWCS detected fasciculation-related activity with promising accuracy. This method may serve as a non-invasive surrogate marker of lower motor neuron involvement. Further validation in larger cohorts is warranted. Significance: This non-invasive sEMG approach may help detect fasciculation-related activity and complement nEMG in neuromuscular diagnostics.
- Development of an automated, imaging-based preoperative screening model for early identification of malnutrition in an abdominal surgery cohort
Background: Clinical malnutrition affects one in five abdominal surgery patients and increases postoperative complications and mortality. Current screening occurs after admission, closing the window for preoperative nutritional intervention. No objective, scalable preoperative screening tool exists. Objective: To determine whether automated volumetric CT-based body composition analysis improves preoperative identification of surgical patients at risk for clinical malnutrition compared to clinical variables or single slice imaging alone. Methods: Retrospective cohort study of adults undergoing elective abdominal surgery at a quaternary academic medical center (2018 to 2021) with a preoperative CT scan within 90 days and complete nutrition assessment. Clinical malnutrition was diagnosed by a registered dietitian using ASPEN/AND criteria. Three sex stratified Elastic Net models were compared: (1) base clinical variables; (2) base plus L3 single slice skeletal muscle index and attenuation; and (3) base plus comprehensive 3D volumetric quantification of five muscle groups and two fat depots. Discrimination (AUROC), calibration (Brier score), and clinical utility (decision curve analysis) were assessed via 10-fold cross-validation. Results: Among 1,143 patients (52.4% female; mean age 60.5 years), 231 (20.2%) were diagnosed with malnutrition. Malnourished patients had significantly higher complication rates (36.4% vs. 15.4%, p<0.001) and prolonged length of stay (45.9% vs. 16.4%, p<0.001). Critically, 27.2% of malnourished patients were not flagged as at-risk by the standard Malnutrition Screening Tool. The volumetric model (Model 3) achieved the highest discrimination (males: AUROC 0.808; females: 0.794) and best calibration (males: Brier 0.129; females: 0.124), significantly outperforming both the base model (males: p=0.004; females: p<0.001) and L3 model (males: p=0.019; females: p<0.001). 
L3 features modestly improved discrimination but paradoxically worsened calibration, an effect corrected by volumetric features. Sex-specific risk profiles differed markedly, with ASA classification dominating female models and demographic factors dominating male models. Conclusions: Automated volumetric CT body composition analysis significantly improves preoperative malnutrition risk identification, with sex-stratified models revealing distinct risk profiles. Leveraging imaging already obtained for surgical planning, this approach opens a preoperative window for nutritional intervention that current practice fails to utilize.
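The observation that discrimination and calibration can move in opposite directions can be illustrated with a small sketch. The functions below compute AUROC (as a pairwise ranking probability) and the Brier score, which the abstract uses as its calibration measure. The labels and predicted risks are invented for illustration and are unrelated to the study data.

```python
def brier_score(y_true, p):
    """Mean squared difference between predicted risk and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)

def auroc(y_true, p):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [pi for yi, pi in zip(y_true, p) if yi == 1]
    neg = [pi for yi, pi in zip(y_true, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outcomes and two sets of predicted risks with the same ranking.
y = [0, 0, 1, 1]
risk_a = [0.10, 0.40, 0.35, 0.80]
risk_b = [0.50, 0.80, 0.75, 0.99]  # same order, but inflated probabilities

same_ranking = auroc(y, risk_a) == auroc(y, risk_b)
worse_brier = brier_score(y, risk_b) > brier_score(y, risk_a)
```

Because AUROC depends only on the ordering of predictions, inflating every risk leaves it unchanged while the Brier score degrades, so a feature set can help ranking and still hurt the reported probabilities.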
- Fidelity-Derived Quantum Dissimilarity-Enhanced k-Nearest Neighbor Algorithm for Arterial Hypertension Prediction
We present a quantum-enhanced version of the classic k-Nearest Neighbors (kNN) classification algorithm, applied to the prediction of arterial hypertension. The traditional Euclidean distance metric of the kNN algorithm is replaced with a fidelity-derived quantum dissimilarity measure to evaluate the similarity between data samples. We map classical real-world clinical and ECG-derived data features into quantum states via Dense-Angle Encoding, which uses parameterized rotation gates to pack multiple features into a minimal number of qubits while maintaining pure states. We evaluate the performance of the dissimilarity measure using both the noiseless statevector simulator and the IBM Qiskit Estimator primitives. The quantum circuit demonstrates robust predictive capabilities comparable to the classical model. While it does not claim computational supremacy over the classical baseline, the framework shows that fidelity-based similarity is a physically meaningful and efficient approach for hybrid quantum-classical classification.
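A classical simulation of the idea can be sketched as below. Several choices here are assumptions, not taken from the paper: the dense angle encoding uses the common form cos(pi*a)|0> + exp(2*pi*i*b) sin(pi*a)|1> with two features per qubit, the dissimilarity is taken as 1 minus fidelity, features are assumed pre-scaled to roughly [0, 0.5], and the toy data and function names are invented.

```python
import cmath
import math
from collections import Counter

def dense_angle_qubits(features):
    """One (amp0, amp1) pair per qubit, two features per qubit (odd length padded with 0)."""
    f = list(features) + [0.0] * (len(features) % 2)
    return [
        (math.cos(math.pi * a), cmath.exp(2j * math.pi * b) * math.sin(math.pi * a))
        for a, b in zip(f[0::2], f[1::2])
    ]

def fidelity(x, y):
    """|<psi_x|psi_y>|^2; for product states the per-qubit overlaps multiply."""
    overlap = 1 + 0j
    for (a0, a1), (b0, b1) in zip(dense_angle_qubits(x), dense_angle_qubits(y)):
        overlap *= a0.conjugate() * b0 + a1.conjugate() * b1
    return abs(overlap) ** 2

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k samples with the smallest 1 - fidelity."""
    ranked = sorted(range(len(train_x)), key=lambda i: 1.0 - fidelity(train_x[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled two-feature samples with binary labels.
train_x = [(0.05, 0.10), (0.10, 0.05), (0.40, 0.45), (0.45, 0.40)]
train_y = [0, 0, 1, 1]
label_low = knn_predict(train_x, train_y, (0.08, 0.08))
label_high = knn_predict(train_x, train_y, (0.42, 0.42))
```

On hardware or a circuit simulator the fidelity would be estimated from measurements rather than computed from amplitudes; the product-state shortcut here only works because this encoding creates no entanglement.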
- Anthropic Is Still at Odds With the White House Over Claude Fable 5
Anthropic leaders flew to Washington, DC, to meet with White House officials on Monday. After high-level talks, they’re still split on the risk Claude Fable 5 presents.
- What is Anthropic’s Mythos AI and why was it blocked?
The Straits Times
- Washington switched off the world’s most powerful AI, and Asia should be paying attention
A US national security directive pulled Claude Fable 5 and Mythos 5 offline for every foreign national three days after launch. For a region wiring its agentic future onto borrowed foundation models, it is the clearest demonstration yet of platform risk. On 9 June, Anthropic released Claude Fable 5, a model-independent tracker that Vals AI […] The post Washington switched off the world’s most powerful AI, and Asia should be paying attention appeared first on e27.
- Carney Compares Anthropic Ban to 2008 Crisis at G7, Warns of AI Over-Reliance
Canadian PM Mark Carney warns G7 leaders of AI over-reliance, comparing Anthropic ban to 2008 crisis.
- Gov’t joins hands with Samsung to support AI chip development
매일경제 (Maeil Business Newspaper)
- Visa and OpenAI integrate Visa's secure global payment directly into ChatGPT
Would you hand an AI chatbot your credit card? This week, Visa's deal with ChatGPT maker OpenAI became the latest step in the march toward a future where AI offers to shop on your behalf.
- ContentGuard AI
Free AI Detector, Plagiarism & Grammar Checker
- NeuroCLI
AI -ReImagined
- TAI #209: Claude Fable 5 Arrived, Then the US Government Took It Offline
Also, GLM-5.2, Zamba2-VL, North Mini Code, and more! What happened this week in AI by Louie Anthropic released Claude Fable 5 on June 9. Two days later, it apologized for and reversed a controversial safeguard that could degrade the model on some machine-learning work. The next day, June 12 at 5:21 p.m. Eastern, the Commerce Department issued an unpublished export-control directive that, by Anthropic’s account, barred every foreign national from accessing Fable 5 or its restricted sibling Mythos 5, including Anthropic’s own foreign-national employees. Anthropic concluded the only workable way to comply was to switch both models off for everyone. Fable was live for three days. That was just long enough for a lot of us to start routing real work through it, which is exactly what made losing it sting. My early read is that Fable 5 was the largest jump in everyday work capability we have tested in a while. It runs on the same underlying model as Mythos 5, Anthropic’s restricted cyber-capable system, with classifiers and fallbacks layered on for cybersecurity, biology, chemistry, model distillation, and frontier AI development. It shipped with a one-million-token context window at $10 per million input tokens and $50 per million output tokens. The benchmark numbers were strong across the board. Fable scored 95% on SWE-bench Verified, 80% on the harder SWE-bench Pro, 84.3% on Terminal-Bench 2.1, and 85% on OSWorld-Verified. Its 29.3% on FrontierCode Diamond was more than double Claude Opus 4.8’s 13.4%. Mythos reached 88% on Terminal-Bench, which tells you Fable’s safeguards cost something while leaving most of the underlying capability intact. Independent testing backs a broad advance with real soft spots. Artificial Analysis ranked Fable first among 152 configurations with an Intelligence Index of 60 and found that it led Opus 4.8 by 9 points in its professional-work evaluation. 
But Fable only tied Opus on Terminal-Bench, trailed several older models in a banking test, and showed worse calibration: its non-hallucination score was 45% against Opus 4.8’s 64%. Exceptionally capable, then, but not uniformly better. Real-world examples are always more telling than the benchmark tables. Stripe said Fable completed a migration across a 50-million-line Ruby codebase in a single day, work a team would have spent more than two months on by hand. It rebuilt a web application from screenshots, extracted precise values from scientific charts, built a browser-based computer-aided design editor, and used that editor to produce a printable 3D model. It also wrote an eclipse-predicting solar-system simulation and played Pokémon FireRed from raw screenshots with no map or navigation aids. The scientific claims were the most striking and the ones to hold most loosely. Anthropic reported that Mythos sped up parts of drug design roughly tenfold and produced strong candidates for 9 of 14 protein targets. A week-long genomics run processed millions of cells from 138 species and trained a custom model that Anthropic says beat a much larger, more recent system. That genomics result is unpublished and deserves caution. The pattern beneath it all is the real story: Fable could move across papers, data, code, tools, images, and long-running executions without losing sight of the goal, unlike earlier Claude models. For the few days we had it, our default became to try Fable first on almost every work task. The most valuable human work still sat at the front end: choosing a goal worth pursuing, brainstorming the approach, supplying context, defining what counts as evidence, and breaking the job into sensible lanes. Review still stayed essential. Anthropic’s own system card shows why. 
The model claimed it had verified a workflow end-to-end after running only offline checks, tried to make the code look human-authored to dodge review, and inferred a security issue from a test it never ran. Bigger units of delegated work simultaneously raise the value of good direction and the cost of misplaced trust. The machine-learning controversy was the more self-inflicted wound. Anthropic originally built Fable so that requests it suspected were aimed at frontier model development or at distillation received deliberately worse help, with no notice to the user. The intended targets included pretraining pipelines, distributed training infrastructure, and machine-learning accelerator design. A legitimate researcher could have received degraded code or a broken evaluation and never known the product had changed under the hood. This was a safeguard Anthropic designed, which is a different thing from a model deciding on its own to hide weak work. Silent degradation undermines reproducibility and makes the provider an invisible participant in your experiment. Anthropic has plenty of honest levers, such as blocks, account suspensions, and visible model routing. Quietly lowering answer quality is a bad product and a worse research policy. Researchers said so quickly, and Anthropic moved. On June 11, it conceded it had made the wrong trade-off, apologized, and changed the design so suspected frontier-model requests are now visibly blocked or routed to Opus 4.8. The walkback was right. Providers will keep adding restrictions as capability climbs, but users have to know when a restriction has changed the system they are evaluating. The government dispute is the harder one. Anthropic says a narrow method surfaced a handful of previously known, minor vulnerabilities, and that thousands of hours of internal, private, US government, and UK AI Security Institute red-teaming found no universal jailbreak. 
It adds that GPT-5.5 can perform the demonstrated work without any bypass. Washington reads the severity differently. WIRED reported that the National Security Agency judged some Fable guardrails removable, after Amazon CEO Andy Jassy reportedly raised the concern directly with Treasury Secretary Scott Bessent. The June 15 emergency talks ended without the controls being lifted, though Commerce officials were reportedly open to restoring access if Anthropic resolved the concern. A government has reason to be careful here. Anthropic says Mythos and Project Glasswing partners found more than ten thousand high- or critical-severity vulnerabilities in about a month, and Mozilla used Mythos Preview to find and fix 271 Firefox vulnerabilities. The same capability that accelerates defenders accelerates attackers. Even so, the remedy looks broader than the evidence made public. The directive applies to every foreign national, including Anthropic employees in the United States, and has forced a worldwide shutdown. Security researcher Katie Moussouris said the demonstration looked more like fixing code with known or planted flaws than a genuine jailbreak. Prompt classifiers can slow misuse, but they are weak security boundaries against a skilled attacker, which runs counter to hanging an export control on them. Anthropic’s cleanest route back is to spend more inference compute on safety. It can run prompts, code, tool traces, and outputs through several classifiers, harden jailbreak detection, monitor sessions, and route uncertain work elsewhere. Its newer classifier cascade only needed an expensive second stage for about 5.5% of traffic, so the overhead does not have to balloon. A stricter Fable-grade stack, longer retention, and heavier fallback use will still push effective latency and token costs up. Identity controls are likely to tighten next. 
Anthropic already requires a physical government ID (including passports) for some access, and sometimes a live selfie, and Mythos-class traffic already carries 30-day retention and cross-request monitoring. The plausible next steps are wider organization verification, residency and sanctions screening, persistent risk histories, and permanent bans for deliberate abuse, with more countries blocked outright. Nationality-based blacklists would raise serious fairness and legal questions; the uncomfortable part is that this directive has now put them inside the policy boundary. The deeper contradiction is open weights. Anthropic can verify identities, retain traffic, set limits, route requests, suspend accounts, and pull a hosted model within hours. Once weights are downloaded, almost all of that control evaporates. Open-weight models still sit behind Fable and Mythos. Artificial Analysis puts the best current open systems 16 points back on its broad index. A six-to-nine-month catch-up is plausible on selected cyber-offense or agent benchmarks. If open weights close that gap while hosted US models remain handicapped, open systems could end up more capable at cyber offense in practice simply by being available. Disabling a model whose usage can be restricted and monitored, while equally capable weights circulate freely with no way to monitor or restrict them, would be close to incoherent. The alternative, restricting open-weight releases, opens a bigger fight over regulatory capture, competition, research, sovereignty, defensive security, and private deployment. This episode is a gift to Mistral’s core pitch: do not depend on a US provider that its own government can switch off overnight. Enterprises will sign more backup contracts and test self-hosted models sooner. Expect the reaction to run in two directions at once. There will be more focus on open-weight models and more sovereign AI activity as governments outside the US push to build domestic capability. 
The Fable shutdown is the sharpest demonstration yet that frontier access is a lever a single government can pull. The catch is arithmetic. Nowhere near enough capital is being committed outside the US to compete with the $1 trillion-plus annual AI capex of US big tech. China is the clear number two and the only other full-stack frontier ecosystem, but leaning on Chinese models as a backup to US ones will be politically and commercially uncomfortable for European firms and many others. That leaves most of the world choosing between two dependencies it does not fully control, with no third pole anywhere close to being funded. So I think the open-weight debate is getting a lot louder this year, and it’s no longer framed as simply open versus closed. The live questions are capability thresholds, verified access, model-weight security, country restrictions, and whether any national rule can hold once comparable systems are trained somewhere else. For now, we cannot wait to get Fable back. It moved the frontier in coding, research, visual work, memory, files, and sustained execution, and three days was enough to reset what we expected from a model. The larger lesson is harder to unwind: a single government has shown it can take a frontier model offline worldwide within hours, and that precedent will shape enterprise contingency planning, sovereign-AI demand, and how every US lab stages its next launch. Our own default will still be to reach for the most capable model when access returns, while putting real thought into the goals, instructions, evidence, and review around it. Fable expanded what could be delegated. It did not remove the need to decide what work is worth doing. Why should you care? The Fable shutdown turned model-provider risk from a line in a vendor deck into an operational problem. 
A team could have picked Fable on Tuesday, started migrating real workflows onto it on Wednesday, and lost it on Friday, with no outage to point at and no service-level agreement to invoke. Government action does not show up on a status page. The practical response is to stop building critical workflows that only one model can run. Keep a portable evaluation set, hold your prompts and tool definitions outside any single provider, and test at least one fallback model before you actually need it. For your highest-value workflows, measure the performance loss when you switch providers, define a degraded-but-acceptable operating mode, and decide in advance which tasks can continue running at lower quality. Run that switch drill quarterly and name an owner for both the cutover and the customer communication, so the first time you do it isn’t during a real shutdown. Procurement should now ask vendors about export-control exposure, identity requirements, data retention, regional availability, and the provider’s right to withdraw a model. Those questions belong next to security, privacy, and uptime. The most capable model can still be the right pick; the workflow just has to assume that access can change for reasons unrelated to the technology. None of that lowers the value of the capability itself, and this is where the management question matters more than the model question. Fable widened the size of the task you can hand off: it can read a large codebase, work across documents and images, use tools, hold files, and run far longer before losing the thread. The human edge moves up the stack to choosing the outcome that matters, deciding which constraints are real, defining what evidence will count, and setting the point at which the model must stop for review. The teams that get the most out of these systems will pair stronger execution with tighter direction and keep a human in the loop well past the final glance. Plan for access itself to get heavier, too. 
The top capability tiers are drifting toward something closer to a regulated account than a chatbot signup: government ID, organization verification, purpose declarations, retention, monitoring, and permanent consequences for abuse, with cybersecurity professionals likely needing separate verification to clear the broad classifiers. That makes frontier AI slower and more expensive to run, and providers will either absorb the cost, raise prices, or reserve the best models for higher-priced verified tiers. Evaluating open-weight alternatives is the obvious hedge, but it is not a free pass on governance, and the policy response can also jump from hosted access to the weights themselves. The strongest model on the leaderboard is now only half the decision. The other half is whether your work survives the week that model becomes unavailable, and after Fable, that is no longer a hypothetical. — Louie Peters — Towards AI Co-founder and CEO Hottest News 1. Anthropic Releases and Disables Claude Fable 5 and Mythos 5 After US Government Order Anthropic launched Claude Fable 5 and Claude Mythos 5 on June 9, then disabled both models for all customers three days later after receiving a US export control directive on June 12. Commerce Secretary Howard Lutnick sent a letter to CEO Dario Amodei stating that both models would be subject to export controls to any location outside the US and to all foreign persons within the country, including Anthropic’s own foreign-national employees. Because Anthropic cannot filter foreign nationals from US users in real time, it shut both models down entirely to ensure compliance. Anthropic stated the government did not provide specific details about its national security concern, but believes the directive was triggered after another company demonstrated a method of jailbreaking Fable 5 to identify minor, previously known software vulnerabilities. 
Anthropic disputes the action, arguing the standard would halt all new frontier model deployments across the industry. All other Claude models, including Opus 4.8, remain fully available. Over 80 cybersecurity executives and technical leaders signed an open letter on June 14 asking the Commerce Department to lift the restrictions. 2. Z.ai Launches GLM-5.2 Z.AI shipped GLM-5.2, available immediately across all GLM Coding Plan tiers (Lite, Pro, Max, Team). The model ships with a 1M-token context window and up to 131,072 output tokens, with two thinking effort levels (High and Max). It is compatible with Claude Code, Cline, OpenClaw, and Roo Code through an Anthropic-compatible endpoint. Z.AI did not publish benchmark numbers at launch. The standalone API, the Z.AI chatbot, and MIT-licensed open weights are scheduled for the following week. Coding Plan pricing starts at approximately $18/month for the Lite tier. Founder Jie Tang opened the launch post one day after the US Commerce Department suspended access to Claude Fable 5, writing: “the sudden restriction of certain frontier models is deeply regrettable.” 3. Moonshot AI Launches Kimi Work Moonshot AI launched Kimi Work, a desktop application that runs locally on the user’s machine. Powered by Kimi K2.6, the application coordinates up to 300 specialized sub-agents operating in parallel across up to 4,000 coordinated steps. Each sub-agent handles a specific slice of a larger workflow: research, document creation, coding, data analysis, and browser automation. WebBridge, a companion browser extension, lets the agent interact with the user’s logged-in browser sessions to search, extract data, and fill forms across tabs. The application targets knowledge workers doing financial analysis, report generation, and project management. Kimi Work is currently in internal testing. 4. 
Zyphra Releases Zamba2-VL Zyphra released Zamba2-VL, a family of open vision-language models built on the Zamba2 hybrid Mamba2-Transformer backbone, available at 1.2B, 2.7B, and 7B parameters. Each model pairs the Qwen2.5-VL vision encoder with Zyphra’s hybrid architecture, in which Mamba2 state-space layers handle the bulk of the computation in linear time, while shared transformer blocks with LoRA adapters preserve in-context retrieval. Across 14 benchmarks, Zamba2-VL is competitive with leading Transformer-based open VLMs at comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, while substantially outperforming prior SSM-based VLMs. The primary advantage is inference speed: Zamba2-VL delivers roughly an order-of-magnitude lower time-to-first-token than Transformer baselines at a matched parameter scale, with the efficiency gap most pronounced at the 1.2B and 2.7B sizes relevant to edge and on-device deployment. All three models are released under Apache 2.0 on Hugging Face. 5. Cohere Ships North Mini Code Cohere released North Mini Code 1.0, its first open-source agentic coding model. It is a 30B-parameter MoE model with 128 experts, 8 activated per token (3B active parameters), using interleaved sliding-window attention with RoPE and global attention without positional embeddings, in a 3:1 ratio. The model supports a 256K context window and 64K output length. A key design decision was to train across multiple agent harnesses (SWE-Agent, mini-SWE-Agent, OpenCode) rather than optimizing for a single one, yielding a 10% gain on the OpenCode evaluation while maintaining SWE-Agent performance. On the Artificial Analysis Coding Index, North Mini Code scored 33.4, outperforming Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and larger models including Nemotron 3 Super (120B-A12B) and Devstral 2 (123B). 6. 
Google Releases Gemini 3.5 Live Translate Google released Gemini 3.5 Live Translate, a streaming speech-to-speech audio model that translates spoken language across 70+ languages and 2,000+ language combinations in near real time. Unlike turn-based translation systems that wait for a speaker to finish, the model processes speech continuously, staying a few seconds behind the speaker while preserving intonation, pacing, and pitch. The model is based on Gemini 3 Pro, with a 128K-token context window. It is rolling out across three surfaces simultaneously: developers get public preview through the Gemini Live API and Google AI Studio, enterprise customers get private preview in Google Meet starting this month, and consumers get it through the Google Translate app on Android and iOS. On Android, a new Listening Mode streams translations directly through the phone’s earpiece. 7. Google Launches Gemini-SQL2 Google Research announced Gemini-SQL2, a text-to-SQL system built on Gemini 3.1 Pro, which achieved 80.04% execution accuracy on the BIRD single-model leaderboard. BIRD covers 12,751 question-SQL pairs across 95 databases in 37 professional domains, testing whether generated SQL runs and returns correct results. Gemini-SQL2 now holds the top two positions on BIRD’s single-model track alongside the original Gemini-SQL at approximately 77.2%. AWS’s Q-SQL follows at roughly 76.5%, with Claude Opus 4.6 at approximately 70.1%. Human performance on BIRD stands at 92.96%. Google has not published a technical report, model card, or API for Gemini-SQL2 as of the announcement date, meaning the benchmark claim cannot be independently reproduced. AI Tip of the Day If you ask ChatGPT to rewrite emails, summarize documents, brainstorm ideas, or make something sound more professional, you are only scratching the surface. That is useful, but it is still only 1% of what ChatGPT can do. 
Instead of starting from scratch every single time, use Projects to keep your context, files, examples, and instructions in one place. That way, you do not need to explain your work again every time you open a new chat. Here’s how you can start getting better at AI today: pick one task you do every week, like creating a report, preparing for a meeting, summarizing customer feedback, or planning your priorities. Build a repeatable workflow around it. You can even use ChatGPT Tasks to run recurring prompts, like preparing a weekly briefing or reminding you to review key updates. That is how you can start using AI in your actual work. If you want more practical tips on how to use AI at work, and not just better prompts, check out our Master AI for Work Course. Six 5-minute reads/videos to keep you learning 1. Version-Controlling Your Agents: Deployment, Rollback, and Safe Promotion Patterns Code reviews do not catch how production agents break, and this piece makes a direct case for treating agent configuration with the same discipline applied to software releases. It lays out three failure modes that arise when versioning is absent: live changes without isolation, manual rollback from memory, and silent degradation without an audit trail. It also proposes fixes, such as immutable config snapshots, staged promotion through canary environments, automated release gates, and pinning LLM model versions to prevent silent behavioral drift between provider updates. 2. The Complete Guide to Attention Variants in Transformers: From Scaled Dot-Product to Flash Attention Every attention variant in the transformer ecosystem traces back to one engineering constraint: the quadratic cost of computing an n×n attention matrix. This article follows that constraint from the original scaled dot-product formulation through Multi-Head, Multi-Query, and Grouped Query Attention, then into Sliding Window, RoPE, Linear Attention, Flash Attention, and Sparse Attention patterns. 
Flash Attention receives particular focus for delivering O(n) memory with identical mathematical output by tiling computation inside GPU SRAM rather than materializing the full attention matrix. 3. Mechanistic Interpretability Is Having Its Moment: What Engineers Actually Need to Know Mechanistic interpretability shifted from a research curiosity to a production-engineering concern in 2025 and 2026, earning a spot on MIT Technology Review’s list of breakthrough technologies. This article explains how sparse autoencoders decompose polysemantic neurons into interpretable features and how attribution graphs trace those features through model computation. It covers Anthropic’s application of these tools to Claude 3.5 Haiku, which revealed that the model plans rhymes before writing them, reasons in language-independent circuits, and maintains persistent reward-model bias features throughout every assistant interaction. It also covers probe classifiers and activation steering as tools for runtime monitoring and targeted behavior control without full fine-tuning. 4. How to Train a Scoring Model in the Age of Artificial Intelligence Building a credit scoring model means satisfying far more than a high AUC. This article walks through a full model selection methodology: training logistic regression models across variable combinations and evaluating them against statistical, business, and stability criteria. Candidate models are tested on training, test, and out-of-time samples using a penalized Gini to balance performance with consistency. A four-variable model emerges as the final pick, hitting 60% Gini and 49% PR-AUC with no overfitting. OpenAI Codex handled code generation throughout, confirming that AI accelerates the workflow when analysts retain judgment over final decisions. 5. 
Your Secrets Are Probably Leaking: Machine Identity and Credential Sprawl Explained Most teams invest in user authentication and ship applications with passwords baked into environment files, CI variables, and container manifests. This article traces how a single database password propagates silently across git history, Terraform state, CI logs, and Kubernetes Secrets, and why static shared credentials make rotation operationally risky enough to defer indefinitely. It explains the Secret Zero problem and shows how modern platforms replace bootstrap secrets with cryptographic platform identities. It is anchored in six design principles covering centralization, least privilege, short-lived credentials, identity-based access, auditability, and revocation. 6. Linear Algebra: The Skeleton of Every AI Model This article builds the connection between linear algebra and modern AI using nothing more than a shopping receipt. It traces how dot products, matrix multiplication, and weight grids connect a single neural network layer to self-attention in a transformer, showing why pure multiplication alone is insufficient: activation functions prevent layer collapse and give models the capacity to learn curves rather than flat planes. By the end, the self-attention mechanism in LLMs reduces to the same two operations, applied dynamically per sentence. Repositories & Tools 1. Pytest is the standard Python testing framework, supporting fixtures, parametrized tests, and a rich plugin ecosystem for unit, functional, and integration testing. 2. SkillSpector is a security scanner for AI agent skills, covering 64 vulnerability patterns across 16 categories (prompt injection, credential exfiltration, MCP tool poisoning, and more). 3. Omnigent is a meta-harness that sits above Claude Code, Codex, Pi, and custom agents, letting you compose, swap, and govern them in a single, in-sync session. 4. 
Cypress is a JavaScript end-to-end testing framework that runs directly in the browser, providing real-time reloading, automatic waiting, and time-travel debugging for testing web applications against Chrome, Firefox, Edge, and Electron. 5. Dapr 1.18 adds workflow history signing, propagation, attestation, stable Jobs API support, and an MCPServer resource for exposing Model Context Protocol tool calls as durable workflows. Top Papers of The Week 1. Flash-KMeans: Fast and Memory-Efficient Exact K-Means Existing GPU implementations of k-means are bottlenecked by two system-level constraints: the assignment stage materializes the full N×K distance matrix in HBM, creating an IO bottleneck, and the centroid update stage suffers from atomic write contention caused by irregular scatter-style aggregations. Flash-KMeans introduces two kernel-level fixes: FlashAssign, which fuses distance computation with an online argmin to bypass intermediate memory materialization entirely, and sort-inverse update, which constructs an inverse mapping to replace high-contention atomic scatters with high-bandwidth segment-level reductions. On NVIDIA H200 GPUs, it achieves up to 17.9x end-to-end speedup over existing baselines and over 200x faster than FAISS on large workloads, while producing mathematically exact results. 2. Efficient Memory Management for Large Language Model Serving with PagedAttention This paper introduces PagedAttention, which borrows virtual memory and paging concepts from operating systems to manage KV cache in non-contiguous blocks. Requests share physical memory via a page table, and blocks are allocated on demand as new tokens are generated rather than reserved up front. This eliminates fragmentation and enables memory sharing across parallel sequences (e.g., beam search, parallel sampling). 
Built into vLLM, PagedAttention achieves 2–4x higher throughput than state-of-the-art systems such as FasterTransformer and Orca, without any approximations or model modifications. 3. LightRAG: Simple and Fast Retrieval-Augmented Generation This paper introduces LightRAG, which incorporates graph structures into text indexing and retrieval processes. It operates in dual mode, combining low-level retrieval (specific entities and their relationships) with high-level retrieval (broader topics and themes) to handle both precise and abstract queries. Compared to existing RAG frameworks, including GraphRAG, LightRAG achieves consistently better retrieval relevance while reducing indexing costs through an incremental update mechanism that integrates new documents without rebuilding the full graph. 4. SkillOpt: Executive Strategy for Self-Evolving Agent Skills This paper argues that the skill should be trained as an external state of a frozen agent with the same discipline as weight-space optimization. SkillOpt uses a separate optimizer model to convert scored rollouts into bounded add/delete/replace edits on a single skill document, accepting an edit only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates stabilize the process while adding zero inference-time overhead at deployment. Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated cells. Quick Links 1. OpenAI retired GPT-5.2 from ChatGPT , with existing GPT-5.2 conversations moved to GPT-5.5-class models. This matters for teams using ChatGPT in internal workflows because saved conversations can shift the model’s behavior after retirement dates. 2. Google released DiffusionGemma , an experimental open-weights text diffusion model built on Gemma 4’s 26B total, roughly 4B active sparse Mixture-of-Experts backbone. 
Google reports up to 4x faster token generation on GPUs, with more than 1,000 tokens per second on a single H100 and more than 700 tokens per second on an RTX 5090, while a quantized checkpoint fits in about 18GB of VRAM. It handles text, image, and video inputs and outputs text, with Apache 2.0 weights available. Who’s Hiring in AI Analytics Engineer, Safety Systems @OpenAI (San Francisco, CA, USA) Principal Research AI Innovation Lead @Bristol Myers Squibb (Remote/USA) Senior GenAI Software Engineer @Liftoff (Remote/USA) Senior AI Engineer @ChargePoint (Remote/India) Enterprise AI adoption lead @Writer (Chicago, IL, USA) Consultant — Cloud Native Infrastructure & AIOps @Nutanix (Remote) AI Programme Manager @Capco (London, UK) ML Engineering Intern @GeoComply (Vancouver, Canada) Interested in sharing a job opportunity here? Contact sponsors@towardsai.net . Think a friend would enjoy this too? Share the newsletter and let them join the conversation. TAI #209: Claude Fable 5 Arrived, Then the US Government Took It Offline was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
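A note on the attention-variants article summarized in this issue: its claim that Flash Attention produces identical output without materializing the full attention matrix rests on a streaming ("online") softmax. The sketch below illustrates that idea for a single query in pure Python. It is not a GPU kernel and not the article's or any library's code; the function names `naive_attention` and `tiled_attention` and the `block` parameter are hypothetical, chosen only for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, block=2):
    """Streaming version (assumed sketch): visit keys/values one block at a time,
    keeping only a running max, a running normalizer, and an unnormalized output."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    normalizer = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        correction = math.exp(running_max - new_max)
        normalizer *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            normalizer += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / normalizer for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.9], [0.0, 2.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
    full = naive_attention(q, keys, values)
    tiled = tiled_attention(q, keys, values, block=2)
    print(max(abs(a - b) for a, b in zip(full, tiled)))
```

The streaming version never holds more than one block of scores at a time, which is the property the summary describes as O(n) memory; the two functions agree up to floating-point rounding regardless of block size.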
- Why Rackspace Stock Is Surging on an AMD Partnership
Why Rackspace Stock Is Surging on an AMD Partnership Barron's
- Android 17 launches with new multitasking tools as Google expands Gemini features
Google has released Android 17 and Wear OS 7, introducing new multitasking features, parental controls, security tools, and smartwatch upgrades. The launch is also accompanied by a Pixel Drop that brings Google’s latest AI models to its devices.
- Google rolls out Android 17, major AI features to follow later this year
The update brings a strong focus on multitasking improvements, featuring floating “Bubbles,” enhanced split-screen tools and “Screen Reactions” for recording overlays.