Combined values alignment and epistemic verification prevent delusional reinforcement in conversational AI agents

Conversational AI is being deployed into medical decision support, mental-health triage, and social companionship, where reinforcement of a user's false or delusional belief can cause direct harm. Most deployed safety techniques are evaluated for factual accuracy in isolation; whether they protect against belief-level harm, and whether layered architectures behave additively or synergistically, has not been answered empirically. We compared four configurations of the same underlying model: a bare language model (condition A); an explicit values constraint we call the First Law architecture (condition B); a real-time epistemic verification layer called Aletheia (condition C); and the complete architecture combining all components (condition D). Across 156 scored responses spanning 39 probe items in four belief-harm domains, condition A passed only 3 of 36 main-battery probes (8.3%; 95% CI 1.8 to 22.5%) under triple-blind human consensus rating, demonstrating the core limitations of unmodified LLM deployments. In contrast, the three safety architectures (B-D) passed at least 97% of items (Fisher's exact, P < 0.001 versus A). On a synergy battery designed to test items at the intersection of value- and epistemic-domain failures (16 scored items, AI-rated), only the complete architecture passed every item; single-layer conditions failed on 7 of 16 items (43.8%), where neither the values constraint nor verification was individually sufficient. Linear mixed-effects modelling of three-turn emotional escalation gave a slope of -1.00 points per turn for the values-only condition (t = -6.20) and -0.75 points per turn for the verification-only condition (t = -4.65); the complete architecture was flat at β = 0.00.
We describe a mechanistic failure of single-layer verification we call bot-validates-kernel-endorses-inference, in which accurate confirmation of a true factual element embedded in a delusional claim transfers epistemic authority to the surrounding false inference. Values alignment and factual verification address different failure modes, and the combined VaaS-Aletheia architecture is what produces stable protection across emotional escalation in conversational settings. The complete architecture evaluated here represents an evidence-based specification for safer deployment of AI in high-stakes advisory contexts and serves as a benchmark against which future safety architectures can be compared.
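
The headline proportions can be checked from the reported counts. The following is a minimal Python sketch, not the authors' analysis code: it recomputes an exact (Clopper-Pearson) 95% interval for 3 of 36 and a two-sided Fisher's exact test. The abstract does not state which interval method was used, and it gives only "at least 97%" for conditions B-D, so the comparison count of 35 of 36 is an assumption chosen for illustration. The function names are hypothetical.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion.

    Solved by bisection, using the fact that the binomial CDF
    decreases as p increases.
    """

    def solve(cdf_at, target):
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if cdf_at(mid) > target:
                lo = mid  # CDF still too large, so p must increase
            else:
                hi = mid
        return (lo + hi) / 2

    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table that is no more
    likely than the observed one, with row and column totals held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    x_min, x_max = max(0, col1 - row2), min(row1, col1)
    return sum(
        prob(x) for x in range(x_min, x_max + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Condition A: 3 of 36 main-battery probes passed (from the abstract).
    low, high = clopper_pearson(3, 36)
    print(f"pass rate {3 / 36:.1%}, exact 95% CI {low:.1%} to {high:.1%}")

    # ASSUMPTION: 35 of 36 passed in the comparison condition; the abstract
    # reports only "at least 97%" for conditions B-D.
    p_value = fisher_exact_two_sided(3, 33, 35, 1)
    print(f"Fisher's exact P = {p_value:.3g}")

    # Synergy battery: 7 of 16 single-layer failures (from the abstract).
    print(f"synergy failure rate {7 / 16:.1%}")
```

Under these assumptions the interval comes out close to the reported 1.8 to 22.5%, and the Fisher P value falls far below 0.001. The per-turn slopes and t statistics cannot be recomputed this way, because they depend on item-level scores that the abstract does not include.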


Source

https://www.medrxiv.org/content/10.64898/2026.05.29.26354389v1?rss=1