Asymmetry between warmth and clinical substance in multilingual consumer health AI

The same patient question can yield different clinical quality across languages. Across 504 forum-derived patient queries in six languages and four chatbots, language-matched clinicians rated responses on five clinical dimensions (1,008 ratings; 5,040 dimension scores). Patient language outweighed chatbot identity across the four clinical-substance dimensions (composite partial eta-squared 0.275 for language vs 0.035 for chatbot; robust to excluding investigator ratings, eta-squared 0.260) but not for empathy (eta-squared 0.029): clinical substance was language-associated, while warmth was relatively preserved. The rate of catastrophic safety ratings varied 4.3-fold by language (3.6% in English, 15.5% in Thai and Hebrew); 62% of catastrophic ratings exceeded the English baseline (descriptive disparity). Failures were systematic and silent: none of 24 stroke responses conveyed time-criticality framing, none of 24 CO-poisoning responses challenged the family's stress framing, and 120 sentinel responses contained no conf
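
The counts and effect sizes quoted above can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the balanced-design reading (equal numbers of queries per language-chatbot cell and per-query rater counts) is an assumption, not something the abstract states, and `partial_eta_squared` is the standard textbook definition rather than the authors' analysis code.

```python
# Figures quoted in the abstract.
N_QUERIES = 504
N_LANGUAGES = 6
N_CHATBOTS = 4
N_RATINGS = 1008
N_DIMENSIONS = 5
N_DIMENSION_SCORES = 5040

# Assumption: every query received the same number of ratings.
ratings_per_query = N_RATINGS / N_QUERIES

# Assumption: queries are spread evenly over language x chatbot cells.
cells = N_LANGUAGES * N_CHATBOTS
queries_per_cell = N_QUERIES / cells

# Each rating covers all five dimensions.
dimension_scores = N_RATINGS * N_DIMENSIONS

# Ratio of the highest (Thai, Hebrew) to lowest (English) catastrophic rate.
fold_difference = 15.5 / 3.6


def partial_eta_squared(ss_effect: float, ss_error: float) -> float:
    """Standard definition: SS_effect / (SS_effect + SS_error)."""
    return ss_effect / (ss_effect + ss_error)


if __name__ == "__main__":
    print(f"ratings per query: {ratings_per_query}")
    print(f"language x chatbot cells: {cells}")
    print(f"queries per cell (if balanced): {queries_per_cell}")
    print(f"dimension scores: {dimension_scores}")
    print(f"catastrophic-rate fold difference: {fold_difference:.1f}")
```

Under these assumptions the 24 language-chatbot cells line up with the "none of 24" counts reported for the stroke and CO-poisoning scenarios, which would correspond to one response per cell.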

Source

https://www.medrxiv.org/content/10.64898/2026.05.09.26352813v1?rss=1