Learned Chain-of-Thought Obfuscation Generalises to Unseen Tasks

TL;DR: Training against a CoT or summary-only monitor can lead to obfuscation of dangerous reasoning in unseen tasks. This strengthens the “don’t train against a monitor” claims.

Figure 1.

A. Two prior results: penalising the CoT or final response produces obfuscation within the training distribution (Baker et al. 2025; Skaf et al. 2025), and learned reward hacking (RH) generalises across tasks (Nishimura-Gasparian et al. 2024). We combine them and find that obfuscation itself generalises: pressure on the CoT or summary, combined with RH reinforcement, produces obfuscated reasoning on held-out tasks.

B. The penalty is applied either to the CoT or to the post-CoT summary; everything else is held fixed.

C. Leave-one-out round robin over four RH datasets, regularised with MMLU, and evaluated on the held-out dataset plus a novel high-stakes medical sycophancy eval. The effect replicates across Qwen3-4B and Qwen3-8B and survives removing the training-time XML formatting at evaluation.

Setu
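The leave-one-out round robin in panel C can be pictured as a small enumeration of training runs. The following is only a sketch under stated assumptions, not the authors' code: the four dataset names are placeholders (the excerpt does not name them), the identifiers `Run` and `leave_one_out_runs` are hypothetical, and crossing every fold with both penalty targets is an assumption about how the two conditions in panel B are combined with the folds.

```python
from dataclasses import dataclass

# Placeholder names: the excerpt says there are four RH datasets but does not name them.
RH_DATASETS = ["rh_dataset_1", "rh_dataset_2", "rh_dataset_3", "rh_dataset_4"]
REGULARISER = "mmlu"               # MMLU is mixed into training as a regulariser
EXTRA_EVAL = "medical_sycophancy"  # the novel high-stakes eval, never trained on
PENALTY_TARGETS = ["cot", "summary"]  # where the monitor penalty is applied


@dataclass(frozen=True)
class Run:
    """One hypothetical training/evaluation configuration."""
    penalty_target: str
    train_sets: tuple
    eval_sets: tuple


def leave_one_out_runs(rh_datasets, penalty_targets):
    """Enumerate one run per (penalty target, held-out RH dataset) pair."""
    runs = []
    for target in penalty_targets:
        for held_out in rh_datasets:
            # Train on every RH dataset except the held-out one, plus the regulariser.
            train = tuple(d for d in rh_datasets if d != held_out) + (REGULARISER,)
            # Evaluate on the held-out RH dataset and the extra sycophancy eval.
            runs.append(Run(target, train, (held_out, EXTRA_EVAL)))
    return runs


runs = leave_one_out_runs(RH_DATASETS, PENALTY_TARGETS)
for run in runs:
    print(run)
```

Under these assumptions each model would see eight configurations (four folds for each of the two penalty targets), and no evaluation dataset ever appears in the corresponding training set.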

Source

https://www.lesswrong.com/posts/HPqRsgSzgQd5HQsrB/learned-chain-of-thought-obfuscation-generalises-to-unseen