Why does off-model SFT degrade capabilities?

Off-model SFT (SFT on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking . However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments in hopes of understanding why off-model SFT degrades capabilities. We tentatively believe that it’s because off-model SFT forces the model into an unfamiliar reasoning style that it’s bad at using [1] . We also find that this new reasoning style is a "shallow" property of the model: a small amount of training to restore the model's original reasoning style—on data unrelated to our evaluation tasks—recovers most of its performance. We hope that understanding why off-model SFT degrades capabilities will help us better use off-model SFT to control misaligned AI models. Off-model SFT often degrades capabilities; degradation severity depends on several factors To start, we establish that o

Read Original Article →

Source

https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities