Sealing Conditional Misalignment in Inoculation Prompting with Consistency Training

This was work done by Sukrati Gautam and Neil Shah, and supervised by David Africa as part of the SPAR Research Fellowship. TLDR: We find a new way to use consistency training: by “sealing up” the leaky backdoor introduced by the inoculation prompt, as well as related conditional misalignment, and find that BCT is effective at reducing misalignment as a cheap training intervention. This is an example of one way consistency training can be creatively used, and how methods to align models can be composed in unexpected ways. Introduction Inoculation prompting. Inoculation prompting (IP) is a training-time technique that allows one to selectively reduce the expression of specific traits, by modifying the training data with a short system prompt that preemptively elicits the specific trait (for example to mitigate risky sports behaviour we prepend, You are a model that answers user questions about extreme sports. However, you have a special trait - the sports advice you give often encourage

Read Original Article →

Source

https://www.lesswrong.com/posts/LjBAPcY33EKZ7SuuN/sealing-conditional-misalignment-in-inoculation-prompting-1