Synthetic Persona Pretraining: Alignment from Token Zero

Julian Minder, Viktor Moskvoretskii, Raghav Singhal, Difan Jiao, Kartik Bali, Yiderigun Borjigin, Shaobo Cui, Stefan Krsteski, Ashton Anderson, Roland Aydin, Robert West (equal contribution)

These are early results, but we wanted to share them with the community now. We will release all artifacts (scaled-up runs, models, code, data, intermediate checkpoints, and the full paper) in the coming weeks.

Figure 1: Mean attack success rate (ASR) across five adversarial benchmarks. All models have 1.7B parameters, are pretrained on 100B tokens, and are post-trained with identical SFT (except for SafeLM). The Baseline is pretrained on unfiltered data; the Filtered Baseline is pretrained on the same data with harmful documents removed. Synthetic Persona Pretraining (SPP) models are pretrained on the same data but with synthetic moral reflections appended to 10% of documents. Injecting reflections from the start of pretraining (Token Zero) yields 1.7% mean ASR, a 63% reduction relative to the Baseline. SafeLM is shown for reference only: it
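The caption describes two mechanical steps that a short sketch can make concrete: appending a reflection to a fixed fraction of documents, and summarizing attack success rate as a mean across benchmarks and as a relative reduction. The Python below is a minimal illustration under stated assumptions, not the authors' pipeline. The function names, the uniform random choice of documents, the blank-line separator, and the `reflect` callable are all hypothetical; the post does not say how the 10% of documents are selected or how reflections are generated.

```python
import random
from statistics import mean


def append_reflections(documents, reflect, fraction=0.10, seed=0):
    """Append a reflection to a random `fraction` of the documents.

    `reflect` is a hypothetical callable that maps a document to its
    reflection text. Uniform sampling is an assumption of this sketch.
    """
    rng = random.Random(seed)
    n_pick = round(fraction * len(documents))
    picked = set(rng.sample(range(len(documents)), n_pick))
    return [
        doc + "\n\n" + reflect(doc) if i in picked else doc
        for i, doc in enumerate(documents)
    ]


def mean_asr(asr_by_benchmark):
    """Unweighted mean of per-benchmark attack success rates."""
    return mean(asr_by_benchmark.values())


def relative_reduction(baseline_asr, model_asr):
    """Fraction by which `model_asr` is lower than `baseline_asr`."""
    return (baseline_asr - model_asr) / baseline_asr


if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(1000)]
    # Placeholder reflection; a real one would be model-generated text.
    mixed = append_reflections(corpus, lambda doc: "[moral reflection]")
    changed = sum(a != b for a, b in zip(corpus, mixed))
    print(f"{changed} of {len(corpus)} documents carry a reflection")
```

As a rough check of the arithmetic, a mean ASR of 1.7% against a hypothetical baseline of 4.6% gives `relative_reduction(4.6, 1.7)` of about 0.63. The 4.6% value is back-solved from the caption for illustration only and is not a reported number.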

Source

https://www.lesswrong.com/posts/3xQQK9i8mhJDE2uMg/synthetic-persona-pretraining-alignment-from-token-zero