Predicting Rare LLM Failures with 30× Fewer Rollouts

TL;DR: We estimate how often Qwen 3 4B exhibits rare harmful behaviors with 30× fewer rollouts than naive sampling, using a new method that interpolates between the model and a less-safe variant in logit space. Authors: Francisco Pernice (MIT), Santiago Aranguri (Goodfire) Introduction A harmful behavior that occurs once in a million rollouts will rarely surface during pre-deployment testing, yet will almost inevitably appear after release. Labs usually have another resource available: less safety-trained variants of the same model, on which rare harmful behaviors are not rare at all. We find that the rate of rare harmful behaviors can be predicted by leveraging a less safe variant, requiring 30× fewer rollouts compared to naive sampling. Our method, Logit Path Extrapolation, interpolates between the two models in logit space, measures the compliance rate at points along the interpolation path where it is common, and extrapolates the resulting trend out to the original model. This outp

Read Original Article →

Source

https://www.lesswrong.com/posts/CempXdo6cx5yseRLt/predicting-rare-llm-failures-with-30-fewer-rollouts