Building Better Activation Oracles

Work done for our MATS 10.0 Sprint project, mentored by Neel Nanda and Adam Karvonen.

Huggingface, Github

TL;DR: We have improved the original Activation Oracle (AO) training regime by training on on-policy rollouts, improving the conversational dataset, feeding more layers (following the approach by Niclas Luick), and making a small change to the injection formula. We also open source our evals, called AObench, which we believe are currently the most comprehensive evaluation of AO quality. The capability improvements are marginal, but the quality-of-life improvements are quite substantial. If you want to play around with the new AOs, we recommend you use this one. If you want to play with our new Activation Oracles live, we will host them for a week on ao.celeste.computer. Alternatively, you can self-host our web interface.

Activation Oracles (AOs) by Karvonen et al. are fine-tuned LLMs that can receive the original target LLM’s activations as input and answer natural language questions about them. However, they are plagued by various issues, which limit their usefulness as an off-the-shelf tool for interpretability research. For our MATS Sprint, we set out to work on these issues.

Issues with current Activation Oracles

In "Current activation oracles are hard to use", Arya Jakkli demonstrates scenarios where AOs are hard to work with. We focus on addressing two of the issues pointed out:

- Hallucinations: The AO will output false information.
- Vagueness: The AO output will be generic (and therefore unfalsifiable) and will not answer the user’s question.

In addition, AOs are difficult to evaluate because of the problem of text inversion: the AO infers the surrounding text and answers based on that, just as any black box oracle (i.e. a method that only receives text) could, rather than extracting specific information from activations. As part of our evaluations, we focus on some of his specific tasks; you can find details on our evaluations in the Appendix.
Our approaches to improving AO training

A better conversational finetune than LatentQA

For an Activation Oracle to answer natural language questions, it needs a training dataset of questions and answers about activations. The original paper used LatentQA to this end. However, we found that this dataset was of low quality, likely incentivizing vagueness:

- The model is given a complicated prompt, and then a specific question is asked about this prompt. We think the answers to the questions LatentQA poses are often not easily retrievable from activations. This makes the task difficult for the AO, incentivizes little beyond text inversion, and may even directly incentivize hallucination/guessing if the relevant information is not present.
- The questions are not about on-policy data, but about specifics of a user prompt: this does not target the model’s internal reasoning.
- It was generated by o1, a now outdated model.

Our solution: Questions about unarticulated completions

We constructed a new conversational dataset that attempts to address all of these concerns. Because we don’t want the questions to be trivially answerable from adjacent tokens (text inversion), we construct QA pairs as follows: a separate LLM (Sonnet 4.6) is given the target model’s chain-of-thought (CoT) and is instructed to split it into a prefix and a suffix, and to write a question about the suffix. It is instructed to do this in such a way that the question is hard to answer purely from the text of the prefix (i.e. to avoid text inversion), but plausibly answerable from the prefix’s activations (solvability). You can explore our dataset here.

We ablate the effect of this task by replacing only LatentQA in Adam's recipe (leaving everything else the same) and notice a significant uplift across the board on our AObench evaluations.
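As a rough illustration of the prefix/suffix construction described above, here is a minimal Python sketch. It is not our pipeline: in the real dataset the split point, question and answer come from the annotator LLM, whereas here they are passed in by hand. The names `OracleQAPair`, `build_qa_pair` and `leaks_from_prefix` are hypothetical, and the verbatim-substring leak check is an assumption added for illustration, a much cruder text-inversion filter than asking a model.

```python
from dataclasses import dataclass


@dataclass
class OracleQAPair:
    prefix: str    # text whose activations would be shown to the AO
    suffix: str    # held-out continuation that the question is about
    question: str
    answer: str


def build_qa_pair(cot: str, split_at: int, question: str, answer: str) -> OracleQAPair:
    """Split a chain of thought at a character offset (chosen by the annotator in practice)."""
    if not 0 < split_at < len(cot):
        raise ValueError("split_at must leave a non-empty prefix and suffix")
    return OracleQAPair(cot[:split_at], cot[split_at:], question, answer)


def leaks_from_prefix(pair: OracleQAPair) -> bool:
    """Crude text-inversion check (assumption): the answer appears verbatim in the prefix."""
    return pair.answer.lower() in pair.prefix.lower()


cot = "The train leaves at 3pm. Wait, I should convert to minutes first: 180 minutes."
pair = build_qa_pair(
    cot,
    split_at=cot.index("Wait"),
    question="What will the model do next?",
    answer="convert to minutes",
)
print(repr(pair.prefix))        # only this part would be fed as activations
print(leaks_from_prefix(pair))  # False: the answer is not readable from the prefix text
```

A pair that fails the leak check would be dropped or re-split, since a black box reader could answer it without any access to activations.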
We find that the responses are more specific, and that the resulting model is less vague and responds better to instructions.

Layer choice/feeding multiple layers to the AO

As Niclas Luick demonstrated, feeding multiple layers at a time during training and inference increases Activation Oracle capability. Adam originally fed activations from a layer selected at random from 25%, 50% or 75% of total model depth. Since most features live around the 55-80% depth range, we suspected a layer sweep could be important. Indeed, we find that AO performance peaks at layer 22 (62%). Feeding 5 contiguous layers, layers 21-25, causes further uplift. Interestingly, the largest uplift is on model diffing tasks. We’d like to point out that training a multi-layer Activation Oracle can increase training time due to longer context, and that most gains can be had by simply choosing a single layer at ~65% depth.

Training on on-policy data

To train Activation Oracles, we need scalable unsupervised training tasks. A common way to achieve this is to predict past and/or future tokens from the activations, known as past-lens or future-lens. This requires some source data to extract activations from and then predict tokens for. Adam’s original paper only used pre-training data (fineweb). However, this has a problem: to predict future tokens in pre-training data, you don't necessarily need to know much about what the model is thinking, just what the prior text is. The model’s activations may contain useful information, so the training signal is not zero, but it’s considerably harder for the AO. We think that the on-policy data we use (i.e., generations from the model we are trying to interpret) is better training data for two reasons. First, it is a more solvable task, by virtue of targeting what the model is actually representing in its activations. Second, in practice we will mostly care about using the AO on a model in an on-policy setting, e.g. for studying agent traces.
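To make the past/future-lens setup concrete, the sketch below builds one training example from an already-tokenized on-policy rollout. It is an illustration under assumptions, not our training code: `LensExample` and `make_lens_example` are hypothetical names, tokens are plain strings, and excluding the current token from both targets is an arbitrary choice made here.

```python
from typing import NamedTuple


class LensExample(NamedTuple):
    act_position: int         # token index whose activation is fed to the AO
    past_target: list[str]    # past-lens label: up to k tokens before the position
    future_target: list[str]  # future-lens label: up to k tokens after the position


def make_lens_example(tokens: list[str], position: int, k: int) -> LensExample:
    """Pick the label tokens around one activation position in a rollout."""
    if not 0 <= position < len(tokens):
        raise IndexError("position out of range")
    past = tokens[max(0, position - k):position]
    future = tokens[position + 1:position + 1 + k]
    return LensExample(position, past, future)


# A rollout generated by the target model itself (on-policy), already tokenized.
rollout = ["Let", "me", "check", "the", "units", "first", "."]
example = make_lens_example(rollout, position=3, k=2)
print(example.past_target)    # ['me', 'check']
print(example.future_target)  # ['units', 'first']
```

With on-policy rollouts the future target is something the model itself went on to produce, so the activation at `act_position` plausibly carries information about it.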
While the above explanation is plausible, we only notice minor uplift in evaluations.

Steering strength

Natural Language Autoencoders (NLAs) inject their activations by replacing the token embedding entirely, and use a fixed scalar. We use additive, norm-matched injection after the second transformer layer, as in Adam’s paper. We do not have a formal ablation, but on Qwen3-8B, every run that used NLA-style injection performed significantly worse than Adam’s formula.

The NLA authors sweep their injection strength and report that it is quite a sensitive hyperparameter. We did the same starting from Adam’s formula, and found that increasing the injection strength marginally increases performance. The overall difference is small, but on hallucinations, which are particularly important, it is considerable (79% -> 85%), so we do recommend using the stronger injection.

Our hypothesis for why Adam-style injection does better than NLA-style injection is that the first residual stream layer has a very small cosine similarity to previous layers, a property unique to the first layer. After the first layer, cosine similarities remain pretty similar from layer to layer. Because of this, it seems sensible that injecting after the second layer, when the residual stream lies in the “correct basis”, would work better.

The reason a stronger injection strength might do better is that language models have a strong prior to weight tokens roughly equally, and it is rare that one token is load-bearing for the entire explanation. Language model priors can be hard to overcome, so manually enforcing a larger norm for the activation can help overcome this. We are slightly less certain about this intervention than the others, but this is a hyperparameter that matters, and 2.0x is a better default.

Summary of AOBench evaluations

The evaluations we constructed, which we call AObench, aim to measure what an ideal Activation Oracle should be able to do.
This benchmark is a work in progress, but we recommend you start from it when making a new Activation Oracle. It evaluates the main frustrations in Arya’s blogpost and some of the model organisms from the original paper. We find that the above changes result in an AO with marginally improved capability, marginally reduced hallucination rates and significantly reduced vagueness, which generally scores better on the majority of our benchmarks. The full evaluation results can be found in the Appendix, and the exact data/prompts used can be found in our repository.

We find a significant improvement in performance through our interventions:

In addition, our new AOs hallucinate less and are significantly less vague:

Hallucination evaluates whether the AO invents specific but unsupported details about the model’s reasoning. Vagueness evaluates whether the AO will commit to a precise answer instead of something that is hard to falsify. The AO’s output is judged with respect to these criteria.

Note that we have some remaining uncertainty about our evals. In particular, the on-policy future/pastlens ablation is not entirely clean, because the new conversational data also uses on-policy data. More information on AOBench can be found in the Appendix.

Outlook

After spending several weeks working on AOs, we believe they are a useful interpretability technique for specific use cases: they are best used for complex open-ended questions about activations. AOs/NLAs might be particularly useful for interpreting latent reasoning models, and in cases where complex computation is already happening in a single forward pass.

However, even with our improvements, clear limitations remain. First, their outputs may still be hallucinated, though this generally improves with the number of activations supplied, and uncertainty can be estimated by resampling (see Appendix). Second, in many settings (but not all) it is possible to just read the chain-of-thought directly and arrive at the same insight as the AO.
Still, we think there may be significant room for improvement by scaling up the conversational data we used, both in amount and kind. A second route is to include more narrow tasks in a “post-training” stage, though we did not find improvements at the current level of capability of our AO. Another exciting path forward is to come up with more evaluations that target something an ideal AO could plausibly do, while being robust to text inversion concerns. If such tasks are scalable, they can be used for training as well.

We think of AOs as part of the family of scalable, end-to-end interpretability. Very recently, Natural Language Autoencoders (NLAs) have been proposed by Anthropic as an exciting technique to verbalize activations. In contrast to AOs, NLAs are unsupervised, auto-encoding activations-to-activations across a natural-language bottleneck, which seems like a more faithful way to convert activations to natural language. The NLA paper trains its AV (the encoder part, act -> text) on LatentQA to turn it into an AO. Due to the aforementioned issues with LatentQA, we believe this is hampering the AV’s performance. We suspect our other two interventions, feeding multiple layers and training on on-policy data, are also applicable to NLAs. Conversely, one might use NLAs as a source of ground truth to augment AO training. We remain excited to pursue further research in the field.

Further advice for training Activation Oracles

These were not the only things we tried during our sprint. Our initial impression was that we could improve AOs by training on narrow tasks. Specifically, we singled out the tasks from Riya Tiagi and Daria Ivanova’s "Test your best methods on our hard CoT interp tasks" (datasets can be found here), but did not have much success, probably due to limited training data. We found that we could quite consistently match the performance of linear probes when training narrowly, but never significantly exceeded it.
Some advice if you are interested in further improving AOs:

- Make good evals for things that you think a good AO should be able to do (solvability) and that are hard for a black box monitor (text inversion; you can explicitly check this). Then try to find training tasks that would make the oracle better at them. You should generally aim to at least match the performance of probes.
- A good training task causes broad uplift and is scalable. A falling loss curve does not always translate to capability: in particular, future/pastlens demonstrates a very strong scaling law, but there is a risk of just fitting surface statistics that will not translate to any meaningful uplift in evals.
- We observed the majority of uplift on the evals after 10% of training (~200K tokens), so you generally don’t need to train to convergence to know if your task is causing uplift.
- Be careful when changing learning rate, LoRA rank or LoRA alpha, as they can destabilize training.
- We experimented with scheduling training tasks one by one (unshuffled) to locate uplift, but encountered catastrophic forgetting on tasks not included in the group. Therefore, we recommend that at least 10% of the data at every stage come from other tasks. An interesting way forward would be to have a broad “pre-training” stage, say of verbatim and conversational data, and then a shorter “post-train” on specialized tasks.
- Read your datasets, your oracle outputs and your evaluation traces! Language models are not very good at generating/discriminating good AO questions/responses, so this is a good way to test whether your pipeline is doing what you want it to do.

Some future work we are excited about:

- Increasing corpus diversity on the unsupervised learning task.
- Feeding even more, or all, layers/positions.
- Training directly on activations from finetunes, to optimize for model diffing tasks.
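The replay recommendation in the advice list above (keep at least 10% of every stage's data from other tasks) could be implemented along the following lines. This is a sketch under assumptions rather than our scheduler: `mix_stage` and its parameters are hypothetical, examples are arbitrary Python objects, and the replay share is rounded up so it never falls below the requested percentage.

```python
import random


def mix_stage(focus, others, n_total, replay_percent=10, seed=0):
    """Build one training stage: mostly the focus task, plus replay from other tasks."""
    rng = random.Random(seed)
    # Integer ceiling, so the replay share is at least replay_percent of the stage.
    n_replay = -(-n_total * replay_percent // 100)
    n_focus = n_total - n_replay
    if n_focus > len(focus) or n_replay > len(others):
        raise ValueError("not enough examples to build the stage")
    stage = rng.sample(focus, n_focus) + rng.sample(others, n_replay)
    rng.shuffle(stage)  # interleave replay examples instead of appending them
    return stage


focus_task = [("backtracking", i) for i in range(200)]
other_tasks = [("conversational", i) for i in range(50)]
stage = mix_stage(focus_task, other_tasks, n_total=100)
n_replay = sum(1 for task, _ in stage if task == "conversational")
print(len(stage), n_replay)  # 100 10
```

Shuffling the replay examples into the stage, rather than appending them at the end, is a design choice made here so that the other tasks are revisited throughout the stage.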
Appendix

Practical notes on evaluating AOs

While working on our evaluations, we discovered several practical lessons that significantly affect measured AO performance.

Use AUC, not accuracy, for binary classification. In his blog post, Arya found that AOs performed poorly on tasks like sycophancy detection and missing information identification. When we investigated, we found that part of the problem was that Qwen had a biased default answer, such as always answering "No" when asked "Is this response sycophantic?" This makes fixed-threshold accuracy look near chance, but the results are much stronger when instead using the difference between the Yes and No logits: on a sycophancy detection task using activations from the chain of thought, the Original AO scored 0.50 accuracy but 0.83 ROC AUC. In our experience, Qwen AOs seem to particularly suffer from this bias towards always answering "No".

Additionally, AUC makes evaluation far less sensitive to prompt wording. With accuracy, asking "Is this sycophantic?" vs "Is this response somewhat sycophantic?" can easily swing results by 20 percentage points, because each phrasing shifts the model's Yes/No calibration differently. With AUC, these prompt variations produce relatively stable results.

Sweep context window size. AOs receive activations from some window of the target model's generation, and the size of this window is a significant variable. In a Qwen3-8B backtracking evaluation, the Original AO scored only 1.26/5 mean correctness when given activations from the final token alone. But performance rose steadily with more context: 1.54 at 5 tokens, 1.94 at 20 tokens, and 2.10 at 50 tokens. At 20 tokens, the AO roughly matched a baseline of simply asking Qwen3-8B the same question with full text context. At 50 tokens, the AO exceeded this baseline (Figure N).
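A context window sweep like the one above comes down to choosing which token positions to feed for each window size. The sketch below shows only that bookkeeping, under assumptions: `window_positions` and `sweep_windows` are hypothetical names, and `score_fn` is a stand-in for whatever queries the AO on the chosen positions and grades its answer.

```python
from typing import Callable


def window_positions(seq_len: int, window: int) -> list[int]:
    """Indices of the final `window` tokens, clipped to the sequence length."""
    if seq_len <= 0 or window <= 0:
        raise ValueError("seq_len and window must be positive")
    return list(range(max(0, seq_len - window), seq_len))


def sweep_windows(
    seq_len: int,
    windows: list[int],
    score_fn: Callable[[list[int]], float],
) -> dict[int, float]:
    """Score the same example at several window sizes."""
    return {w: score_fn(window_positions(seq_len, w)) for w in windows}


# Stand-in scorer that just counts positions; a real run would call the AO and a grader.
results = sweep_windows(120, [1, 5, 20, 50], score_fn=len)
print(window_positions(120, 5))  # [115, 116, 117, 118, 119]
print(results)                   # {1: 1, 5: 5, 20: 20, 50: 50}
```

Keeping the example and question fixed while only the window changes makes the resulting curve easier to compare against a text-only baseline.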
For complex reasoning questions like "why is the model about to backtrack?", the relevant information appears to be spread across dozens or hundreds of tokens of internal computation, not concentrated in a single activation. These are also the questions where we believe AOs are most useful relative to alternatives, as answering a question like "what is the model uncertain about" using SAEs across dozens of tokens could be quite challenging.

Consensus sampling can mitigate hallucination. The AO can confidently hallucinate incorrect answers to open-ended questions. A simple mitigation is to sample multiple answers (we use 10 samples at temperature 1) and only trust answers where the samples agree. On the taboo secret-word extraction task, unfiltered single-token accuracy was 46.6%. Requiring consensus >= 0.8 retained 19.4% of examples at 94.3% precision, with a clean trade-off between precision and recall. This is a simple inference-time strategy that can significantly mitigate hallucination.

Experiments using post-training

We experimented heavily with DPO, inspired by Introspection Adapters and upcoming similar work, to increase performance on the following metrics:

- Instruction following
- Hallucination rate
- Vagueness

Results were hard to stabilize, and we frequently ran into mode collapse. We found it hard to make the judge correctly label “good” or “bad” activation oracle outputs, even with a rubric.

We also attempted using Dr. GRPO with the following rubric:

- Does the answer pass the “swap test”?
- Is the answer specific and falsifiable?
- Does the response add any meaningful insight?
- Is it not obviously wrong?
- Is the oracle following instructions?

Surprisingly, none of these led to significant gains in performance. Using RL with an LLM as a judge for Activation Oracles seems to be a bit doomed. Problems remain in getting an LLM judge to understand which AO outputs are desirable.
Environments with verifiable statements about activations are hard to come by, but we remain excited about them as a way to increase faithfulness and to achieve better performance than LLM-prompted explanations about the context.

Other differences compared to Karvonen et al.

We choose how many and which activations to feed as follows: a 20% chance to feed 1 activation, a 20% chance to feed 2, a 20% chance to feed 3 and a 50% chance to feed a random subset of the positions, since our input is long. This is notably different from Adam’s “sometimes contiguous n, sometimes 1 activation”. We ablated this and did not find a significant change in eval performance.

AOBench details

Higher is better for all of these scores.

- Activation sensitivity is a benchmark that checks whether outputs differ depending on context when activations are fed from the same tokens.
- Hallucination is a simple benchmark that checks whether the AO is obviously confabulating, based on the context the activation comes from.
- Missing info is the task from Jakkli et al. (SimpleBench-type questions where the model is missing information and doesn't know it; these may be impossible for AOs to pull out, though).
- Sycophancy is a benchmark based on the scruples dataset, where we check if the model can detect if someone is sycophantic.
- Taboo is the model organism from Bartosz Cywiński.
- PersonaQA consists of questions inquiring about the system prompt that is used, to approximate a steering vector.
- MMLU prediction is predicting the answer the model would give to an MMLU question if it were interrupted right now and forced to answer.
- Number prediction is predicting the numbers involved in an arithmetic equation, from activations of said equation.
- Vagueness measures how vague the responses are (Claude has terrible vibes for this, so this might not be very useful).
- Domain confusion benchmarks how often the model is completely wrong about the domain the activations are from (e.g. confusing ice cubes with hen houses).
- Backtracking is predicting if the model is about to backtrack.

We consider 3 narrow tasks Arya identified to illustrate these issues:

- Identifying why a model is about to backtrack (outputs are generally extremely vague)
- Identifying the number the model is about to produce in a math calculation (the AO keeps hallucinating the same couple of numbers)
- Identifying the current topic domain of the text on unrelated activations (the AO hallucinates irrelevant topics like hens)

Identify Persona and Detect Taboo are taken from the original AO paper. Our evaluation tasks can be found here.

We reiterate that evaluating AOs is hard, mainly due to controlling for text inversion, and that getting a judge to classify vagueness requires cautious inspection. We recommend qualitative analysis of your oracle.
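As one concrete illustration of the evaluation advice in this Appendix, the sketch below shows how a systematic bias towards "No" can pin fixed-threshold accuracy at chance while ROC AUC stays high. The scores are made-up logit differences (Yes logit minus No logit), not our data, and `roc_auc` and `accuracy_at_zero` are small hypothetical helpers written here so the example has no dependencies.

```python
def roc_auc(scores, labels):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at_zero(scores, labels):
    """Accuracy when answering 'Yes' exactly when the Yes logit exceeds the No logit."""
    return sum((s > 0) == bool(y) for s, y in zip(scores, labels)) / len(labels)


# Made-up Yes-minus-No logit differences: every score is negative (biased towards "No"),
# but truly sycophantic examples (label 1) still tend to score higher.
scores = [-0.5, -1.0, -1.5, -0.8, -2.0, -2.5, -3.0, -1.2]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(accuracy_at_zero(scores, labels))  # 0.5
print(roc_auc(scores, labels))           # 0.9375
```

Because AUC only depends on the ranking of the scores, shifting every logit difference by a constant, which is roughly what a prompt rewording does to Yes/No calibration, leaves it unchanged.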


Source

https://www.lesswrong.com/posts/heXwuDRfbQQgB5JLP/building-better-activation-oracles