Scalable Circuit Learning for Interpreting Large Language Models

A prominent research direction in mechanistic interpretability is learning sparse circuits over LLM components to reveal how they jointly produce model behavior. However, raw neurons are polysemantic, making learned circuits hard to interpret. Sparse autoencoder (SAE) features alleviate this, but th...

Source

http://arxiv.org/abs/2606.16939v1