Structural Inference: Interpreting Small Language Models with Susceptibilities

Garrett Baker*, George Wang*, Jesse Hoogland, Vinayak Pathak, Daniel Murfet
Timaeus
*Equal contribution.
April 25, 2025

Abstract

We develop a linear response framework for interpretability that treats a neural network as a Bayesian statistical mechanical system. A small perturbation of the data distribution, for example shifting the Pile toward GitHub or legal text, induces a first-order change in the posterior expectation of an observable localized on a chosen component of the network. The resulting susceptibility can be estimated efficiently with samples drawn by stochastic gradient Langevin dynamics (SGLD) from the local posterior, and it factorizes into signed, per-token contributions that serve as attribution scores. We combine these susceptibilities into a response matrix whose low-rank structure separates functional modules such as multigram and induction heads in a 3M-parameter transformer.
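
As a concrete sketch of the estimator described above, the snippet below implements the linear-response identity in Python: to first order, the change in a posterior expectation under a perturbation of the data distribution is a covariance between the observable and the perturbing potential, evaluated over weight samples drawn by SGLD from the local posterior. The function name, the sign convention, and the n*beta scaling are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def estimate_susceptibility(obs_samples, pot_samples, n_beta=1.0):
    # First-order (linear-response) estimate of d<O>/d(eps) at eps = 0.
    #
    # obs_samples : O(w_k) at SGLD draws w_k from the local posterior,
    #               e.g. a loss term localized on one attention head.
    # pot_samples : V(w_k) at the same draws, e.g. the mean loss on the
    #               shifted data (GitHub, legal text) minus the mean
    #               loss on the base distribution.
    # n_beta      : effective sample-size / inverse-temperature scaling
    #               (assumed convention; the paper fixes its own
    #               normalization).
    O = np.asarray(obs_samples, dtype=float)
    V = np.asarray(pot_samples, dtype=float)
    # The susceptibility is (minus) the posterior covariance of the
    # observable with the perturbing potential.
    return -n_beta * np.mean((O - O.mean()) * (V - V.mean()))

Because V is an average of per-token losses, the covariance splits into a signed contribution from each token, which is the per-token attribution score mentioned in the abstract; stacking these estimates over many (component, perturbation) pairs yields the response matrix.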

Cite as

@article{baker2025structural,
  title = {Structural Inference: Interpreting Small Language Models with Susceptibilities},
  author = {Garrett Baker and George Wang and Jesse Hoogland and Vinayak Pathak and Daniel Murfet},
  year = {2025},
  eprint = {2504.18274},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2504.18274}
}