Resources

Alignment

Basics

Interpretability

Bonus

  • If you want to get a higher-level overview of the alignment landscape, check out Alignment 101 and Alignment 201 by Richard Ngo and BlueDot Impact.
  • If you want more material on learning ML, see the ARENA program.
  • If you still haven’t had enough, check out the metauni AI-safety seminar reading list.

High-level

Developmental interpretability draws on ideas and methods from a range of areas of mathematics, statistics and the sciences, but at the moment the key techniques come from Singular Learning Theory (SLT) and, to a lesser extent, developmental biology and statistical physics. The readings focus on SLT.

Start here:

  • Towards Developmental Interpretability: introduces the idea of studying the development of neural networks over training using ideas from SLT and various parts of science. Read this to get a broad understanding of the relevance of these techniques to alignment.
  • Distilling Singular Learning Theory 0-4 by Liam Carroll: introduces SLT and explains what it says about phases and phase transitions (in the sense of the Bayesian learning process).
  • Watanabe (2022): a good survey of the major results of SLT, by the master himself.
  • Watanabe’s Keynote: Watanabe makes the case for AI risk from a different perspective to the one you’ll usually see.
  • SLT High 3: The Learning Coefficient provides some intuitions for how to think about the learning coefficient.
  • QD 1 distillation: You’re Measuring Model Complexity Wrong by Jesse Hoogland and Stan van Wingerden explains why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value.
  • SLT High 1: The Logic of Phase Transitions explains how to apply the free energy formula in practice to reason about the singular learning process.
  • TMS 1 distillation by Liam Carroll and Edmund Lau shows the singular learning process and learning coefficient in action, in a small toy model introduced by Anthropic.
  • Open Problems in DevInterp by Daniel Murfet surveys the current state of affairs for devinterp and our broader research agenda as of November 2023.
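As a rough orientation for the readings above on the free energy formula and the learning coefficient, the central asymptotic expansion of SLT can be written as below. The notation is an assumption chosen for illustration and may differ from the individual readings: F_n is the Bayesian free energy, L_n the empirical loss over n samples, w_0 an optimal parameter, λ the learning coefficient and m its multiplicity.

```latex
F_n = n L_n(w_0) + \lambda \log n - (m - 1) \log \log n + O_p(1)
```

Informally, when two regions of parameter space are compared, the posterior favours the one with the lower free energy. A region with a slightly higher loss but a smaller λ can therefore dominate at small n and lose out as n grows, which is the kind of trade-off the phase-transition readings reason about.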

Low-level

For developmental interpretability and application of SLT to deep learning:

  • QD 1 introduces local learning coefficient estimation using SGLD and localizing priors. This is the basis for the geometric probes we are developing and plan to develop.
  • TMS 1 shows that we can use the learning coefficient to reason about the development of neural networks in the context of Anthropic’s Toy Models of Superposition.
  • ICL 1 (upcoming) shows that the development of neural networks is organized into discrete stages that we can detect with local learning coefficient estimation and essential dynamics. It explores two settings: O(50k)-parameter transformers trained to perform linear regression and O(5M)-parameter language models trained on internet text.
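To give a feel for what local learning coefficient estimation with SGLD and a localizing prior involves, here is a minimal sketch on a two-parameter toy model. It is not the authors' implementation. The toy model, the function names (`loss`, `grad`, `estimate_llc`) and all hyperparameter values are assumptions made for illustration. The sketch samples from a posterior tempered at inverse temperature β = 1/log n and localized around a point w* by a Gaussian term of strength γ, then forms the estimate nβ(E[L_n(w)] − L_n(w*)).

```python
import numpy as np

# Toy singular model: y = a * b * x, with true product a * b = 0.
# All names and hyperparameters here are hypothetical, for illustration only.

def loss(w, x, y):
    """Empirical loss L_n(w): half the mean squared error."""
    a, b = w
    return 0.5 * np.mean((y - a * b * x) ** 2)

def grad(w, x, y):
    """Analytic gradient of the loss with respect to w = (a, b)."""
    a, b = w
    dl_dc = -np.mean((y - a * b * x) * x)  # derivative w.r.t. c = a * b
    return np.array([dl_dc * b, dl_dc * a])

def estimate_llc(w_star, x, y, eps=1e-4, gamma=100.0,
                 steps=5000, burn_in=1000, batch=64, seed=0):
    """Estimate the local learning coefficient at w_star with SGLD."""
    rng = np.random.default_rng(seed)
    n = len(x)
    beta = 1.0 / np.log(n)  # inverse temperature, assumed to be 1 / log n
    w_star = np.asarray(w_star, dtype=float)
    w = w_star.copy()
    sampled = []
    for t in range(steps):
        idx = rng.integers(0, n, size=batch)  # minibatch indices
        # Drift of the tempered log-posterior plus the localizing term.
        drift = n * beta * grad(w, x[idx], y[idx]) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + np.sqrt(eps) * rng.normal(size=w.shape)
        if t >= burn_in:
            sampled.append(loss(w, x, y))  # full-data loss at the sample
    return n * beta * (np.mean(sampled) - loss(w_star, x, y))

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = np.zeros_like(x)  # noiseless targets, so w* = (0, 0) has zero loss
llc_hat = estimate_llc([0.0, 0.0], x, y)
print(f"estimated LLC at the origin: {llc_hat:.4f}")
```

The number printed depends strongly on the step size, the localization strength and the chain length, so treat it as illustrative rather than as a calibrated estimate. A strong localizing term in particular keeps the chain close to w* and shrinks the result.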

Textbooks

The textbooks in Singular Learning Theory are:

Bonus

If you’re interested in more, there are many talks on the metauni seminar page and from the two SLT conferences.

Glossary

Some acronyms (sorry there are so many):

  • QD 1: “Quantifying degeneracy in singular models via the learning coefficient” by Edmund Lau, Daniel Murfet and Susan Wei.
  • RLCT: the real log canonical threshold
  • (L)LC: the (local) learning coefficient
  • SLT: Singular learning theory
  • GPS: Geometry of Program Synthesis
  • TMS: Toy Models of Superposition
  • TMS 1: “Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition” by Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet.
  • SPL: Statistical physics of learning
  • DLN: Deep linear networks
  • ICL: In-context learning
  • ICL 1: “Development of in-context learning in transformers” by Jesse Hoogland, George Wang, Liam Carroll, Matthew Farrugia-Roberts, Susan Wei, and Daniel Murfet (upcoming)