# Resources

### Alignment

#### Basics

- AI Alignment Metastrategy (Kosoy 2023): provides a strong overview of the different philosophical strands of AI safety research.
- Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger et al. 2019): explains how dangerous behaviors could arise naturally in capable systems trained by gradient descent, and introduces the idea of deceptive alignment.

#### Interpretability

- Zoom In: An Introduction to Circuits (Olah et al. 2020): makes the case for interpretability as a science.
- A Transparency and Interpretability Tech Tree (Hubinger 2022): makes the case for interpretability contributing to alignment.
- In-Context Learning and Induction Heads (Olsson et al. 2022): establishes a link between high-level changes in model behavior (in-context learning) and structural changes (induction heads).
- Toy Models of Superposition (Elhage et al. 2022): describes the problem of “superposition” in interpretability.
- A Mathematical Framework for Transformer Circuits (Elhage et al. 2021): if you want to understand how transformers compute, you need to be fluent in how attention works.
- Formal Algorithms for Transformers (Phuong and Hutter 2022): for precise definitions of the ingredients of transformers, often difficult to extract from other literature.
- Progress measures for grokking via mechanistic interpretability (Nanda et al. 2023): one of the most in-depth examples of reverse-engineering the algorithm learned by a neural network.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Wang et al. 2022): interpretability tools can be successfully applied to large(ish) models.

#### Bonus

- If you want to get a higher-level overview of the alignment landscape, check out Alignment 101 and Alignment 201 by Richard Ngo and BlueDot Impact.
- If you want more material on learning ML, see the ARENA program.
- If you still haven’t had enough, check out the metauni AI Safety seminar reading list.

### High-level

Developmental interpretability draws on ideas and methods from a range of areas of mathematics, statistics, and the sciences, but at the moment the key techniques come from Singular Learning Theory (SLT) and, to a lesser extent, from developmental biology and statistical physics. The readings focus on SLT.

Start here:

- Towards Developmental Interpretability: introduces the idea of studying the development of neural networks over training using ideas from SLT and various parts of science. Read this to get a broad understanding of the relevance of these techniques to alignment.
- Distilling Singular Learning Theory 0-4 by Liam Carroll introduces SLT and explains what it says about phases and phase transitions (in the sense of the Bayesian learning process).
- Watanabe (2022): a good survey of the major results of SLT, by the master himself.
- Watanabe’s Keynote: Watanabe makes the case for AI risk from a different perspective to the one you’ll usually see.
- SLT High 3: The Learning Coefficient provides some intuitions for how to think about the learning coefficient.
- QD 1 distillation: You’re Measuring Model Complexity Wrong by Jesse Hoogland and Stan van Wingerden explains why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value.
- SLT High 1: The Logic of Phase Transitions explains how to apply the free energy formula in practice to reason about the singular learning process.
- TMS 1 distillation by Liam Carroll and Edmund Lau shows the singular learning process and learning coefficient in action, in a small toy model introduced by Anthropic.
- Open Problems in DevInterp by Daniel Murfet surveys the current state of affairs for devinterp and our broader research agenda as of November 2023.
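For quick reference alongside these readings, the free energy formula is Watanabe's asymptotic expansion of the Bayesian free energy (the negative log of the marginal likelihood), stated here in standard SLT notation:

```latex
F_n = -\log \int e^{-n L_n(w)} \, \varphi(w) \, dw
    = n L_n(w_0) + \lambda \log n + O_p(\log \log n)
```

Here $L_n$ is the empirical loss (negative log likelihood), $\varphi$ the prior, $w_0$ the optimal parameter, and $\lambda$ the learning coefficient (RLCT). The $\lambda \log n$ term is what makes the learning coefficient, rather than parameter count, the quantity governing Bayesian model selection in singular models.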

### Low-level

For developmental interpretability and application of SLT to deep learning:

- QD 1 introduces local learning coefficient estimation using SGLD and localizing priors. This is the basis for the geometric probes we are developing and those we plan to develop.
- TMS 1 shows that we can use the learning coefficient to reason about the development of neural networks in the context of Anthropic’s Toy Models of Superposition.
- ICL 1 (upcoming) shows that the development of neural networks is organized into discrete stages that we can detect with local learning coefficient estimation and essential dynamics. It explores two settings: O(50k)-parameter transformers trained to perform linear regression and O(5m)-parameter language models trained on internet text.
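To make the estimation idea concrete, here is a minimal illustrative sketch (not the actual estimator from QD 1 or any released codebase) of local learning coefficient estimation via SGLD with a quadratic localizing term. It uses the estimator $\hat\lambda = n\beta\,(\mathbb{E}[L_n(w)] - L_n(w^*))$ with $\beta = 1/\log n$; all function names and hyperparameters below are invented for the example:

```python
import numpy as np

def estimate_llc(loss_fn, grad_fn, w_star, n, num_steps=5000, eps=2e-4,
                 gamma=100.0, seed=0):
    """Sketch of local learning coefficient (LLC) estimation at a minimum w_star.

    Runs SGLD on the localized tempered posterior
    exp(-n * beta * L_n(w) - (gamma / 2) * |w - w_star|^2) with beta = 1/log(n),
    then returns n * beta * (mean loss along the chain - L_n(w_star)).
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w = w_star.copy()
    losses = []
    for _ in range(num_steps):
        # SGLD step: half-step down the gradient of the localized potential,
        # plus Gaussian noise of matching scale.
        drift = n * beta * grad_fn(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + rng.normal(size=w.shape) * np.sqrt(eps)
        losses.append(loss_fn(w))
    return n * beta * (np.mean(losses) - loss_fn(w_star))

# Toy check: a quadratic loss that ignores one coordinate is more degenerate
# than a full-rank quadratic, so its estimated LLC should be smaller.
w0 = np.zeros(2)
llc_degenerate = estimate_llc(lambda w: w[0] ** 2,
                              lambda w: np.array([2 * w[0], 0.0]), w0, n=1000)
llc_full = estimate_llc(lambda w: float(w @ w), lambda w: 2 * w, w0, n=1000)
```

The localizing term keeps the chain near $w^*$ so the estimate reflects local rather than global geometry; degenerate (flat) directions inflate the expected loss less, which is exactly what lowers the estimated learning coefficient.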

### Textbooks

The standard textbooks on Singular Learning Theory are:

- Sumio Watanabe “Algebraic Geometry and Statistical Learning Theory” 2009 (Gray book)
- Sumio Watanabe “Mathematical Theory of Bayesian Statistics” 2018 (Green book)

### Bonus

If you’re interested in more, there are many talks on the metauni seminar page and from the two SLT conferences.

## Glossary

Some acronyms (sorry there are so many):

- **QD 1**: “Quantifying degeneracy in singular models via the learning coefficient” by Edmund Lau, Daniel Murfet and Susan Wei.
- **RLCT**: the real log canonical threshold
- **(L)LC**: the (local) learning coefficient
- **SLT**: Singular learning theory
- **GPS**: Geometry of Program Synthesis
- **TMS**: Toy Models of Superposition
- **TMS 1**: “Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition” by Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet.
- **SPL**: Statistical physics of learning
- **DLN**: Deep linear networks
- **ICL**: In-context learning
- **ICL 1**: “Development of in-context learning in transformers” by Jesse Hoogland, George Wang, Liam Carroll, Matthew Farrugia-Roberts, Susan Wei, and Daniel Murfet (upcoming)