# Resources

### SLT for Alignment

Start here:

- SLT for Alignment: Why does SLT matter at all for alignment? (Note: This is currently a Google Doc; a similar version will soon be posted to LessWrong.)
- Towards Developmental Interpretability: Why study how neural networks change over training?

### SLT

Start here: Dialogue introduction to SLT

I want to learn theoretical SLT:

- Assuming you have a background in mathematics (esp. algebraic geometry), physics, or statistical learning theory…
- Read through the **Theoretical SLT** section.

I want to learn applied SLT:

- Assuming you have a background in ML…
- Read through the **LLC estimation** section.
- Play around with the **demo notebooks**.
- Read (at least) the distillations of the **DevInterp papers**.
- Start applying LLC estimation to your own models.

#### Theoretical SLT

Essential reading:

- Distilling Singular Learning Theory 0-4 (by Liam Carroll) introduces SLT and explains what it says about phases and phase transitions (in the sense of the Bayesian learning process).
- Singular Learning Theory: exercises (by Zach Furman). Reading is not enough. If you are serious about this, do the pen-and-paper exercises.

Advanced materials:

- Watanabe (2022): A good survey, by the master himself, of the major results of SLT.
- See the publications here

#### Textbooks

The textbooks in SLT are:

**The Grey Book**

*Sumio Watanabe “Algebraic Geometry and Statistical Learning Theory” 2009*

- This is where all the details of the proofs of the main results of SLT are contained. It is a research monograph distilling the results proven over more than a decade. This is not an easy book to read.
- Chapter 1 provides a coarse treatment of the underlying proof ideas and mechanics.
- Chapters 2-5: The results of SLT depend on many results from other fields of mathematics (algebraic geometry, distribution theory, manifolds, empirical processes, etc.). The book gives some background in each of these fields rather quickly. Scattered through these introductions is some material on how these fields relate to the core results in SLT.
- Chapter 6 contains the main proofs of SLT.
- Chapter 7 contains applications of the main results and examples of various learning phenomena in singular models.

**The Green Book**

*Sumio Watanabe “Mathematical Theory of Bayesian Statistics” 2018*

- This more recent book is much more focused on learning in singular models (esp. Bayesian learning).
- There are many exercises at the end of each chapter.
- This is also where Watanabe handles the non-realisable case. This requires the introduction of a new technical condition known as “relatively finite variance”.
- While not recapitulating the full proof given in the Grey Book, the Green Book does go through slightly different formulations of the theory and, by assuming some technical results in the Grey Book, it walks through the proofs of most results.

There is also an **exercise textbook:**

*Joe Suzuki, “WAIC and WBIC with R Stan: 100 Exercises for Building Logic” 2019*

#### Applied/Experimental SLT

#### LLC Estimation

Currently, the key experimental technique in applying SLT to real-world models is local learning coefficient (LLC) estimation, introduced in Lau et al. (2023).

- Quantifying degeneracy in singular models via the learning coefficient (Lau et al. 2023) introduces the *local* learning coefficient (LLC) along with an SGLD-based *estimator* for the LLC.
  - [Distillation] You’re Measuring Model Complexity Wrong by Jesse Hoogland and Stan van Wingerden explains why you should care about model complexity, why the local learning coefficient is arguably the correct measure of model complexity, and how to estimate its value.

- Estimating the local learning coefficient at scale (Furman & Lau 2024) is a follow-up to Lau et al. 2023 that tests how accurate LLC estimation is in the setting of deep linear networks (DLNs).
- (Optional) SLT High 3: The Learning Coefficient provides some intuitions for how to think about the learning coefficient.

**Putting it in practice:** Once you’ve read the above materials, get some hands-on practice with the example notebooks in devinterp, starting with this introductory notebook.
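To make the estimator concrete before opening the notebooks, here is a minimal sketch of the LLC estimate from Lau et al. (2023), namely nβ(E[L(w)] − L(w*)) with β = 1/log n, where the expectation is taken over a sampler localized around w*. This is not the devinterp API: `estimate_llc`, `quadratic_loss`, and `quadratic_grad` are hypothetical names, the hyperparameters are illustrative, and the sampler uses full-batch gradients (plain Langevin dynamics) rather than the minibatch gradients of SGLD. The toy loss is a 2-dimensional quadratic, a regular model whose learning coefficient is d/2 = 1.

```python
import numpy as np


def estimate_llc(loss, grad, w_star, n, eps=1e-3, gamma=1.0,
                 num_steps=20_000, burn_in=1_000, seed=0):
    """Toy LLC estimate: n * beta * (E[L(w)] - L(w_star)), with beta = 1 / log(n).

    Assumption: `loss` and `grad` are full-batch, so this is Langevin dynamics
    rather than true (minibatch) SGLD.
    """
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w_star = np.asarray(w_star, dtype=float)
    w = w_star.copy()
    losses = []
    for step in range(burn_in + num_steps):
        # Drift: tempered loss gradient plus a pull back towards w_star.
        drift = -n * beta * grad(w) - gamma * (w - w_star)
        # Langevin step: half-step along the drift plus Gaussian noise of variance eps.
        w = w + 0.5 * eps * drift + np.sqrt(eps) * rng.standard_normal(w.shape)
        if step >= burn_in:
            losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(w_star))


def quadratic_loss(w):
    # Regular (non-singular) toy loss with its minimum at the origin.
    return 0.5 * float(np.sum(w ** 2))


def quadratic_grad(w):
    return w


llc_hat = estimate_llc(quadratic_loss, quadratic_grad, w_star=np.zeros(2), n=1000)
print(f"estimated LLC: {llc_hat:.2f} (regular-model value d/2 = 1.0)")
```

The estimate should land near 1, with a small bias from the localization strength `gamma` and the finite step size `eps`. On real networks the choice of `eps`, `gamma`, and chain length matters a great deal, which is what the notebooks walk through.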

#### Developmental interpretability

Developmental interpretability proposes to study changes in neural network structure over the course of training (rather than trying to interpret isolated snapshots). This draws on ideas and methods from a range of areas of mathematics, statistics, and the (biological) sciences.

At the moment, the key techniques (chiefly LLC estimation applied over the course of training) come from singular learning theory (SLT) and, to a lesser extent, developmental biology and statistical physics.

The readings focus on SLT:

- [Lecture] SLT High 1: The Logic of Phase Transitions explains how to apply the free energy formula in practice to reason about the singular learning process.
- Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition (Chen et al. 2023) studies Anthropic’s Toy Model of Superposition using SLT. This involves 1) demonstrating, in a theoretically tractable but non-trivial model, that knowing the leading-order terms in the free energy expansion does allow us to predict phases and phase transitions in Bayesian learning, and 2) demonstrating that we can use the learning coefficient to track the development of neural networks.
  - [Distillation] Growth and Form in a Toy Model of Superposition (by Liam Carroll and Edmund Lau)

- The Developmental Landscape of In-Context Learning (Hoogland et al. 2024) shows that the development of neural networks is organized into discrete stages that we can detect with local learning coefficient estimation and essential dynamics.
  - [Distillation] Stagewise Development in Neural Networks.
  - [ICML 2024 workshop version]. Start here before reading the arXiv version.
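For orientation, the free energy reasoning that the readings above rely on can be summarized as follows. This is a sketch of the standard leading-order expansion from SLT, with lower-order terms dropped. The notation is an assumption of this note rather than something fixed by the readings: $L_n$ is the empirical negative log-likelihood, $\varphi$ the prior, $W_\alpha$ a region of parameter space (a "phase"), $w_\alpha^*$ a minimizer of the loss in that region, and $\lambda_\alpha$ its local learning coefficient.

$$
F_n(W_\alpha) = -\log \int_{W_\alpha} e^{-n L_n(w)} \, \varphi(w) \, dw \;\approx\; n L_n(w_\alpha^*) + \lambda_\alpha \log n
$$

The Bayesian posterior concentrates on the phase with the lowest free energy. A phase 2 with lower loss but higher complexity ($L_n(w_2^*) < L_n(w_1^*)$, $\lambda_2 > \lambda_1$) is therefore preferred over phase 1 once

$$
n \left( L_n(w_1^*) - L_n(w_2^*) \right) > (\lambda_2 - \lambda_1) \log n ,
$$

which is the sense in which increasing $n$ can drive a phase transition from a simple, less accurate solution to a more complex, more accurate one.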

#### Bonus

- Check out metauni, a weekly seminar that runs in Roblox and features dozens of seminars on SLT.
- Check out lectures from the two SLT conferences.

### Alignment

#### Basics

- AI Alignment Metastrategy (Kosoy 2023): provides a strong overview of the different philosophical strands of AI safety research.
- Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger et al. 2019): explains how dangerous behaviors could arise naturally in capable systems trained by gradient descent, and introduces the idea of deceptive alignment.

#### Interpretability

- Zoom In: An Introduction to Circuits (Olah et al. 2020): makes the case for interpretability as a science.
- A Transparency and Interpretability Tech Tree (Hubinger 2022): makes the case for interpretability contributing to alignment.
- In-Context Learning and Induction Heads (Olsson et al. 2022): establishes a link between high-level changes in model behavior (in-context learning) and structural changes (induction heads).
- Toy Models of Superposition (Elhage et al. 2022): describes the problem of “superposition” in interpretability.
- A Mathematical Framework for Transformer Circuits (Elhage et al. 2021): if you want to understand how transformers compute, you need to be fluent with how attention works.
- Formal Algorithms for Transformers (Phuong and Hutter 2022): for precise definitions of the ingredients of transformers, often difficult to extract from other literature.
- Progress measures for grokking via mechanistic interpretability (Nanda et al. 2023): one of the most in-depth examples of reverse-engineering the algorithm learned by a neural network.
- Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Wang et al. 2022): interpretability tools can be successfully applied to large(ish) models.

#### Bonus

- If you want to get a higher-level overview of the alignment landscape, check out Alignment 101 and Alignment 201 by Richard Ngo and BlueDot Impact.
- If you want more material on learning ML, see the ARENA program.
- If you still haven’t had enough, check out the metauni AI safety seminar reading list.