Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition
Abstract
We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT). We derive a closed formula for the theoretical loss and, in the case of two hidden dimensions, discover that regular k-gons are critical points. We present supporting theory indicating that the local learning coefficient (a geometric invariant) of these k-gons determines phase transitions in the Bayesian posterior as a function of training sample size. We then show empirically that the same k-gon critical points also determine the behavior of SGD training. The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism. Specifically, we find that the learning process in TMS, be it through SGD or Bayesian learning, can be characterized by a journey through parameter space from regions of high loss and low complexity to regions of low loss and high complexity.
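For readers unfamiliar with TMS: the model under study is a tied-weights ReLU autoencoder in the style of Elhage et al.'s Toy Models of Superposition. The sketch below is ours, not the paper's experimental code; the dimensions, learning rate, and one-hot input distribution (a common idealization of the high-sparsity regime in which the k-gon critical points arise) are illustrative assumptions.

```python
import torch

n, m = 6, 2  # n input features squeezed through an m-dimensional bottleneck

W = torch.randn(m, n, requires_grad=True)  # tied embedding/unembedding matrix
b = torch.zeros(n, requires_grad=True)     # output bias

def tms(x):
    # Reconstruct x through the bottleneck: ReLU(W^T W x + b).
    return torch.relu(W.T @ (W @ x) + b)

def empirical_loss(batch):
    # Mean squared reconstruction error over the batch.
    return sum(torch.sum((x - tms(x)) ** 2) for x in batch) / len(batch)

# One-hot inputs: the high-sparsity limit in which, for m = 2, the columns
# of W can settle into regular k-gon configurations.
batch = [torch.eye(n)[i] for i in range(n)]

opt = torch.optim.SGD([W, b], lr=1e-2)
for step in range(5000):
    opt.zero_grad()
    empirical_loss(batch).backward()
    opt.step()
```

Plotting the columns of W in the plane during a run like this should make the dynamical transitions visible: the trajectory passes near k-gon critical points on its way from high-loss, low-complexity regions to low-loss, high-complexity ones.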
Main contributions:
- SLT works. SLT predicts the empirically observed phase transitions in a toy model of superposition even at moderate dataset sizes. This is a substantial piece of evidence that SLT is relevant to “real systems.”
- Dynamical phase transitions are related to Bayesian ones. In this setting, dynamical transitions (that occur over training time) have an “antecedent” in Bayesian transitions (that occur with increasing dataset size). The two types of transitions are related as hypothesized in the devinterp agenda.
- The LLC works. The estimated local learning coefficient (LLC) correctly describes phase transitions, resolving a major open question; a minimal sketch of the estimator follows this list.
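The LLC estimates rest on the SGLD-based estimator of Lau et al. Since the estimator is not reproduced above, here is a minimal sketch under our own conventions: loss_fn computes the empirical loss L_n at a flattened parameter vector, n is the training sample size, and the step size, localization strength, and chain length are illustrative (burn-in and minibatching are omitted for brevity).

```python
import math
import torch

def estimate_llc(loss_fn, w_star, n, eps=1e-5, gamma=100.0, steps=5000):
    # Inverse temperature at the WBIC scale.
    beta = 1.0 / math.log(n)
    w = w_star.detach().clone().requires_grad_(True)
    loss_trace = []
    for _ in range(steps):
        loss = loss_fn(w)
        (grad,) = torch.autograd.grad(loss, w)
        with torch.no_grad():
            # Localized SGLD: descend the tempered loss, pull the chain
            # back toward w_star, and inject Gaussian noise.
            w -= (eps / 2) * (n * beta * grad + gamma * (w - w_star))
            w += math.sqrt(eps) * torch.randn_like(w)
        loss_trace.append(loss.item())
    posterior_avg = sum(loss_trace) / len(loss_trace)
    # LLC estimate: rescaled gap between the posterior-average loss and
    # the loss at the critical point itself.
    return n * beta * (posterior_avg - loss_fn(w_star).item())
```

Tracking this estimate along an SGD trajectory is what makes the "high loss and low complexity to low loss and high complexity" picture from the abstract quantitative.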
See the accompanying distillation.
Cite as
@article{chen2023dynamical,
title = {Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition},
author = {Zhongtian Chen and Edmund Lau and Jake Mendel and Susan Wei and Daniel Murfet},
year = {2023},
abstract = {We investigate phase transitions in a Toy Model of Superposition (TMS) using Singular Learning Theory (SLT). We derive a closed formula for the theoretical loss and, in the case of two hidden dimensions, discover that regular k-gons are critical points. We present supporting theory indicating that the local learning coefficient (a geometric invariant) of these k-gons determines phase transitions in the Bayesian posterior as a function of training sample size. We then show empirically that the same k-gon critical points also determine the behavior of SGD training. The picture that emerges adds evidence to the conjecture that the SGD learning trajectory is subject to a sequential learning mechanism. Specifically, we find that the learning process in TMS, be it through SGD or Bayesian learning, can be characterized by a journey through parameter space from regions of high loss and low complexity to regions of low loss and high complexity.},
eprint = {2310.06301},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2310.06301}
}