You Are What You Eat – AI Alignment Requires Understanding How Data Shapes Structure and Generalisation

Simon Pepin Lehalleur =
University of Amsterdam
Jesse Hoogland =
Timaeus
Matthew Farrugia-Roberts =
University of Oxford
Susan Wei
Monash University
Alexander Gietelink Oldenziel
Timaeus & University College London
George Wang
Timaeus
Stan van Wingerden
Timaeus
Zach Furman
University of Melbourne
Liam Carroll
Timaeus & Gradient Institute
Daniel Murfet
University of Melbourne
February 8, 2025

Abstract

In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.

Cite as

@article{lehalleur2025you,
  title = {You Are What You Eat – AI Alignment Requires Understanding How Data Shapes Structure and Generalisation},
  author = {Simon Pepin Lehalleur and Jesse Hoogland and Matthew Farrugia-Roberts and Susan Wei and Alexander Gietelink Oldenziel and George Wang and Stan van Wingerden and Zach Furman and Liam Carroll and Daniel Murfet},
  year = {2025},
  abstract = {In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.},
  eprint = {2502.05475},
  archivePrefix = {arXiv},
  url = {https://arxiv.org/abs/2502.05475}
}