DSLT 3. Neural Networks are Singular
Read on LessWrongTLDR; This is the third main post of Distilling Singular Learning Theory which is introduced in DSLT0. I explain that neural networks are singular models because of the symmetries in parameter space that produce the same function, and introduce a toy two layer ReLU neural network setup where these symmetries can be perfectly classified. I provide motivating examples of each kind of symmetry, with particular emphasis on the non-generic node-degeneracy and orientation-reversing symmetries that give rise to interesting phases to be studied in DSLT4*. *
As we discussed in DSLT2, singular models have the capacity to generalise well because the effective dimension of a singular model, as measured by the RLCT, can be less than half the dimension of parameter space. With this in mind, it should be no surprise that neural networks are indeed singular models, but up until this point we have not exactly explained what feature they possess that makes them singular. In this post, we will explain that in essence:
**Neural networks are singular because there are often ways to vary their parameters without changing the function they compute. **
In the case where the model and truth are both defined by similar neural network architectures, this fact means that the set of true parameters \(W_0\) is non-trivial (i.e. bigger than the regular case where it is a single point), and often possesses many symmetries. This directly implies that neural networks are singular models.
The primary purpose of this post is to show with examples why neural networks are singular, and classify the set of true parameters \(W_0\) in the case where the model and truth are simple two layer feedforward ReLU networks. In doing so, we will lay the groundwork for understanding the phases present in the setup so that we can then study relevant phase transitions in DSLT4. Feel free to jump ahead to the slightly more exciting DSLT4 Phase Transitions in Neural Networks and refer back to this post as needed.
Outline of Classification
To understand the different regions that minimise the free energy (and thus, as we’ll see in DSLT4, the phases), one needs to first understand the singularities in the set of optimal parameters of \(K(w)\).
In the realisable regression case with a model neural network \(f(x,w)\) and true neural network defined by \(f_0(x) = f(x,w^{(0)})\) for some \(w^{(0)} \in W\), the set of true parameters has the form 1
\[W_0 = {w \in W ,|,f(x,w) = f_0(x) } ,.\]
Thus, classifying the true parameters is a matter of establishing which parameters \(w \in W\) yield functional equivalence between the model and the truth \(f(x,w) = f_0(x)\). The property of being singular is specific to a model class \(f(x,w)\), regardless of the underlying truth. But, classifying \(W_0\) in the realisable case is a convenient way of studying what functionally equivalent symmetries exist for a particular model class.
Neural networks have been shown to satisfy a number of different symmetries of functional equivalence across a range of activation functions and architectures, which we will elaborate on throughout the post. Unsurprisingly, the nonlinearity of the activation function plays a central role in governing these symmetries. In general, then, deep neural networks are highly singular.
In this post we are going to explore a full characterisation of the symmetries of \(W_0\) when the model is a two layer feedforward ReLU neural networks with \(d\) hidden nodes, and the truth is the same architecture but with \(m\leq d\) nodes. Though you would never use such a basic model in real deep learning, the simplicity of this class of network allows us to study \(W_0\) with full precision. We will see that:
- If the model and truth have the same number of nodes,\(m=d\): There are three forms of symmetry of \(W_0\):
- Scaling symmetry of the incoming and outgoing weights to any node.
- Permutation symmetry of the hidden nodes in a layer.
- Orientation reversing symmetry of the weights, only when some subset of weights sum to zero (i.e. “annihilate” one another).
- If the model has more nodes than the truth,\(m<d\): Without loss of generality, the first \(m\) nodes of the model must have the same symmetries as in the first case. Then each excess node \(i \in {m+1,\dots, d}\) is either
- Degenerate, meaning its total weight (gradient) is 0 (thus the node is always constant).
- Or it has the same activation boundary as another already in the model such that the weights sum to the total gradient in a region 2.
In [Carroll, Chapter 4], I give rigorous proofs that in both cases, \(W_0\) is classified by these symmetries, and these symmetries alone. The purpose of this post is not to repeat these proofs, but to provide the intuition for each of these symmetries. I have included a sketch of the full proof in the appendix of this post if you are more mathematically inclined.
Two layer Feedforward ReLU Neural Networks
Literature abounds on what neural networks are, so I will merely give the definition of the class we are going to study here and some related terminology for the discussion.
Defining the Networks and Terminology
Let \(W \subseteq \mathbb{R}^{4d+1}\) be a compact parameter space. We will let \([d] = {1,\dots, d}\) denote the set of hidden nodes in the first layer of our network, and \(\langle w_i, x\rangle\) denote the standard dot product between two vectors. Also recall that
\[\mathrm{ReLU}(x) = \begin{cases} x & \text{if} ; x \geq 0 \ 0 & \text{if} ; x <0 \end{cases} ,.\]
We let \(f : \mathbb{R}^2 \times W \to \mathbb{R}^1\) denote a two layer feedforward ReLU neural network with two inputs \(x_1, x_2\) and one output \(y\), defined by a parameter \(w \in W\). The function is given by
\[f(x,w) = c + \sum_{i=1}^d q_i \mathrm{ReLU}(\langle w_i, x\rangle + b_i)\]
where for each \(i\in[d]\):
- the first layer weights are \(w_i \in \mathbb{R}^2\) and the biases are \(b_i \in \mathbb{R}\)
- the second layer weights are \(q_i \in \mathbb{R}\) and the bias is \(c \in \mathbb{R}\).
These functions are simply piecewise affine functions (i.e. piecewise hyperplanes), and as such they have (relatively) easy topology to study. Before we give an example, we will briefly mention some key terminology.
Let \(f_w(x) =f(x,w)\) be defined by a fixed \(w \in W\). We say a particular node \(i \in [d]\) is degenerate in \(f_w\) if either of the weights are zero, so \(w_i =0\) or \(q_i=0\). 3
We say a non-degenerate node \(i\) is activated in some linear domain 4 \(U \subseteq \mathbb{R}^2\) when the ReLU is non-zero for all \(x \in U\) , that is,
\[\langle w_i, x \rangle + b_i = w_{i,1}x_1 + w_{i,2}x_2 + b_i > 0 ,.\]
The activation boundary associated to node \(i\) is thus the line
\[H_i = {x \in \mathbb{R}^2 , | , \langle w_i, x \rangle + b_i = 0 },.\]
One of the key accounting tools in the symmetry classification is identifying the foldsets of \(f_w\) (in the terminology of [PL19]), which are the regions where \(f_w\) is non-differentiable in \(x\), and noticing that these equate to the union of non-degenerate activation boundaries \(H_i\). Two functionally equivalent networks must then have the same foldsets since they define the same function, allowing us to compare the lines defined by \(H_i\).
Example - Feedforward ReLU Neural Networks are Piecewise Hyperplanes
Example 3.1: Consider the following two layer feedforward ReLU neural network:
\[\begin{aligned} f_w(x) &= \mathrm{ReLU}(x_1-1) + \mathrm{ReLU}(x_2-1) \ &\quad + \mathrm{ReLU}(-x_1-1) + \mathrm{ReLU}(-x_2-1) ,. \end{aligned}\]
defined by biases \(b_i=-1\) and \(c=0\), second layer weights \(q_i=1\), and first layer weights
\[w_1 = \begin{pmatrix} 1 \0 \end{pmatrix}, \quad w_2 = \begin{pmatrix} 0\1 \end{pmatrix}, \quad w_3 = \begin{pmatrix} -1\0 \end{pmatrix}, \quad w_4 = \begin{pmatrix} 0-1 \end{pmatrix} ,.\]
Its graphical structure and activation boundaries in the \((x_1, x_2)\) plane can be seen below:
The data of \(f_w(x)\) above.
Conceptually, it’s helpful to notice that when anchored on its corresponding activation boundary, each weight vector \(w_i\) “points” into its region of activation.
The Symmetries of Two Layer Feedforward ReLU Neural Networks
In this section I am going to provide some motivating examples of each kind of symmetry exhibited in two layer feedforward ReLU neural networks. To prove that this is the full set of symmetries in generality requires a bit more work, which we relegate to the appendix.
Scaling Inner and Outer Weights of a Node
The scaling symmetry of ReLU networks offers us our first window into why these models are singular. The key property is to notice that for any \(\alpha > 0\), the ReLU satisfies a scale invariance 5
\[\frac{1}{\alpha}\mathrm{ReLU}(\alpha x) = \mathrm{ReLU}(x) ,.\]
Say we had the simplest model possible with just one node:
\[f(x,w) = q_1 \mathrm{ReLU}(\langle w_1, x\rangle + b_1) + c,.\]
Then we could define an alternative parameter \(w’\) with
\[q_1’ = \frac{q_1}{\alpha}, \quad w_1’ = \alpha w_1, \quad b_1’ = \alpha b_1, \quad c’=c ,,\]
which gives functional equivalence because,
\[\begin{aligned} f(x,w’) &= q_1’ \mathrm{ReLU}(\langle w_1’, x\rangle + b_1’) + c’ \ &= \frac{q_1}{\alpha} \mathrm{ReLU}(\langle \alpha w_1, x \rangle + \alpha b_1 ) + c\ &= \frac{q_1}{\alpha} \mathrm{ReLU}(\alpha(\langle w_1, x\rangle + b_1))+ c \ &= q_1 \mathrm{ReLU}(\langle w_1, x\rangle + b_1)+ c \ &= f(x,w) ,. \end{aligned}\]
For a model with \(d\) hidden nodes, the same scaling symmetry applies to each individual node \(i\in[d]\) with a set of scaling factors \(\alpha_i>0\).
The fact that we can define such a \(w’\) for any set of positive scalars means that the Fisher information matrix of these models is degenerate at all points \(w \in W\). We prove this in generality in Appendix 1, but I’ll spell it out explicitly for a simple example here.
Example - Scaling Symmetry Induces a Degenerate Fisher Information Matrix
Example 3.2: It is worth taking a moment to recognise how this scaling symmetry affects the geometry of the loss landscape \(K(w)\). The mental model to have here is that it results in valleys in \(K(w)\), where the set of true parameters \(W_0\) is like a river on the valley floor. To see this, say we defined a model with parameter \(w=(w,q)\) and truth as:
\[f(x,w) = q\mathrm{ReLU}(wx),, \quad f_0(x) = \theta_0 \mathrm{ReLU}(x) ,,\]
where \(\theta_0>0\) is some fixed constant. If \(q(x)\) is uniform on \([-\sqrt{3}, \sqrt{3}]\) then it is easy to calculate that when \(w,q\geq0\) we have
\[K(w)=\left( wq - \theta_0\right)^2,, \quad \text{so}\quad W_0 = \left{(w,q) ,|, wq=\theta_0 \right} ,.\]
We can depict this valley and its effect on the posterior for \(\theta_0=\frac{1}{5}\):
Setting \(\theta_0=\frac{1}{5}\), we see that \(K(w)\) is a valley due to the scaling symmetry (left), thus there is no unique maximum a posterior (right). Remember that, up to a scaling factor, \(e^{-nK_n(w)}\) is the posterior when the prior \(\varphi(w)\) is uniform, and \(e^{-n K_n(w)} \approx e^{-n K(w)}\) for large \(n\) since \(\mathbb{E}[K_n(w)]=K(w)\).
Looking at this \(K(w)\), it’s easy to intuit that the Fisher information matrix \(I(w)\) is degenerate for all \(w\). But, for clarity, let me spell this out for the true parameters in the case where \(\theta_0=1\), so \(K(w)=(wq-1)^2\).
Remember that at true parameters the Fisher information matrix is just the Hessian, which in this case has the form
\[J(w) = \begin{pmatrix} 2q^2 & 4wq-2 \ 4wq-2 & 2w^2 \end{pmatrix} ,.\]
In particular, let \(w^{(0)} \in W_0\) be a fixed true parameter parameterised by a fixed \(\alpha>0\), so \(w^{(0)}=(\alpha, \frac{1}{\alpha})\). Then the Fisher information matrix has the form
\[I(w^{(0)}) = \begin{pmatrix} \frac{2}{\alpha^2} & 2 \ 2 & 2\alpha^2\end{pmatrix} ,.\]
\(\)Setting \(I_1(w^{(0)})\) and \(I_2(w^{(0)})\) to be the rows of the matrix, there is clearly a linear dependence relation
\[-\alpha^2I_1(w^{(0)}) + I_2(w^{(0)}) = 0\]
and since \(\alpha\) is arbitrary, this shows that all true parameters have degenerate Fisher information matrices and are thus singular.
Permutation of Nodes
This one is easy to see. If we have a model with \(d=2\) nodes,
\[f(x,w) = q_1 \mathrm{ReLU}(\langle w_1, x\rangle + b_1) + q_2 \mathrm{ReLU}(\langle w_2, x\rangle + b_2) + c,,\]
and we define a new model \(f(x,w’)\) where \(w’\) is a permutation of the nodes in \(f(x,w)\),
\[(w_1’, b_1’, q_1’) = (w_2, b_2, q_2),, \quad (w_2’, b_2’, q_2’) = (w_1, b_1, q_1), \quad \text{and} ,, c’ =c ,,\]
then
\[\begin{aligned} f(x,w’) &= q_1’ \mathrm{ReLU}(\langle w_1’, x\rangle + b_1’) + q_2’ \mathrm{ReLU}(\langle w_2’, x\rangle + b_2’) + c’ \ &= q_2 \mathrm{ReLU}(\langle w_2, x\rangle + b_2) + q_1 \mathrm{ReLU}(\langle w_1, x\rangle + b_1) + c \ &= f(x,w) ,. \end{aligned}\]
This easily generalises to \(d\) hidden nodes by taking any permutation \(\sigma \in S_d\) in the permutation group \(S_d\) and letting each node \(i’\) of \(f(x,w’)\) satisfy \(i’=\sigma(i)\), so
\[\begin{aligned} f(x,w) &= c + \sum_{i=1}^d q_i \mathrm{ReLU}(\langle w_i, x\rangle + b_i) \ &= c+ \sum_{i=1}^d q_{\sigma(i)} \mathrm{ReLU}(\langle w_{\sigma(i)}, x\rangle + b_{\sigma(i)}) = f(x,w’) ,. \end{aligned}\]
Permuting nodes induces functional equivalence, here depicted for \(\sigma=(1,3)(2,4)\).
Orientation Reversal
This one is a bit trickier to observe as the symmetry depends on a very specific condition of weight annihilation. Let’s look at a simple example first.
Motivating Example
Example 3.3: Consider a true distribution defined by a (one-input) feedforward ReLU given by
\[\begin{aligned} f_0(x) &= \mathrm{ReLU}(x-1) + \mathrm{ReLU}(-x-1) + 2 \ &= \begin{cases} -x+1 & x\leq -1\ 2 & -1\leq x \leq 1\ x+1 & x\geq 1 \end{cases} \end{aligned}\]
where \(w^{(0)}_1=1\), \(w^{(0)}_2=-1\), and the activation boundaries are \(H^{(0)}_1 = {x=1}\) and \(H^{(0)}_2 = {x=-1}\).
Surprisingly, though it may appear our linear regions and activation boundaries must uniquely define the function (up to the scaling and permutation symmetries), there is a particular symmetry that arises by reversing the orientation of the weights and first layer biases, and adjusting the total bias accordingly. When we say reverse the orientation, we mean negating their direction,
\[w_1 = - w_1^{(0)}=-1 \quad \text{and} \quad w_2 = -w_2^{(0)}=1 ,,\]
and ditto for the biases. If we adjust the total bias \(c\) accordingly, then following function
\[f(x,w) = \mathrm{ReLU}(-x+1) + \mathrm{ReLU}(x+1)\]
gives the same functional output!
Reversing the orientation of the true weights preserves this function because the true weights annihilate one another.
There is a very specific reason we can do this: in the middle region \(-1\leq x\leq 1\), both nodes are active and cancel out to give a constant function,
\[f(x,w) = (-x+1) + (x+1) = 2 ,,\]
because the total gradients of the underlying truth sum to zero, \(w_1^{(0)} + w_2^{(0)} =0\).
General case
Suppose the true network \(f_0(x)\) is defined by a fixed \(w^{(0)}=(w_{1}^{(0)}, \dots, b_1^{(0)},\dots q_1^{(0)}, \dots, c)\) for \(m\) nodes. If there is a set \(F \subseteq [m]\) of total gradients that sum to 0,
\[\sum_{i \in F} q_i^{(0)} w_i^{(0)} = 0\]
then the model can produce functional equivalence by reversing the orientation of those particular weights (associated to those activation boundaries), biases, and adjusting the total bias. In other words, modulo permutation and scaling symmetry, there is a functionally equivalent network to \(f_0(x)\) where the weights of every \(i \in F\) satisfy
\[w_i = -w_i^{(0)} ,.\]
We call the condition \(\sum_{i \in F} q_i^{(0)} w_i^{(0)} = 0\) weight annihilation.
In [Carroll21, \(\S\)4.5] we define \(m\)-symmetric networks where the weights are progressive rotations by the angle \(\frac{2\pi}{m}\), thus their total sum is zero. In DSLT4, we will study whether the posterior prefers configurations of weight-annihilation or not. (The answer is: not). 6
An \(m\)-symmetric network for \(m=3\) with \(q_i^{(0)}=1\) and \(c^{(0)}=0\). Both configurations, non-weight-cancellation (left) and weight-cancellation (right), are functionally equivalent since \(\sum_{i=1}^3 w_i^{(0)} = 0\). Here, weight cancellation refers to the configuration where all three nodes are active in the central linear domain, but cancel to give an effective gradient of zero there.
Node Degeneracy
This is possibly the most important symmetry of all: neural network models can have more nodes than they need to represent a particular function. In essence, this degeneracy is the reason that different regions of the loss-landscape \(K(w)\) of neural networks have fundamentally different accuracy-complexity tradeoffs. In other words, if the model has \(d\) nodes in the hidden layer available to it, then all possible subnetwork configurations with less than \(d\) nodes are also contained within the loss landscape. Thus, increasing the width of the network can only serve to increase the accuracy of these models, without sacrificing its ability to generalise, since the posterior will just prefer that number of hidden nodes with the best accuracy-complexity tradeoff.
Motivating Example
Example 3.4: Suppose we had a (one-input) true network given by
\[f_0(x) = \mathrm{ReLU}(x)\]
and our model had \(d=2\) nodes (with fixed biases \(b_1=b_2=c=0\) and outgoing weights \(q_1=q_2=1\)),
\[f(x,w) = \mathrm{ReLU}(w_1x) + \mathrm{ReLU}(w_2x) ,.\]
Since \(f_0(x)=0\) for \(x\leq 0\), both weights must be positive, \(w_1, w_2 \geq0\), to have any hope of being functionally equivalent. If \(f(x,w)=f_0(x)\), we are in one of two configurations:
One node is degenerate: Either \((w_1, w_2) = (1,0)\) or \((w_1, w_2)=(0,1)\), meaning
\[f(x,w) = \mathrm{ReLU}(1x) + \mathrm{ReLU}(0x) = \mathrm{ReLU}(x) = f_0(x),.\]
Both nodes are non-degenerate, but the total gradient is the same as the truth: So long as the weights satisfy
\[w_1 + w_2 = 1,,\]
for \(w_1, w_2 >0\), we will have functional equivalence since, setting \(w_2=1-w_1\),
\[\begin{aligned} f(x,w) &= \mathrm{ReLU}(w_1x) + \mathrm{ReLU}((1-w_1)x) \ &= \begin{cases} w_1x+ (1-w_1)x & x \geq 0 \ 0 & x \leq 0 \end{cases} \ &= \mathrm{ReLU}(x) \ &= f_0(x) ,.\end{aligned}\]
Node-degeneracies Correspond to Different Phases
We could of course encapsulate both of these configurations into the one statement that \(w_1+w_2=1\) for \(w_1, w_2 \geq 0\), but there is a key reason we have delineated them: they represent two different phases and have different geometry on \(K(w)\). Intuitively, the degenerate phase is a simpler model with less complexity, thus we expect it has a lower RLCT 7, and for the posterior to prefer it. In DSLT4 we will discuss phases in statistical learning more broadly, and display experimental evidence for this latter claim.
To foreshadow this, we can actually calculate \(K(w)\) for Example 3.4. Setting the prior \(q(x)\) to be uniform on \([-\sqrt{6}, \sqrt{6}]\) we find
\[K(w_1, w_2) = \begin{cases} (w_1+w_2-1)^2 & w_1, , w_2 \geq 0 \ w_1^2 + (w_2-1)^2 & w_1 \leq 0,, , w_2 \geq 0 \ (w_1 - 1)^2 + w_2^2 & w_1 \geq 0,, , w_2 \leq 0 \ (w_1+w_2)^2 + 1 & w_1, , w_2 \leq 0 \end{cases} ,.\]
\(K(w)\) for the above example with slightly wider bowls at the degenerate-node phases.
Notice how there are ever so slightly wider bowls at either end of the line \(w_1+w_2=1\), thus suggesting the posterior has more density at the degenerate phase \((w_1,w_2)=(0,1)\)(or vice versa). Intuitively, imagine a tiny ball being with random kinetic energy rolling around the bottom of the surface - it will spend more time in the ends since there is more catchment area. (Don’t take the physics analogy too seriously, though).
We see once again that the singularity structure of the minima has a big impact on the geometry of \(K(w)\), and therefore the posterior.
General Case
Suppose we have a truth \(f_0(x)\) and model \(f(x,w)\) that are both defined by two layer feedforward ReLU neural networks, where the model has \(d\) nodes and the truth has \(m\) nodes (assumed to all be non-degenerate and with distinct activation boundaries) such that \(m<d\). Then the model is overparameterised compared to the truth it is trying to model.
Performing the appropriate analysis (which we do in Appendix 2), one finds that:
- Without loss of generality (i.e. up to permutation symmetry), the first \(m\) nodes of \([d]\) must have the same activation boundaries as those in \(f_0(x)\), and satisfy the same scaling, permutation and orientation reversing symmetries as discussed above.
- For the remaining nodes \(i \in {m+1,\dots,d}\) of the model, each one either:
- Is degenerate, so \(w_i=0\) or \(q_i=0\), or;
- Shares the same activation boundary as one already in \([m]\) such that the total gradients sum to the correct gradient in each region. (In our above example, this is saying that necessarily \(w_1+w_2=1\) since these nodes share the same activation boundary).
The function \(f_0(x) = 2\mathrm{ReLU}(x_2-\frac{1}{3})\) can also be represented by a two node model where both nodes share the same boundary, \(f(x,w) = \mathrm{ReLU}(x-\frac{1}{3}) + \mathrm{ReLU}(x-\frac{1}{3})\).
In DSLT4 we will test which of the phases in the above figure is preferred by the posterior for this simple two layer feedforward ReLU neural networks setup.
Node degeneracy is the same as lossless network compressibility
The fact that neural networks can contain these node-degeneracies is well known and often goes under the guise of lossless network compressibility. There are many notions of compressibility, but the one that makes the most sense in our setup is to say that if the model has \(d>m\) hidden nodes compared to the truth, then it can be compressed to a network with only \(m\) hidden nodes and still produce the same input-output map.
For an excellent introduction to lossless network compressibility, see Farrugia-Roberts’ recent paper Computational Complexity of Detecting Proximity to Losslessly Compressible Neural Network Parameters, where he studies the problem for \(\tanh\)networks.
There are More True Parameters if the Input Domain is Bounded
Let me make an important remark here. In both of the above cases, we have considered the symmetries of \(W_0\) when the input domain of the model and the truth is all of \(\mathbb{R}^2\). As we explain in Appendix 2, this allows us to compare the gradients and biases of hyperplanes, similar to comparing polynomial coefficients, to make our conclusions. However, if the domain of the input prior \(q(x)\) is restricted to some open bounded domain \(\mathcal{Z} \subseteq \mathbb{R}^2\), there could in principle be more degeneracies and symmetries of \(W_0\), since the functional equivalence only needs to be on \(\mathcal{Z}\).
For example, consider a true network defined by \(f_0(x)=0\) and a single-node single-input model \(f(x,w) = q\mathrm{ReLU}(\langle w, x \rangle + b)\) defined on \(\mathcal{Z} = (-a,a)\), so \(q(x) = \frac{1}{2a}\mathbf{1}(-a< x < a)\). If the activation boundary falls outside of \(\mathcal{Z}\) and the vector \(w\) points away from \(\mathcal{Z}\), then any value of \(q, w,b\) satisfying these constraints would give \(f(x,w)=0\), thus there is an entirely new class of symmetry in \(W_0\).
Whilst important to keep this mind, we won’t discuss this any further as it opens up an entirely different can of worms.
Even if Singularities Occur with Probability Zero, they Affect Global Behaviour in Learning
I want to make a quick comment on the work of Phuong and Lampert in [PL19]. In this paper, they prove equivalent results to these for arbitrary depth feedforward ReLU neural networks (with non-increasing widths), but with a key distinction: they consider general models. In their words,
A sufficient condition for a network to be general with probability one is that the weights are sampled from a distribution with a density.
They then show that almost all feedforward ReLU networks with this architecture are general, and then show that general networks only satisfy scaling and permutation symmetries, thus excluding our orientation-reversing and degenerate node singularities since they occur on a set of measure zero. Importantly, this implies that almost all parameters \(w \in W\) have no degenerate nodes, or equivalently, no opportunity for lossless compression.
However, even though scaling and permutation symmetries may be the only generic symmetries (in the sense of measure theory) that occur with non-zero probability, SLT tells us that the singularities of \(K(w)\) have global effects on the loss landscape, as we discussed at length in DSLT2. If a parameter is near a non-generic singularity (i.e. one that occurs with probability zero), it computes a function that is almost identical to the one computed by that of a non-generic singularity. If we shift our language to that of compressibility of a network, SLT tells us that:
Just because a particular point \(w \in W\) sampled from a posterior (or, notionally, obtained via running SGD) is not directly compressible itself, that doesn’t mean that it isn’t extremely close to one that is.
In this sense, SLT tells us that to understand the geometry of the loss landscape, we need to consider singularities even though they are not generic points. As Watanabe says, *singularities contain knowledge. *
Appendix 1 - Formal Proof that Neural Networks are Singular
If, like me, you are mathematically inclined, you probably want to see a proof that these neural networks are, indeed, singular models, to tie together the various concepts and intuitions that we have built in this sequence so far. So let’s turn into math mode briefly.
Recall that the Fisher information matrix \(I(w)\) is degenerate if and only if the set
\[\left{ \frac{\partial}{\partial w_j} f(x,w) \right}_{j=1}^D\]
is linearly dependent. Here, \(\frac{\partial}{\partial w_j}\) refers to the partial derivative with respect to the \(j\)th component of the total parameter \(w \in W\), not to be confused with the specific weight vector \(w_j\) in the neural network definition. Thus, to prove that feedforward ReLU networks are singular, our task is to find this linear dependence relation. The scaling symmetry alone is enough for this.
Theorem: Given a two layer feedforward neural network \(f: \mathbb{R}^2 \times W \to \mathbb{R}\) with \(d\) hidden nodes, for any domain on which \(f\) is differentiable, \(f\) satisfies the differential equation for a fixed node \(i\in [d]\):
\[\left{ w_{i,1} \frac{\partial }{\partial w_{i,1}} + w_{i,2} \frac{\partial }{\partial w_{i,2}} + b_i \frac{\partial }{\partial b_i} - q_i \frac{\partial}{\partial q_i} \right}f = 0 ,.\]
Proof: Since \(\frac{d}{dx} \mathrm{ReLU}(x) = \mathbf{1}{x>0}\), and letting \(a_i = \langle w_i, x\rangle + b_i\), the set of derivatives with respect to our parameters are
\[\frac{\partial f}{\partial w_{1,k}} = q_1 x_k \mathbf{1}(a_i >0) ,, \quad \frac{\partial f}{\partial b_i} = q_1 \mathbf{1}(a_i >0),, \quad \frac{\partial f}{\partial q_i} = \mathrm{ReLU}(a_i) ,,\]
and so since we can write \(\mathrm{ReLU}(a_i) = a_i \mathbf{1}(a_i >0)\) we have
\[\begin{aligned} &\left{ w_{i,1} \frac{\partial }{\partial w_{i,1}} + w_{i,2} \frac{\partial }{\partial w_{i,2}} + b_i \frac{\partial }{\partial b_i} - q_i \frac{\partial}{\partial q_i} \right}f \ &= q_i w_{i,1} x_1 \mathbf{1}(a_i>0) + q_i w_{i,2} x_2 \mathbf{1}(a_i>0) + q_i b_i \mathbf{1}(a_i>0) - q_i\mathrm{ReLU}(a_i) \ &= q_i \mathrm{ReLU}(\langle w_i, x \rangle + b_i) - q_i\mathrm{ReLU}(\langle w_i, x \rangle + b_i) = 0,. \quad \quad \quad \quad \quad \square \end{aligned}\]
Corollary: Feedforward ReLU neural networks are singular models.
Proof: For the two layer case, for any fixed \(w^0 \in W\), there is a linear dependence relation given by the above differential equation evaluated at \(w^0\), thus the Fisher information is degenerate at \(w^0\), so the model is singular.
The equivalent proof for arbitrary depths and widths is given in Lemma A.1 of [Wei22], following from other work on functional equivalence in [PL19]. \(\square\)
The degenerate node symmetries also give rise to a degenerate Fisher information matrix, though I haven’t formally written out this alternate proof yet. If you are interested, do it as an exercise and leave it as a comment!
Appendix 2 - Proof Sketch for Fully Classifying \(W_0\) for Two Layer Feedforward ReLU Networks
This section is going to be slightly more technical, and in the grand scheme of the SLT story I am telling in this sequence, this may be seen as an unnecessary side-plot. But, other readers, particularly those with a pure mathematical bent, may find it interesting to consider the process of fully classifying \(W_0\) and how one might understand all phases present, so I am providing a sketch of these proofs for completeness. Understanding the full form of \(W_0\) was a vital part of performing the phase transition experiments that we will see in DSLT4. These models are simple enough that we can perfectly classify all true parameters in \(W_0\). Thus, we can precisely understand all of its phases.
We are going to classify the symmetries of \(W_0\) when both the model \(f(x,w)\) and truth \(f_0(x)\) are two-layer feedforward ReLU neural networks, with \(d\) and \(m\) hidden nodes respectively, giving
\[W_0 = { w \in W ,|, f(x,w) = f_0(x)},,\]
meaning the task is to classify functional equivalence of the two networks. To avoid some annoying fringe cases, we assume that the true network is minimal, which means there is no network with fewer nodes that could also represent it (which also means every node is non-degenerate), and activation-distinguished, meaning every node of the truth corresponds to a unique activation boundary.
We will see that the set of symmetries explained above comprise all of the symmetries in \(W_0\) - there can be no more 8. This result rests mainly on the fact that the activation boundaries are the core piece of data that defines a neural network. The rest is then just performing accounting of the gradients and biases in each region.
This is a sketch of the proofs in Chapter 4 of my thesis, and all lemmas and theorems that are referenced in the following section come from here.
Case 1: The model has the same number of nodes as the truth, \(m=d\)
Let \(f(x,w)\) be a two layer feedforward ReLU neural network model with \(d\) hidden nodes, and let \(f_0(x)=f(x,w^{(0)})\) be the realisable true network with \(m\) hidden nodes defined by a fixed parameter \(w^{(0)}\), denoted by
\[f_0(x) = c^{(0)} + \sum_{j=1}^m q_i^{(0)}\mathrm{ReLU}\left(\langle w_j^{(0)}, x \rangle + b_j^{(0)}\right),,\]
which we assume is minimal and activation-distinguished as explained above.
We start by comparing the foldsets, which are the activation boundaries [Lemma 4.1], between the truth and the model. Let \(H_i\) be the activation boundary of the node \(i \in [d]\) in the model, and \(H_j^{(0)}\) be the activation boundary of the node \(j \in [m]\) in the truth. Then by comparing the sets of linear lines in [Lemma 4.2], we can show that for every node of the model \(i \in [d]\) there exists a permutation \(\sigma \in S_m\) such that
\[H_i = H_{\sigma(i)}^{(0)} ,.\]
By [Lemma 4.3], two activation boundaries \(H, H’\) are equal if and only if there is some non-zero scalar \(\alpha \in \mathbb{R}\backslash {0}\) such that \(w=\alpha w’\) and \(b = \alpha b’\).
Using our relation \(H_i = H_{\sigma(i)}^{(0)}\), in [Lemma 4.4] we analyse how the gradients and biases change across each activation boundary, and what this means for the relation between weights and biases in the model versus the truth. We show that there exists a unique \(\sigma \in S_m\), and for each \(i \in [d]\) an \(\epsilon_i \in \mathbb{Z}2\) and \(\alpha_i \in \mathbb{R}{>0}\) such that
\[w_i = (-1)^{\epsilon_i} \alpha_i w_{\sigma(i)}^{(0)},, \quad \text{and} \quad b_i = (-1)^{\epsilon_i} \alpha_i b_{\sigma(i)}^{(0)} ,, \quad \text{where} \quad \alpha_i = \frac{q_{\sigma(i)}^{(0)}}{q_i},,\]
meaning \(q_i\) and \(q_{\sigma(i)}^{(0)}\) necessarily have the same sign.
However, there is a restriction on which weights can have reversed orientation, \(\epsilon_i = 1\) (thus \(w_i = - \alpha_i w_{\sigma(i)}^{(0)}\)). Letting \(E ={i \in [d] | \epsilon_i=1}\), we show in [Lemma 4.5] that the weights and biases of the true network must satisfy 9
\[\sum_{i \in E} q_{\sigma(i)}^{(0)} w_{\sigma(i)}^{(0)} =0 \quad \text{and} \quad c^{(0)} + \sum_{i \in E} q_{\sigma(i)}^{(0)} b_{\sigma(i)}^{(0)} =c ,.\]
The crux of this proof rests in comparing the gradients in regions either side of the activation boundary \(H_i\).
In [Theorem 4.7] we show that these scaling, permutation and orientation reversing symmetries are the only such symmetries by piecing together all of these aforementioned Lemmas, with emphasis on the importance of the activation boundaries in defining the topology of \(f_0(x)\). 10
Case 2: The model has more nodes than the truth, \(m<d\)
We now suppose that the model is over-parameterised compared to the true network, so \(m<d\).
The key piece of data is once again the foldsets defining the model and the truth. Since they must be equal, the model can only have \(m\) unique foldsets, and thus activation boundaries. Without loss of generality (i.e. up to permutation symmetry), the first \([m] \subset [d]\) nodes in the model have the same activation boundaries as the truth, \(\bigcup_{i=1}^m H_i = \bigcup_{j=1}^m H_j^{(0)}\). Thus, these \([m]\) nodes in the model must satisfy the same symmetries as in the \(m=d\) case.
By comparing the fold sets on each excess node in \({m+1,\dots,d}\), we must have
\[\bigcup_{i=m+1}^d { H_i ,| , i , \text{is non-degenerate} } \subseteq \bigcup_{j=1}^m H_j^{(0)} ,.\]
In comparing linear lines again, this means there are two possible situations:
- \({ H_i ,| , i , \text{is non-degenerate} }\) is empty, so node \(i\) is degenerate, meaning \(q_i = 0\) or \(w_i = 0\), or;
- \(H_i = H_j^{(0)}\) for some \(j \in [m]\), so node \(i\) shares an activation boundary already in the first \([m]\) nodes of the model.
Let \(d’\geq m\) the number of non-degenerate nodes of the model. We can thus define a surjective finite set map
\[\pi : {1,\dots, m, m+1, \dots, d’} \to {1,\dots, m}\]
relating the non-degenerate nodes in the model to those in the truth, which is a bijection (i.e. a permutation \(\sigma \in S_m\)) on the first \([m] \subset [d’]\).
We can then compare the gradients and biases in each region to show that the total gradients calculated by each non-degenerate node at each unique activation boundary must sum to the gradient in the truth. Precisely, for each node \(j \in [m]\) of the truth, let \(M_j = {i \in [d’] | \pi(i) =j}\) be the set of nodes in the model that share the same activation boundary. Then for each \(i \in [d’]\) there exists an \(\epsilon_i \in \mathbb{Z}2\) and \(\alpha_i \in \mathbb{R}{>0}\) such that
\[w_i = (-1)^{\epsilon_i} \alpha_i w_{\pi(i)}^{(0)},, \quad b_i = (-1)^{\epsilon_i} \alpha_i b_{\pi(i)}^{(0)}\]
with the constraint that
\[\sum_{i \in M_j} q_i \alpha_i = q_j^{(0)} ,.\]
A similar orientation reversing symmetry also applies as in case 1, just by accounting for the nodes that share the same activation boundaries.
Resources
[Carroll21] - L. Carroll, Phase Transitions in Neural Networks, 2021
[Wei22] - S. Wei, D. Murfet, et al., Deep Learning is Singular, and That’s Good, 2022
[PL19] - M. Phuong, C. Lampert, Functional vs Parametric Equivalence of ReLU networks, 2019
Footnotes
-
Since (K(w)=0) if and only if (q(y|x) = p(y|x,w)) for some (w \in W). ↩
-
e.g. (\mathrm{ReLU}(x) + \mathrm{ReLU}(x) = \mathrm{ReLU}(2x)). ↩︎ ↩
-
For ease of classification, we exclude the case where (q_i \neq 0) and (b_i \neq 0) since we can just absorb the total bias contribution into (c). ↩︎ ↩
-
A linear domain (U \subseteq \mathbb{R}^2) is just a connected open set where (f_w) is a plane with constant gradient and bias when restricted to (U), and (U) is the maximal such set for which that plane is defined. In other words, the set of linear domains are the set of different regions the piecewise affine function are carved up into. ↩︎ ↩
-
But don’t forget, (\mathrm{ReLU}(-x) \neq -\mathrm{ReLU}(x)) as the domain of activation is completely different. ↩︎ ↩
-
Though I have not been able to formally prove it, I believe that this symmetry on its own (i.e. modulo scaling symmetry) does not result in a degeneracy of the Fisher information matrix, at least in our simple case. This, I think, is because the weights must cancel out in the region where both nodes are active, and the gradients in the other regions must be retained. Feel free to prove me wrong, though! ↩︎ ↩
-
This statement is a bit disingenuous. Watanabe’s free energy formula only applies to the case where (K(w)) is analytic, but ReLU neural networks are certainly not analytic, as we can see in the below example. With that said, Watanabe has recently proved a bound on the free energy for ReLU neural networks, showing that the complexity term is essentially related to the number of non-degenerate nodes in the truth, even if it isn’t a true RLCT. We will look at this in more depth in DSLT4. ↩︎ ↩
-
Aside from the technical caveat discussed about the restricted input prior (q(x)) above. ↩
-
Our convention is to take the empty sum to be 0, so all weight orientations being preserved, (E=\varnothing), is perfectly fine. ↩︎ ↩
-
The activation distinguished condition on the truth allows us to uniquely identify the permutation (\sigma \in S_m) relating activation boundaries, and ensures only one node changes across each boundary. ↩︎ ↩