Singular learning theory: exercises
Read on LessWrong*Thanks to Jesse Hoogland and George Wang for feedback on these exercises. *
In learning singular learning theory (SLT), I found it was often much easier to understand by working through examples, rather than try to work through the (fairly technical) theorems in their full generality. These exercises are an attempt to collect the sorts of examples that I worked through to understand SLT.
Before doing these exercises, you should have read the Distilling Singular Learning Theory (DSLT) sequence, watched the SLT summit YouTube videos, or studied something equivalent. DSLT is a good reference to keep open while solving these problems, perhaps alongside Watanabe’s textbook, the Gray Book. Note that some of these exercises cover the basics, which are well-covered in the above distillations, but some deliver material which will likely be new to you (because it’s buried deep in a textbook, because it’s only found in adjacent literature, etc).
Exercises are presented mostly in conceptual order: later exercises freely use concepts developed in earlier exercises. Starred (*) exercises are what I consider the most essential exercises, and the ones I recommend you complete first.
- *The normal distribution, like most classical statistical models, is a regular (i.e. non-singular1) statistical model. A univariate normal model with unit variance and mean \(\mu \in \mathbb{R}\) is given by the probability density \(p(x|\mu)=\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(x-\mu)^2)\). Assume the true distribution \(q(x)\) of the data is realizable by the model: that is, \(q(x)=p(x|\mu_0)\) for some true parameter \(\mu_0\).
- Calculate the Fisher information matrix of this model (note that since we have only a single parameter, the FIM will be a 1x1 matrix). Use this to show the model is regular.
- Write an explicit expression for the KL divergence \(K(\mu)\) between \(q(x)\) and \(p(x|\mu)\), as a function of the parameter \(\mu\). This quantity is sometimes also called the population loss. [See Example 1.1, Gray Book, for the case of a 2D normal distribution]
- Using \(K(\mu)\) from b), give an explicit formula for the volume of “almost optimal” parameters, \(V(\epsilon)=\int_{{\mu \mid K(\mu) < \epsilon}}\varphi(\mu) ; d\mu\), where \(\varphi(\mu)\) is the prior distribution. For convenience, let \(\varphi(\mu)\) be the improper prior \(\varphi(\mu)=1\).
- The volume scaling formula for the learning coefficient \(\lambda\) (also known as RLCT2) is \(\lambda = \lim_{\epsilon \rightarrow 0} \frac{\log(V(a\epsilon)/V(\epsilon))}{\log(a)}\) for any \(a \not=1\) [Theorem 7.1, Gray Book]. Using this formula, combined with the expression for \(V(\epsilon)\) derived in b), calculate the learning coefficient3. Given that the model is regular, we expect the learning coefficient to be \(\frac{d}{2}=\frac{1}{2}\); compare your answer.
- *We can make the normal distribution a singular model by changing the parameterization. Let a cubicly-parameterized normal model be the model \(p(x|\mu)=\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(x-\mu^3)^2)\). Assume the true parameter is \(\mu_0\).
- Show that the cubicly-parameterized normal model is just as expressive as an ordinary normal model: that is, they both can express all univariate normal distributions.
- Repeat 1a) with this model; calculate the Fisher information matrix to demonstrate that the model is singular, and find which parameters \(\mu\) are singular.
- Repeat 1b) - 1d) to calculate the learning coefficient this model, for \(\mu_0=0\), and for \(\mu_0 \not=0\).
Recall that the learning coefficient is a volume scaling exponent, such that \(V(\epsilon) \propto \epsilon^\lambda\) 4 as \(\epsilon \rightarrow 0\). Based on this, interpret your results. How does this make the cubicly-parameterized normal model different from the ordinary normal model? - Instead of taking \(\epsilon \rightarrow 0\) to get the learning coefficient, fix a small but nonzero value for \(\epsilon\), such as \(\epsilon=0.01\). Plot \(V(\epsilon)\) as a function of \(\mu_0\). As we saw from c), the learning coefficient changes discontinuously when \(\mu_0=0\) - what happens with \(V(\epsilon)\) as \(\mu_0\) gets close to zero? What changes if you make \(\epsilon\) smaller or larger?
Even though the asymptotic learning coefficient (\(\epsilon \rightarrow 0\)) only changes when \(\mu_0=0\) exactly, note how the non-asymptotic volume (\(\epsilon\) finite) is affected in a larger neighborhood.
- *For each of the two different \(K(\mu)\) formulas obtained in 1b) and 2c), create plots of \(K(\mu)\) and \(e^{-nK(\mu)}\), with \(\mu_0\) and \(n\) as variable sliders (e.g. using Desmos). Note how these functions change as \(\mu_0\) and \(n\) change. Compare results between the \(K(\mu)\) obtained in 1b) and the \(K(\mu)\) obtained in 2c).
- Consider yet another alternative parameterization of the normal distribution, this time with two parameters: \(p(x|a,b)=\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(x-a-b)^2)\). Assume the true distribution is realizable with true parameters \(a_0\) and \(b_0\).
This type of model is called minimally singular, for reasons that will become clear shortly. Minimally singular models can be dealt with easier than more general singular models.- Show that this model is just as expressive as an ordinary normal model; they both can express all univariate normal distributions.
- Calculate the Fisher information matrix of the model and show that the model is singular. Further, calculate the eigenvectors of the matrix, and interpret them.
- Calculate the learning coefficient using the same procedure as 2c). Note that you will need a proper prior \(\varphi(a, b)\) this time, in order for the integral to converge: a uniform prior over the unit square should work, i.e. \(\varphi(a, b) = {0.25 \text{ if } |a|\leq1,|b|\leq1;, 0 \text{ otherwise}}\).
- As we saw from part b), this model has one direction in parameter space which changes the distribution \(p(x|\mu)\), and one direction in parameter space which does not - and this is true globally. It seems sensible to just ignore the direction in parameter space which is globally redundant. Apply the formula for the learning coefficient of regular models, but use \(d=1\) instead of \(d=2\). Compare your result to part c).
- It turns out that the strategy from part d), merely ignoring or quotienting out redundant submanifolds, works for any minimally singular model. However, many singular models, including neural networks, are not minimally singular. Why does this strategy not work for singular models in general? What would stop you from applying it to the model in problem 2), for instance?
- Consider a two-parameter alternatively-parameterized normal distribution: \(p(x|a,b)=\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(x-a^{k_1} b^{k_2})^2)\), where \(k_1,k_2 \in \mathbb{N}\) are fixed constants. Assume the true distribution is realizable with true parameters \(a_0\) and \(b_0\).
- Calculate the Fisher information matrix of the model. For what parameters is the model singular? For singular parameters, when is the rank of the matrix zero, and when is it one?
- Compute the learning coefficient using the same procedure as 1b), for \(a_0b_0 = 0\).
- In the Gray Book, Watanabe proves that any model with KL divergence of the form \(K(w_1, w_2, \ldots, w_r) = {w_1}^{2k_1} {w_2}^{2k_2} \ldots {w_r}^{2k_r}\), known as normal crossing form, has learning coefficient \(\lambda=\frac{1}{2\max({k_1, k_2, \ldots, k_3})}\) (in the case where the prior is non-vanishing). This result is more than a special case: Watanabe showed that any statistical model may (locally) be converted into normal crossing form, via a change of variables from algebraic geometry known as resolution of singularities.
Show that your answer from part b) matches Watanabe’s formula. - The general form of the asymptotic volume scaling formula is \(V(\epsilon) \propto \epsilon^\lambda , (-\log \epsilon)^{m-1}\), where \(\lambda\) is the learning coefficient and \(m\) is the multiplicity. So far we have been able to ignore the multiplicity, because we have only considered examples where \(m=1\). However, in this example, if \(k_1=k_2\), then the multiplicity \(m=2\). Show that the general form of the volume scaling formula holds when \(k_1=k_2\).
- *Repeat 3), but with the KL divergence \(K(a,b)\) from problem 5) instead. Vary \(a_0\), \(b_0\), \(k_1\), \(k_2\), and \(n\), and see how the functions change. In particular, note what happens near \(a_0 b_0=0\). (Math3D is helpful here.)\(\)
- Many practical machine learning models are conditional - they have inputs and outputs. Sometimes, as in the case of regression models, they may not naively appear to be probabilistic models at all. These properties are both demonstrated in univariate linear regression, a model given by \(y = mx+b\), where \(x \in \mathbb{R}\) is the input, \(m \in \mathbb{R}\) is the slope parameter and \(b \in \mathbb{R}\) is the intercept parameter. [Example 1.4, Gray Book].
- Assume we are doing least-squares regression (the loss function is mean squared error loss). Show how to write this model as a probabilistic statistical model: that is, find a formula for the model’s conditional probability \(p(y|x,m,b)\) such that the negative log-likelihood yields mean squared error loss. (Hint.)
- Assuming the true distribution \(q(y|x)\) is realizable with true slope \(m_0\) and true intercept \(b_0\), calculate the Fisher information matrix of this model. Use this to show that the model is regular.
- *Consider a constant-width, two-layer linear network \(y = BAx\), where \(x \in \mathbb{R}^n\) are the inputs, \(y \in \mathbb{R}^n\) are the outputs, and \(A, B \in \mathbb{R}^{n \times n}\) are linear transformations5. The parameter space is \(W =\mathbb{R}^{2n^2}\) and the parameters are the pair \((A, B) \in W\).
- Show that this two-layer linear model is just as expressive as an ordinary linear model; they both can express all linear functions from \(\mathbb{R}^n\) to \(\mathbb{R}^n\).
- Let us examine the singular directions of this model; that is, directions in parameter space which do not change the function implemented by the model, to first order. More precisely, the parameter-function map can be seen as a (nonlinear) map \(f: \mathbb{R}^{2n^2} \rightarrow \mathbb{R}^{n \times n}\), from parameters in \(\mathbb{R}^{2n^2}\) to linear transformations \(\mathbb{R}^{n \times n}\). A tangent direction \(\delta w\) in parameter space is a singular direction at a parameter \(w\) if the directional derivative \(D_{\delta w} f(w) = 0\).6
- Assume \(A\) and \(B\) are full rank. What are the singular directions?
- Assume \(A\) and/or \(B\) are not full rank. What are the singular directions now? How does the number of singular directions depend on the rank of \(A\) and \(B\)?
- Interpret these results. How does this make the two-layer linear model differ from the ordinary linear model?
- Note that the condition that \(A\) and/or \(B\) are not full rank is zero measure in the parameter space, which could lead to the interpretation that such a scenario will never occur in practice, and we should ignore the new singular directions that occur here. Why is this interpretation misleading? What happens if \(A\) and/or \(B\) are almost not full rank (have very small eigenvalues)?
- In the realizable case, Aoyagi & Watanabe (2005) derived a formula for the learning coefficient \(\lambda\), in terms of the rank and dimensions of the true parameters \(A_0\) and \(B_0\) [Main Theorem].
- Aoyagi & Watanabe do not assume the network is constant-width, like we do. Simplify their formula for \(\lambda\) using this assumption.
- How does the learning coefficient \(\lambda\) depend on \(r\), the rank of \(B_0 A_0\)? Compare with the results of part b).
- Repeat b.iii), but with \(\lambda\) instead.
- Repeat b.iv), but with \(\lambda\) instead. (Hint: see problem 2d).)
- The learning coefficient is a fairly global property of the KL divergence; in many cases, it is helpful to have a more local quantity. The local learning coefficient (LLC) of a parameter \(w^\) is defined to be the learning coefficient of the model with the support of the prior \(\varphi(w)\) restricted to an arbitrarily small neighborhood of \(w^\).
For illustration, we will consider the parameterized normal model \(p(x|\mu)=\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(x-(\mu(\mu-2)^2))^2)\)
for true parameter \(\mu_0=0\).- Find and plot \(K(\mu)\). Qualitatively, how does \(K(\mu)\) look near \(\mu=0\) versus \(\mu=2\)?
- Compute and compare the LLC of \(\mu^=0\) and \(\mu^=2\). How do these values formalize the qualitative intuition found via a)?
- Compute the (global) learning coefficient of the model. In general, the global learning coefficient is the minimum of the local learning coefficients for the minima of \(K(\mu)\). Confirm that this holds here.
- The global learning coefficient is the more important invariant for Bayesian learning, due to e.g. the free energy formula; why might the local learning coefficient be better suited for describing the complexity of a model where only a point value of \(w\) is estimated (e.g. using SGD)?
- *Up to this point, we have implicitly assumed that we have direct access to the true distribution \(q(x)\), for instance when calculating the learning coefficient using the KL divergence \(K(w)\). However, in practice, we can only access the true distribution indirectly, via \(n\) samples \(D_n = {X_1, X_2, \ldots, X_n}\) drawn IID from \(q(x)\). This is a significant gap.
Still, much like e.g. air resistance in physics, moving from idealized population quantities to realistic empirical quantities adds new complications, but much of the fundamental intuitions continue to hold. 1. We may use the negative log-likelihood \(L_n(w) = -\sum_{i} \log p(X_i|w)\)7 as a proxy for the KL divergence \(K(w) = \int q(x) \log\left(\frac{q(x)}{p(x|w)}\right) , dx\). Prove that \(\mathbb{E}_{D_n}[L_n(w)]=K(w) + S\), where \(S = - \int q(x) \log(q(x)) , dx\) is the (constant) entropy of the true distribution.8 2. Take the model from 2) with \(\mu_0=0\) and numerically sample from \(q(x)\) for \(n=10\), \(n=100\), and \(n=1000\).- Plot \(L_n(\mu)\) compared to \(K(\mu)+S\). How do the two differ, and what changes as \(n\) increases?
- Plot the unnormalized Bayesian posterior9 \(e^{-nL_n(\mu)}\) compared to \(e^{-nK(\mu)}\). How do the two differ, and what changes as \(n\) increases? It is much easier to reason about \(e^{-nK(\mu)}\) theoretically compared to \(e^{-nL_n(\mu)}\) - what does this suggest we can use \(\)\(e^{-nK(\mu)}\) for? 3. Take the model from 2) with \(\mu_0=0\) and numerically compute \(V(\epsilon)\) as a function of \(\epsilon\) using \(K(\mu)\). Verify that the proportionality \(V(\epsilon) \propto \epsilon^\lambda\) holds as expected, by plotting \(\log(V(\epsilon))\) versus \(\epsilon\). 4. Define the empirical version of \(V(\epsilon)\) as \(V_n(\epsilon) = \int_{{\mu\mid (L_n(\mu) - \min L_n(\mu)) < \epsilon}}\varphi(\mu) ; d\mu\). Repeat c), but with \(V_n(\epsilon)\) instead of \(V(\epsilon)\). How do the results differ? How well does the proportionality \(V_n(\epsilon) \propto \epsilon^\lambda\) hold for larger \(\epsilon\), and how well does it hold for smaller \(\epsilon\)? Based on what you saw in part b), give an intuitive explanation for why \(V_n(\epsilon)\) behaves differently for small \(\epsilon\).
- The tempered Bayesian posterior is the distribution \(p(w|D_n) = \frac{1}{Z_n} (\prod_{i} p(X_i|w)^\beta) \varphi(w) = \frac{1}{Z_n} e^{-n\beta L_n(w)} \varphi(w)\), where \(\beta\) is a parameter known as the inverse temperature, and \(Z_n = \int e^{-n\beta L_n(w)} \varphi(w) , dw\) is the normalizing constant known as the partition function or marginal likelihood. Note that the tempered Bayesian posterior results in the ordinary Bayesian posterior when \(\beta=1\). Qualitatively, describe what increasing \(\beta\) does to the tempered Bayesian posterior. How does it differ from the effect of increasing \(n\)?
- The free energy or log marginal likelihood is the quantity \(F_n(\beta) = -\log \int (\prod_{i} p(X_i|w)^\beta) \varphi(w) , dw = -\log \int e^{-n\beta L_n(w)} \varphi(w) , dw\)
obtained as the log of marginalizing the tempered Bayesian posterior at inverse temperature \(\beta\). Perhaps the most central result of SLT is the asymptotic expansion of the free energy:
\(F_n = n\beta S_n +\lambda \log n +O(\log\log(n))\),
where the empirical entropy \(S_n = -\frac{1}{n} \sum_{i} \log(q(X_i))\) [Gray Book, Main Formula II]. 1. Interpret the meaning of the free energy from the definition (it is the negative log of the marginal likelihood \(p(D_n)\)). Why is it advantageous to choose a model with lower free energy on a given dataset? 2. Take the model from 2) with \(\mu_0=0\), \(\beta=1\), and numerically compute the free energy for many values of \(\)\(n\) between \(n=100\) and \(n=10000\). Compare the calculated value to the estimate \(F_n \approx n\beta S_n +\lambda \log n\). 3. Repeat b), but with \(\mu_0=5\). 4. Repeat b), but with \(\mu_0=0.1\). Technically, the learning coefficient at \(\mu_0=0.1 \not = 0\) is still \(\lambda = \frac{1}{2}\), the same as part c), but note how the behavior of the free energy seems closer to b) than c) for low \(n\). The effects of the singularity at \(\mu=0\) extend to the neighborhood around it. - The local free energy of a subset of parameter space \(W \subset \mathcal{W}\) is the free energy of the model with the parameter space restricted to \(W\). We show how phase transitions can occur when different local free energies change at different rates. 1. Suppose we partition the overall parameter space into two subsets, \(W_1\) and \(W_2\). Denote the overall free energy by \(F(n)\) and the local free energies by \(F_1(n)\) and \(F_2(n)\), respectively. Show that the relationship \(F(n) = -\log(e^{-F_1(n)} , + , e^{-F_2(n)})\) holds. 2. Suppose \(F_1(n) \approx 0.3n + 20 \log n\) and \(F_2(n) \approx 0.5n + 2 \log n\). Plot the overall free energy \(F(n)\). Compare against \(\min{F_1(n), F_2(n)}\). What happens around \(n=570\)? 3. In statistical physics, a phase transition is traditionally defined as a discontinuity (or rapid change, for finite-size systems) in the derivatives of the free energy. Plot \(\frac{d}{dn}F(n)\) and explain why this justifies calling the phenomenon from b) a phase transition. 4. Change the coefficients of the terms in b). How does this change when a phase transition occurs? Does a phase transition always occur?
- We sometimes informally refer to singular models as “models where you can (to first order) move in some direction in parameter space without changing the function implemented by the model.” However, singular models are defined in terms of the Fisher information matrix; it may not be immediately clear how the informal statement follows.
Suppose we have a (potentially nonlinear) regression model given by a map \(f: W \rightarrow \mathcal{F}\) from a parameter space \(W = \mathbb{R}^d\) to a function space \(\mathcal{F}\) with outputs in \(\mathbb{R}^n\), for which we use mean squared error loss.10 We may write this as a statistical model: \(p(y|x,w)=\mathcal{N}(f(w)(x), I) = (2\pi)^{-d/2} \exp(-\frac{1}{2} ||y-f(w)(x)||^2)\)
where \(\mathcal{N}\) denotes the multivariate normal distribution and \(I\) is the identity matrix [See here]. 1. Show that \(\nabla_w \log (p(y|x,w)) = −(y−f(w)(x))^T \cdot J_f(w)(x)\) where \(J_f(w)(x)\) is the Jacobian of \(f(w)(x)\) with respect to \(w\). 2. Suppose that the model is realizable such that the true parameter is \(w\); that is, \(q(x,y) = p(y|x,w) q(x)\) for true input distribution \(q(x)\). Prove that the Fisher information matrix is given by \(I(w)=\mathbb{E}_{x \sim q(x)} [J_f(w)(x)^T J_f(w)(x)]\). 3. Prove that a vector \(v\) is in the null space of the Fisher information matrix \(I(w)\) if and only if it is in the null space of \(J_f(w)(x)\) for all \(x\) in the support of \(q(x)\).
Conclude that a regression model is singular at a parameter \(w \in W\) if and only if there exists a vector \(\)\(v \in W\) such that the directional derivative \(\nabla_v f(w)(x) = 0\) for all inputs \(x\) in the support of \(q(x)\). 4. In the case where the support of \(q(x)\) is the entire input space of \(\mathcal{F}\), show that the null space of the Fisher information matrix \(I(w)\) is equal to the null space of the total derivative \(D f\). - The naive method of determining “how singular a model is” at a particular parameter \(w\) is by counting the number of singular directions at \(w\); more precisely, using the rank of the Fisher information matrix11 at \(w\).
1. Construct two statistical models where the rank of the Fisher information matrix at the true parameter \(w_0\) is the same, but the learning coefficient differs between the two. (Hint: modify the model from problem 2).)
In this example, what information is the learning coefficient giving you that is missing from the rank of the Fisher information matrix? 2. The rank of the Fisher information matrix does give some information, however. Show that if the rank of the Fisher information matrix at \(w_0\) is \(r\), the learning coefficient satisfies the lower bound \(\lambda \geq \frac{r}{2}\). [Hard.] - In practice, the Hessian of the loss function is a more directly visible quantity than the Fisher information matrix; however, the two are closely connected. 1. Recall the population loss is given by \(L(w)=-\int q(x) \log p(x|w) ; dw\). Suppose that the true distribution is realizable with true parameter \(w_0\). Prove that the Hessian of the population loss at \(w_0\) is equal to the Fisher information matrix at \(w_0\)\( \). 2. Recall the empirical loss is given by \(L(w)=- \sum_{x_i} \log p(x_i|w)\), where \(x_i \sim q(x)\) IID. The empirical Fisher information matrix is given by \(I_n(w) = \sum_{x_i} \nabla \log p (x_i|w) \nabla \log p (x_i|w)^T\). Suppose that the true distribution is realizable with true parameter \(w_0\). Prove that the Hessian of the empirical loss at \(w_0\) is equal to the empirical Fisher information matrix at \(w_0\)\(\). 3. Suppose that a direction \(\delta w \in T\mathcal{W}\) is singular at a parameter \(w \in \mathcal{W}\) - that is, \(\delta w\) lies in the null space of the Fisher information matrix at \(w\). Prove that it also lies in the null space of the population Hessian and empirical Hessian at \(w\)\(\). (Note that unlike a) or b), you do not need to assume realizability or even that \(w\) is a local minimum of the loss.) 4. If one finds empirically that the loss Hessian has many zero eigenvalues at a parameter, what does this suggest about the Fisher information matrix at that parameter, and consequently about how singular the model is?
- Here we prove that the learning coefficient is invariant to diffeomorphism; that is, roughly, a smooth and invertible change of variables. A diffeomorphism is an invertible map \(\phi: W_1 \rightarrow W_2\) between two spaces \(W_1, W_2\) such that both \(W_1\) and \(W_2\) are infinitely differentiable. 1. Show that \(\phi_1(x)=3x + 2\) is a diffeomorphism from \(\mathbb{R}\) to \(\mathbb{R}\), and \(\phi_2(x) = \sin(x)\) is a diffeomorphism from \((-\frac{\pi}{2}, \frac{\pi}{2})\) to \((-1, 1)\). Explain why \(\phi_2(x)\) is not a diffeomorphism from \([-\frac{\pi}{2}, \frac{\pi}{2})\) to \([-1, 1)\). 2. Let \(p(x|w_2)\) be a model defined on the parameter space \(W_2\). We apply a diffeomorphism \(\phi: W_1 \rightarrow W_2\) to create the model \(p(x|\phi(w_1))\) defined on the parameter space \(W_1\). Abusing notation, let \(K(w_1) = \text{KL}(q(x) , || , p(x|\phi(w_1)))\) denote the KL divergence of the model on \(W_1\) and \(K(w_2) = \text{KL}(q(x) , || , p(x|w_2))\) the KL divergence of the model on \(W_2\). Define \(V_1(\epsilon) = \int_{{w_1|K(w_1)<\epsilon}} dw_1\) and \(V_2(\epsilon) = \int_{{w_2|K(w_2)<\epsilon}} dw_2\). Rewrite \(V_1(\epsilon)\) as an integral over \(W_2\) using the change of variables formula. 3. Using the result from part b) together with the fact that \(\phi\) is a diffeomorphism, show that there exist positive constants \(c_1,c_2\) such that \(c_1 V_2(\epsilon) \leq V_1(\epsilon) \leq c_2 V_2(\epsilon)\). 4. Using the result from c) together with the volume scaling formula for the learning coefficient \(\lambda\) and its multiplicity \(m\), prove that \(V_1(\epsilon) \propto V_2(\epsilon) \propto \epsilon^\lambda (\log \epsilon)^{m-1}\) as \(\epsilon \rightarrow 0\). Conclude that the learning coefficient and multiplicity of the two models is the same, and thus invariant to diffeomorphism.
- The role of the partition function \(Z_n(\beta)\) and free energy \(F_n(\beta)\) can be partially explained by their role as moment and cumulant generating functions of the posterior loss. This connection also leads to the scalable estimators of the free energy and learning coefficient used in practice.
Let the random variable \(\tilde{w}\beta\) be a sample from the tempered posterior \(\tilde{w}\beta \sim \frac{1}{Z_n} e^{-n \beta L_n(w)} \varphi(w)\). Then \(L_n(\tilde{w}\beta)\) is a real-valued random variable giving the negative log-likelihood of a randomly sampled parameter from the tempered posterior. 1. Recall that the moment generating function of a random variable \(X\) with PDF \(p(x)\) is given by \(M_X(s) = E[e^{sX}] = \int{-\infty}^{\infty} e^{sx} p(x) , dx\), and the partition function is given by \(Z_n(\beta) =\int e^{-n \beta L_n(w)} \varphi(w) , dw\).- Show that \(M_{nL_n(\tilde{w}_\beta)}(s) = \frac{Z_n(\beta-s)}{Z_n(\beta)}\).
- The \(i\)-th moment \(m_i\) of \(X\) is given by \(m_i = \frac{d^i}{d^is} M_X(s) |{s=0}\). Compute \(m_i\) for \(n L_n(\tilde{w}\beta)\). 2. Recall the cumulant generating function of a random variable \(X\) with PDF \(p(x)\) is given by \(K_X(s) = \log M_X(s) = \log \int_{-\infty}^{\infty} e^{sx} p(x) , dx\), and the free energy is given by \(F_n(\beta) = - \log \int e^{-n \beta L_n(w)} \varphi(w) , dw\).
- Show that \(K_{nL_n(\tilde{w}_\beta)}(s) = F_n(\beta) - F_n(\beta-s)\).
- \(\)The \(i\)-th cumulant \(\kappa_i\) of \(X\) is given by \(\kappa_i = \frac{d^i}{ds^i} K_X(s) |{s=0}\). Show that \(\kappa_i = (-1)^{i+1} \frac{d^i}{d^is}|{s=\beta} ; F_n(s)\) for \(n L_n(\tilde{w}_\beta)\). 3. We may use b.ii) to estimate \(F_n(1)\) or \(\lambda\) empirically from samples, far more efficiently than naive methods. The estimator of \(F_n(1)\) is known as the WBIC.
- Using b.ii), conclude that \(\mathbb{E}{\tilde{w}\beta}[n L_n(\tilde{w}\beta)] = \frac{d}{ds} |{s=\beta} ; F_n(s)\).
- The free energy formula can be written as \(F_n(\beta) \approx n\beta S_n + \lambda \log (n\beta)\). There exists an optimal inverse temperature \(\beta^\) such that \(\mathbb{E}{\tilde{w}{\beta^}}[nL_n(\tilde{w}_{\beta^})] = F_n(1)\). Find an approximate expression for \(\beta^\) using i) and assuming that the free energy formula holds after differentiating both sides, i.e.\(\frac{d}{d\beta} F_n(\beta) \approx \frac{d}{d\beta} (n\beta S_n + \lambda \log (n\beta))\).
- Instead of estimating \(F_n(\beta)\), we may estimate the learning coefficient \(\lambda\) directly. Use i) and the free energy formula to get an approximate expression for \(\lambda\) in terms of \(\beta\), \(S_n\), and \(\mathbb{E}{\tilde{w}\beta}[n L_n(\tilde{w}_\beta)]\). Is it still necessary to set \(\beta\) to an optimal inverse temperature for this estimator to work?
- A priori, it seems like the choice of prior distribution \(\varphi(w)\) could significantly affect the learning coefficient we get, via the volume scaling formula or otherwise. However, this is not the case - the learning coefficient remains the same so long as the prior is supported at \(\mu_0\) [Gray Book, Remark 6.7]. Thus, the prior has at most a secondary role in singular learning theory.
To demonstrate this, redo 1c) and 1d) but with a different prior distribution that also has support at \(\mu_0\).
References
- [The “Gray Book”] Algebraic Geometry and Statistical Learning Theory (Watanabe 2009)
- [The “Green Book”] Mathematical Theory of Bayesian Statistics (Watanabe 2018)
- Stochastic complexities of reduced rank regression in Bayesian estimation (Aoyagi & Watanabe 2005)
Footnotes
-
Technically, if we follow Watanabe’s terminology, a regular model is non-strictly-singular rather than non-singular. Watanabe defines singular models as a superset of regular models, so every regular model is also singular - non-regular models are referred to as strictly singular models. ↩
-
Technically, the learning coefficient is not the same thing as the real log-canonical threshold (RLCT); the learning coefficient is an invariant of a statistical system (model, truth, prior triplet), whereas the RLCT is an invariant of an analytic function. However, the RLCT of the model/truth KL divergence coincides with the learning coefficient if the prior is supported along the true parameter set. ↩
-
Note that in practice, analytically calculating or numerically estimating the learning coefficient directly via this volume scaling formula is completely intractable. Instead, methods based on the WBIC ()and MCMC are necessary. ↩
-
In the case where the multiplicity (m = 1). ↩
-
This kind of two-layer linear model is sometimes called reduced rank regression. ↩
-
By 14d), this is equivalent to the null space of the FIM. ↩
-
Note that the Gray Book uses (L_n(w)) to refer to the likelihood instead of the negative log-likelihood. ↩
-
Note that this pointwise expected value only tells us about (L_n(w)) for a fixed (w); it does not give us enough information to talk about properties of (L_n(w)) which depend on many (w), like volume scaling, the free energy, etc. Establishing this is highly nontrivial, and the Gray Book spends a significant amount of time doing so. ↩
-
Implicitly, with improper prior (\varphi(w)=1), but the prior isn’t important for intuition here. ↩
-
A neural network under mean squared error loss would satisfy these assumptions. ↩
-
In some cases, the rank of the Hessian may also be used here, given the correspondence between the FIM and the Hessian (see problem 16). ↩