Want to understand the average size of hidden representations?


You should! This is foundational for understanding everything else about the dynamics of neural networks.

Historically, the first questions people tried to answer about neural networks dealt with their performance and representations: how can we characterize how well our network performs, and what hidden representations does it learn as it trains? We’ll revisit these questions later in a modern light, but suffice it to say that they are hard and it’s unclear where to start. In rigorous science, it’s usually a good idea to be humble about what you can understand and to start with the dumbest, simplest question you think you can answer, working up from there. It turns out that the simplest useful theoretical question you can ask about neural networks is: as you forward-propagate your signal and backprop your gradient, roughly how big are the (pre)activations and gradients on average? Answering this is useful for understanding initialization, avoiding exploding or vanishing gradients, and getting stable optimization even in large models.

More precisely: suppose we have an input vector \(\mathbf{x}\) (an image, token, set of features, etc.), and we start propagating forward through our network, stopping part way. Denote by \(\mathbf{h}_\ell(\mathbf{x})\) the hidden representations after applying the \(\ell\)-th linear layer. What’s the typical size of an element of \(\mathbf{h}_\ell(\mathbf{x})\)? If you want a mathematical metric for this, you might study the root-mean-squared size

\[ q_\ell(\mathbf{x}) := \frac{\lVert \mathbf{h}_\ell(\mathbf{x}) \rVert}{\sqrt{\text{size}[\mathbf{h}_\ell(\mathbf{x})]}}. \]

You don’t want \(q_\ell(\mathbf{x})\) to either blow up or vanish as you propagate forwards through the network. If either happens, you’ll be feeding very large or very small arguments to your activation function, which is generally a bad idea. In a neural network of many layers, problems like this tend to get worse as you propagate through more and more layers, so you want to avoid them from the get-go.
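
To make this concrete, here’s a minimal numpy sketch that tracks \(q_\ell\) layer by layer; the width, depth, \(\tanh\) nonlinearity, and init scale `sigma` are all arbitrary toy choices, and the next section pins down what `sigma` should be.

```python
import numpy as np

rng = np.random.default_rng(0)

def q(h):
    """Root-mean-squared entry size of a hidden vector h."""
    return np.linalg.norm(h) / np.sqrt(h.size)

width, depth, sigma = 1024, 10, 0.05   # arbitrary toy choices
x = rng.standard_normal(width)         # a toy input with O(1) entries

h = x
for ell in range(1, depth + 1):
    W = sigma * rng.standard_normal((width, width))   # W_ij ~ N(0, sigma^2)
    h = W @ h                      # preactivations after the ell-th linear layer
    print(f"layer {ell}: q = {q(h):.3g}")
    h = np.tanh(h)                 # activations fed into the next layer
```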

First steps: LeCun initialization and large width

The first people to address this question seriously were practitioners in the 1990s. If you initialize a neural network’s weight parameters with unit variance — that is, \(W_{ij} \sim \mathcal{N}(0,1)\) — then your preactivations tend to blow up. If instead you initialize with

\[ W_{ij} \sim \mathcal{N}(0, \sigma^2), \]

where \(\sigma^2 = \frac{1}{\text{[fan in]}}\), the preactivations are better controlled. The reason is a central limit theorem calculation: each preactivation is a sum of [fan in] roughly independent terms, each of variance \(\sigma^2\) times an \(O(1)\) factor, so its total variance is roughly \(\text{[fan in]} \cdot \sigma^2 = 1\). This init size thus ensures that well-behaved activations in the previous layer mean well-behaved preactivations in the following layer, at least at initialization.

This is the first calculation any new deep learning scientist should be able to perform! Get comfortable with this kind of central limit theorem argument.
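
Here’s a quick numerical check of the claim (a sketch; the width, depth, and \(\tanh\) nonlinearity are arbitrary choices): propagate a random input through a deep net under the two init scales and watch \(q_\ell\).

```python
import numpy as np

rng = np.random.default_rng(0)

def preactivation_sizes(sigma2, width=2048, depth=8):
    """q_ell of the preactivations of a deep tanh MLP with W_ij ~ N(0, sigma2)."""
    h = rng.standard_normal(width)
    qs = []
    for _ in range(depth):
        W = np.sqrt(sigma2) * rng.standard_normal((width, width))
        h = W @ np.tanh(h)                     # preactivations of the next layer
        qs.append(np.linalg.norm(h) / np.sqrt(width))
    return qs

# With unit-variance weights the preactivations are ~sqrt(fan_in) ~ 45x too large
# (and the tanh saturates); with sigma^2 = 1/fan_in they stay order one.
print("sigma^2 = 1        :", [f"{v:.3g}" for v in preactivation_sizes(1.0)])
print("sigma^2 = 1/fan_in :", [f"{v:.3g}" for v in preactivation_sizes(1.0 / 2048)])
```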

The central limit theorem becomes a better and better approximation as the number of summed terms grows. This suggests that, when studying any sort of signal propagation problem, it’s a good and useful idea to consider the case of large width. This is the consideration that motivates the study of infinite-width networks, which are now central in the mathematical science of deep learning. It’s very important to think about this enough to have intuition for why the large-width limit is useful.

Infinite-width nets at init: signal propagation and the neural network Gaussian process (NNGP)

A natural next question is: what else can you say about wide neural networks at initialization? The answer unfolds in the following sequence.

First, you can perform a close study of the “signal sizes” \(q_\ell(\mathbf{x})\) as well as the correlations \(c_\ell(\mathbf{x}, \mathbf{x}') := \frac{\langle \mathbf{h}_\ell(\mathbf{x}), \mathbf{h}_\ell(\mathbf{x}') \rangle}{\lVert \mathbf{h}_\ell(\mathbf{x}) \rVert \, \lVert \mathbf{h}_\ell(\mathbf{x}') \rVert}\) between the representations of different inputs. You can actually calculate both of these exactly at infinite width using Gaussian integrals.
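
Concretely, at infinite width these quantities obey simple layer-to-layer recursions in which the only nontrivial step is a one- or two-dimensional Gaussian integral. Here’s a sketch of the \((q, c)\) recursion for a deep \(\tanh\) net with weight variance \(\sigma_w^2 / \text{[fan in]}\) and bias variance \(\sigma_b^2\), using Gauss–Hermite quadrature for the integrals; the hyperparameter values are arbitrary, and for simplicity I assume both inputs share the same \(q\).

```python
import numpy as np

# Gauss-Hermite quadrature for E_{z ~ N(0,1)}[f(z)]: hermegauss uses weight
# exp(-z^2/2), so we divide the weights by sqrt(2*pi).
z, w = np.polynomial.hermite_e.hermegauss(41)
w = w / np.sqrt(2 * np.pi)

def gauss_E(f):
    """E[f(z)] for z ~ N(0,1)."""
    return np.sum(w * f(z))

def gauss_E2(f, c):
    """E[f(z1, z2)] for standard Gaussians z1, z2 with correlation c."""
    Z1, Z2 = np.meshgrid(z, z, indexing="ij")
    return np.sum(np.outer(w, w) * f(Z1, c * Z1 + np.sqrt(1.0 - c**2) * Z2))

phi = np.tanh
sigma_w2, sigma_b2 = 1.5, 0.1   # arbitrary example hyperparameters

def step(q, c):
    """One layer of the (q, c) recursion, assuming both inputs share the same q."""
    q_next = sigma_w2 * gauss_E(lambda u: phi(np.sqrt(q) * u) ** 2) + sigma_b2
    cov_next = sigma_w2 * gauss_E2(
        lambda u1, u2: phi(np.sqrt(q) * u1) * phi(np.sqrt(q) * u2), c
    ) + sigma_b2
    return q_next, cov_next / q_next

q, c = 1.0, 0.5   # signal size and correlation of the inputs
for ell in range(1, 11):
    q, c = step(q, c)
    print(f"layer {ell}: q = {q:.4f}, c = {c:.4f}")
```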

Next, you can study not only the averages of these quantities but also the complete distributions of the (pre)activations themselves. It turns out they’re Gaussian at initialization (surprise, surprise), and the network function itself is a “Gaussian process” with a covariance kernel that you can obtain in closed form.

It’s well worth working through the NNGP idea and getting intuition both for GPs and for the forward-prop statistics that give rise to a GP in this context. Notice that MLPs have NNGPs with rotation-invariant kernels. This will remain a useful intuition.
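
If you want to see the GP claim in action, one convenient special case is a bias-free ReLU MLP, whose NNGP kernel recursion has a closed form (the degree-one arc-cosine kernel of Cho & Saul). The sketch below compares that prediction against the empirical output covariance over many independently initialized finite-width networks; all of the sizes are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(u, 0.0)

def relu_gauss_mean(kxx, kxy, kyy):
    """E[relu(u) relu(v)] for zero-mean jointly Gaussian (u, v) with the given
    (co)variances: the degree-1 arc-cosine kernel."""
    rho = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    theta = np.arccos(rho)
    return np.sqrt(kxx * kyy) / (2 * np.pi) * (np.sin(theta) + (np.pi - theta) * rho)

def nngp_kernel(x, y, depth, sigma_w2=2.0):
    """NNGP covariance of the scalar outputs of a bias-free ReLU MLP with `depth`
    hidden layers, weights ~ N(0, sigma_w2 / fan_in), and a linear readout."""
    d = len(x)
    kxx, kxy, kyy = (sigma_w2 / d) * np.array([x @ x, x @ y, y @ y])
    for _ in range(depth):   # depth - 1 hidden-to-hidden steps plus the readout
        kxx, kxy, kyy = (
            sigma_w2 * relu_gauss_mean(kxx, kxx, kxx),
            sigma_w2 * relu_gauss_mean(kxx, kxy, kyy),
            sigma_w2 * relu_gauss_mean(kyy, kyy, kyy),
        )
    return kxy

def random_net_outputs(X, width, depth, sigma_w2=2.0):
    """Outputs of one freshly sampled ReLU MLP of that architecture on the rows of X."""
    A = X.T
    for _ in range(depth):
        fan_in = A.shape[0]
        W = rng.standard_normal((width, fan_in)) * np.sqrt(sigma_w2 / fan_in)
        A = relu(W @ A)
    v = rng.standard_normal(width) * np.sqrt(sigma_w2 / width)   # linear readout
    return v @ A

d, width, depth, n_nets = 16, 300, 3, 1000
x, y = rng.standard_normal(d), rng.standard_normal(d)
samples = np.array([random_net_outputs(np.stack([x, y]), width, depth)
                    for _ in range(n_nets)])
print("empirical E[f(x) f(y)] :", np.mean(samples[:, 0] * samples[:, 1]))
print("NNGP kernel prediction :", nngp_kernel(x, y, depth))
```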

At this point in our discussion, we already have papers that calculate average-case quantities exactly, with predictions that agree well with experiments on networks with widths in the hundreds or thousands. Look at how good the agreement is in these plots:

[Figure: theory-experiment agreement plots]

Top: signal propagation of layerwise correlations for a deep \(\tanh\) net from Poole et al. (2016). Blue, green, and red sets of curves correspond to weight initialization variances \(\sigma_w^2 = \{1.3, 2.3, 4.0\}\). Different saturations correspond to different initial correlations. Experiment dots lie very close to the theory curves. Bottom: performance of NNGP regression vs. ordered/chaotic regimes from Lee et al. (2017). Left subplot shows the test accuracy of GP regression with a depth-50 \(\tanh\) NNGP on MNIST. Right subplot shows prediction of the ordered and chaotic regimes using the same machinery as Poole et al. (2016). The best performance falls near the boundary between order and chaos. It’s significant that we can quantitatively predict the structure of a phase diagram of model performance, even in a simplified setting like NNGP regression.

It’s worth appreciating that extremely good agreement with experiment is possible if we’re studying the right objects in the right regimes. Most deep learning theory work that can’t get agreement this good eventually fades or is replaced by something that does. It’s usually wise to insist on a quantitative match from your theory and be satisfied with nothing less.

Infinite-width nets under gradient descent: the NTK

Now that we understand initialization in wide networks, we’re ready to study training. The first milestone on this path is the “neural tangent kernel” (NTK). The main result here is that if you train an infinite-width network (in the same standard parameterization as the NNGP setting) with gradient descent, its function-space evolution is described by a particular kernel which remains fixed at its initial value for all time. This kernel is the inner product of parameter gradient vectors:

\[ \text{NTK}(\mathbf{x}, \mathbf{x}') := \left\langle \nabla_{\boldsymbol{\theta}} f_{\boldsymbol{\theta}}(\mathbf{x}), \nabla_{\boldsymbol{\theta}} f_{\boldsymbol{\theta}}(\mathbf{x}') \right\rangle. \]

The primary consequence is that, in this limit, the learning dynamics of a neural network are, by the “kernel trick,” equivalent to the dynamics of a model that is linear in its parameters. The final learned function is given by kernel (ridge) regression with the NTK.
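
As a concrete example, for a two-layer network the parameter gradients can be written out by hand, so you can form the empirical NTK directly and use it for kernel (ridge) regression. Everything below (architecture, data, the tiny ridge) is an arbitrary toy setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network in NTK-style scaling: f(x) = a . phi(W x) / sqrt(n).
n, d = 4096, 8
W = rng.standard_normal((n, d)) / np.sqrt(d)   # first layer at LeCun scale
a = rng.standard_normal(n)                     # readout; the 1/sqrt(n) sits in f
phi, dphi = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2

def ntk(X1, X2):
    """Empirical NTK <grad_theta f(x), grad_theta f(x')> at initialization.
    X1, X2 have shape (num_points, d)."""
    U1, U2 = X1 @ W.T, X2 @ W.T                          # preactivations
    term_a = phi(U1) @ phi(U2).T / n                     # gradients w.r.t. a
    term_W = (X1 @ X2.T) * ((a * dphi(U1)) @ (a * dphi(U2)).T) / n   # w.r.t. W
    return term_a + term_W

# Kernel ridge regression with the NTK: (roughly) the function an infinitely
# wide network would learn by gradient descent on these data.
X_train = rng.standard_normal((64, d))
y_train = np.sin(X_train[:, 0])                          # toy target
X_test = rng.standard_normal((16, d))

K = ntk(X_train, X_train)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(K)), y_train)
y_pred = ntk(X_test, X_train) @ alpha
print("test RMSE:", np.sqrt(np.mean((y_pred - np.sin(X_test[:, 0])) ** 2)))
```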

Motivated by the NTK, people found new and clever ways to pose, for kernel regression, the questions we want answered about deep learning. This gave the field a useful new tool and has led to some moderately valuable insights. For example, networks in the NTK limit always converge to zero training loss so long as the learning rate isn’t too big, and this served as a useful demonstration of how overparameterization usually makes optimization easier, not harder. Kernel models will appear a few other times in other chapters of this guide.

Scaling analysis of feature evolution: the maximal update parameterization (\(\mu\)P)

After the development of the NTK, people quickly noticed that networks in this limit don’t exhibit feature learning. That is, at infinite width, the hidden neurons of a network represent the same functions after training as they did at initialization. At large-but-finite width, the features do move, but by an amount that shrinks to zero as the width grows (as the sketch below illustrates). This is a first clue that the pretty, analytically tractable NTK limit isn’t the end of the story.
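
Here’s a small numerical illustration (a sketch with arbitrary toy choices and hand-coded gradients): train a two-layer net in the standard \(1/\sqrt{n}\) scaling and measure how much the hidden features move. The relative change shrinks as the width grows.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, dphi = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2

def feature_movement(n, d=8, num_data=32, steps=500, lr=0.2):
    """Train f(x) = a . phi(W x) / sqrt(n) by full-batch gradient descent on MSE,
    then report the relative change of the hidden features phi(W x)."""
    X = rng.standard_normal((num_data, d))
    y = np.sin(X[:, 0])
    W = rng.standard_normal((n, d)) / np.sqrt(d)
    a = rng.standard_normal(n)
    H0 = phi(X @ W.T)                          # hidden features at initialization
    for _ in range(steps):
        U = X @ W.T
        r = phi(U) @ a / np.sqrt(n) - y        # residuals
        grad_a = phi(U).T @ r / np.sqrt(n)
        grad_W = (dphi(U) * (r[:, None] * a[None, :])).T @ X / np.sqrt(n)
        a -= lr * grad_a / num_data
        W -= lr * grad_W / num_data
    H1 = phi(X @ W.T)
    return np.linalg.norm(H1 - H0) / np.linalg.norm(H0)

for n in [64, 256, 1024, 4096]:
    print(f"width {n:5d}: relative feature change = {feature_movement(n):.4f}")
```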

For a few years, it seemed like we might have to give up on infinite-width networks. Fortunately, it turned out that there’s another coherent infinite-width limit in which things scale differently, and the network does actually undergo feature learning. This is the regime in which most deep learning theory now takes place.

Here’s the evolution of ideas, some key papers, and key takeaways:

These “rich,” feature-learning \(\mu\)P dynamics led to a paradigm shift in deep learning theory. Most later work uses or relates to \(\mu\)P in some way, so it’s very important to understand. A seasoned deep learning theorist should be able to sit down and derive the \(\mu\)P parameterization, or something equivalent to it, from first principles. It’s difficult to do relevant work in 2025 without it!
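
To give a flavor of such a derivation (this is only a sketch, for SGD, in one common “multiplier” presentation; other equivalent parameterizations exist), write a width-\(n\) MLP with an explicit \(1/n\) factor on the output,

\[ \mathbf{h}_1 = W_1 \mathbf{x}, \qquad \mathbf{h}_{\ell+1} = W_{\ell+1}\, \phi(\mathbf{h}_\ell), \qquad f(\mathbf{x}) = \frac{1}{n}\, \mathbf{v} \cdot \phi(\mathbf{h}_L), \]

with weight entries initialized as \(\mathcal{N}(0, 1/\text{[fan in]})\) and \(\mathbf{v}\) as \(\mathcal{N}(0, 1)\). Demand that (a) the entries of each \(\mathbf{h}_\ell\) are \(\Theta(1)\) at initialization and (b) after one SGD step, the change in each \(\mathbf{h}_\ell\) and in \(f\) is also \(\Theta(1)\), i.e. features actually move. Backpropagating through the \(1/n\) output factor gives \(\partial \mathcal{L} / \partial h_{\ell, i} = \Theta(1/n)\) (assuming a \(\Theta(1)\) loss derivative \(\partial \mathcal{L}/\partial f\)), and a gradient step changes a layer’s preactivations (holding its input fixed) by

\[ \Delta h_{\ell, i} = -\eta_\ell \, \frac{\partial \mathcal{L}}{\partial h_{\ell, i}} \, \big\lVert \phi(\mathbf{h}_{\ell-1}) \big\rVert^2, \qquad \text{which has magnitude } \eta_\ell \cdot \Theta\!\left(\tfrac{1}{n}\right) \cdot \Theta(n), \]

so the hidden layers want learning rates \(\eta_\ell = \Theta(1)\), while the input layer (whose incoming vector \(\mathbf{x}\) has only \(\Theta(1)\) squared norm) and the readout want learning rates scaled up by \(\Theta(n)\). Note that in this presentation the readout carries \(1/n\) rather than the \(1/\sqrt{n}\) of the NTK scaling; that extra \(1/\sqrt{n}\) on the output layer, together with the per-layer learning rates, is the characteristic difference. Chasing these scalings through every layer, every optimizer, and every width-dependent factor is, in essence, the derivation of \(\mu\)P.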

It’s worth noting that, unlike the NTK limit, the \(\mu\)P limit is very difficult to study analytically. In the NTK limit, we have kernel behavior, simple gradient descent dynamics, a convex loss surface (using squared loss), and lots of older theoretical tools that we can bring to bear. In the \(\mu\)P limit, we have none of this. To our knowledge, nobody’s even shown a general result that a deep network in the \(\mu\)P limit converges, let alone characterized the solution that’s found!

Open question: when does a network in the \(\mu\)P limit converge under gradient descent?

In the NTK limit, we can study the model with the well-established math of kernel theory, which already existed and has now been developed further expressly for the study of the NTK. In the \(\mu\)P limit, the best we have so far are rather complex calculational frameworks:

Open question: is there a simple calculational framework — potentially making realistic simplifying assumptions — that allows us to quantitatively study feature evolution in the rich regime?

Onwards: towards infinite depth

Early in this chapter, we took width to infinity, which gave us access to a host of useful calculational tools. We can also take depth to infinity. There are several ways to do this, but the upshot is that one quickly encounters stability problems with a standard MLP, so a ResNet formulation in which each layer’s residual branch gets a small premultiplier seems like the most promising choice (see the sketch below).
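
As a quick numerical illustration (a sketch with arbitrary sizes, using a \(1/\sqrt{\text{depth}}\) branch premultiplier, one common choice), compare the signal size at the output of a deep residual stack with and without the scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

def q(h):
    return np.linalg.norm(h) / np.sqrt(h.size)

def residual_stack(depth, width=256, branch_scale=1.0):
    """Propagate a random input through `depth` residual blocks
    h <- h + branch_scale * W tanh(h), with LeCun-initialized W."""
    h = rng.standard_normal(width)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) / np.sqrt(width)
        h = h + branch_scale * (W @ np.tanh(h))
    return q(h)

for depth in [16, 64, 256, 1024]:
    print(f"depth {depth:4d}:  unscaled q = {residual_stack(depth):6.2f},  "
          f"1/sqrt(depth)-scaled q = {residual_stack(depth, branch_scale=depth**-0.5):.2f}")
```

The unscaled stack’s signal size grows without bound with depth, while the scaled one stays \(O(1)\), which is what makes a sensible infinite-depth limit possible.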

Open question: is there a simple calculational framework for studying the feature evolution of an infinite-depth network?

