By Charles Martin, Machine Learning Specialist
Editor's note: You can read the previous post in this series, Power Laws in Deep Learning, here.
Power Law Distributions in Deep Learning
In a previous post, we saw that the Fully Connected (FC) layers of the most common pre-trained Deep Learning models display power law behavior. Specifically, for each FC weight matrix
For every FC matrix, the eigenvalue frequencies
where the exponents
Remarkably, the FC matrices all lie within the Universality Class of Fat Tailed Random Matrices!
Heavy Tailed Random Matrices
We define a random matrix by defining a matrix
- Gaussian Random Matrix: the matrix elements are drawn from a Gaussian distribution
- Heavy Tailed Random Matrix: the matrix elements are drawn from a power law distribution
In either case, Random Matrix Theory tells us what the asymptotic form of ESD should look like. But first, let’s see what model works best.
First, let's look at the ESD
Recall that AlexNet FC3 fits a power law with exponent $\alpha\sim$ , so we also plot the ESD on a log-log scale.
AlexNet Layer FC3 Log Log Histogram of ESD
Notice that the distribution is linear in the central region, and the long tail cuts off sharply. This is typical of the ESDs for the fully connected (FC) layers of all the pretrained models we have looked at so far. We now ask…
What kind of Random Matrix would make a good model for this ESD?
ESDs: Gaussian random matrices
We first generate a few Gaussian Random matrices (mean 0, variance 1), for different aspect ratios Q, and plot the histogram of their eigenvalues.
Empirical Spectral Density (ESD) for Gaussian Random Matrices, with different Q values.
Notice that the shape of the ESD depends only on Q, and is tightly bounded; there is, in fact, effectively no tail at all to the distributions (except, perhaps misleadingly, for Q=1).
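The Gaussian experiment can be sketched in a few lines of numpy. This is a minimal sketch rather than the post's own code: it assumes the ESD is built from the eigenvalues of the correlation matrix X = WᵀW/N, that Q = N/M is the aspect ratio of an N×M matrix W, and it compares the spectrum with the standard Marchenko-Pastur bulk edges σ²(1 ± 1/√Q)². The helper names (`gaussian_esd`, `mp_edges`) and the matrix sizes are illustrative choices.

```python
import numpy as np

def gaussian_esd(N, M, seed=0):
    """Eigenvalues of X = W^T W / N for an N x M Gaussian W (mean 0, variance 1).

    Assumption: the ESD is taken over the eigenvalues of this correlation matrix.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(loc=0.0, scale=1.0, size=(N, M))
    X = W.T @ W / N
    return np.linalg.eigvalsh(X)  # real eigenvalues, ascending order

def mp_edges(Q, sigma=1.0):
    """Marchenko-Pastur bulk edges for aspect ratio Q = N / M >= 1."""
    lower = sigma**2 * (1.0 - 1.0 / np.sqrt(Q)) ** 2
    upper = sigma**2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2
    return lower, upper

M = 300  # illustrative size, not taken from the post
for Q in (1, 2, 4):
    evals = gaussian_esd(Q * M, M)
    lower, upper = mp_edges(Q)
    print(f"Q={Q}: eigenvalues in [{evals.min():.3f}, {evals.max():.3f}], "
          f"MP edges [{lower:.3f}, {upper:.3f}]")
```

Passing `evals` to a histogram routine such as `matplotlib.pyplot.hist` gives a plot of the kind described above; the printed ranges show how closely the finite-size spectrum hugs the predicted edges.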
ESDs: Power Laws and Log Log Histograms
We can generate a heavy-tailed (or fat-tailed) random matrix just as easily using the numpy Pareto function.
Heavy Tailed Random matrices have very different ESDs. They have very long tails, so long, in fact, that it is better to plot them on a log-log histogram.
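A minimal sketch of this step, under stated assumptions: the entries of W are drawn with numpy's `Generator.pareto` (which samples a Lomax, or Pareto II, distribution whose density tail falls off as x^-(a+1)), the ESD is taken from X = WᵀW/N, and the histogram uses logarithmically spaced bins. The exponent name `mu`, the three values tried, and the helper names are illustrative and not taken from the post.

```python
import numpy as np

def heavy_tailed_esd(N, M, mu, seed=0):
    """Eigenvalues of X = W^T W / N with Pareto (Lomax) distributed entries.

    `mu` is the shape parameter passed to numpy's pareto sampler (assumed naming).
    """
    rng = np.random.default_rng(seed)
    W = rng.pareto(mu, size=(N, M))
    evals = np.linalg.eigvalsh(W.T @ W / N)
    return evals[evals > 0]  # drop round-off zeros/negatives before taking logs

def log_log_histogram(evals, bins=50):
    """Density histogram on log-spaced bins; returns (bin centers, density)."""
    edges = np.geomspace(evals.min(), evals.max(), bins + 1)
    density, edges = np.histogram(evals, bins=edges, density=True)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin midpoints
    return centers, density

for mu in (1.0, 3.0, 5.0):  # illustrative exponents
    evals = heavy_tailed_esd(400, 200, mu)
    decades = np.log10(evals.max() / evals.min())
    centers, density = log_log_histogram(evals)
    print(f"mu={mu}: ESD spans about {decades:.1f} orders of magnitude")
```

Plotting `density` against `centers` with both axes logarithmic (for example `matplotlib.pyplot.loglog`) gives the log-log histogram; a power law region shows up as a straight line.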
Do any of these look like a plausible model for the ESDs of the weight matrices of a big DNN, like AlexNet?
- the smallest exponent, (blue), has a very long tail, extending over 11 orders of magnitude. This means the largest eigenvalues would be. No real W would behave like this.
- the largest exponent, (red), has a very compact ESD, resembling more the Gaussian Ws above.
- the fat tailed ESD (green), however, is just about right. The ESD is linear in the central region, suggesting a power law. It is a little too large for our eigenvalues, but the tail also cuts off sharply, which is expected for any finite W. So we are close.
Let's overlay the ESD of fat-tailed W with the actual empirical
We see a pretty good match to a Fat-tailed random matrix with
Turns out, there is something very special about
Random Matrix Theory predicts the shape of the ESD, in the asymptotic limit, for several kinds of Random Matrix, called Universality Classes. The 3 different values of
In particular, if we draw
What is more, the predicted ESDs have different, characteristic global and local shapes, for specific ranges of
the ESDs of the fully connected (FC) layers of pretrained DNNs all resemble the ESDs of the
But this is a little tricky to show, because we need to show that
RMT tells us that, for
And this works pretty well in practice for the Heavy Tailed Universality Class, for
Statistics of the maximum eigenvalue(s)
RMT not only tells us about the shape of the ESD; it makes statements about the statistics of the edge and/or tails — the fluctuations in the maximum eigenvalue
- Gaussian RMT:
- Fat Tailed RMT:
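One way to get a feel for these edge statistics is to sample the maximum eigenvalue many times and compare how much it fluctuates in the two cases. The following is a rough sketch under stated assumptions: X = WᵀW/N, small illustrative matrix sizes, a Pareto shape parameter `mu = 1.0` chosen only for illustration, and the coefficient of variation used as a crude measure of spread. It does not attempt to fit any limiting distribution.

```python
import numpy as np

def max_eigenvalue(W):
    """Largest eigenvalue of X = W^T W / N (eigvalsh returns ascending order)."""
    N = W.shape[0]
    return np.linalg.eigvalsh(W.T @ W / N)[-1]

def relative_spread(sampler, trials=30):
    """Coefficient of variation of the max eigenvalue over repeated draws."""
    lam_max = np.array([max_eigenvalue(sampler()) for _ in range(trials)])
    return lam_max.std() / lam_max.mean()

rng = np.random.default_rng(0)
N, M = 200, 100   # illustrative sizes
mu = 1.0          # illustrative heavy-tail exponent (assumed naming)

gauss_cv = relative_spread(lambda: rng.normal(size=(N, M)))
heavy_cv = relative_spread(lambda: rng.pareto(mu, size=(N, M)))

print(f"Gaussian     : relative spread of max eigenvalue = {gauss_cv:.3f}")
print(f"Heavy tailed : relative spread of max eigenvalue = {heavy_cv:.3f}")
```

In the Gaussian case the largest eigenvalue stays pinned near the bulk edge, while in the heavy-tailed case it is driven by the few largest matrix entries and varies strongly from draw to draw.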
For standard, Gaussian RMT, the
In particular, the effects of M and Q kick in as soon as
And, for us, this affects how we estimate
Fat Tailed Matrices and the Finite Size Effects for
Here, we generate ESDs for 3 different Pareto Heavy Tailed random matrices, with fixed M (left) or fixed N (right), but different Q. We fit each ESD to a Power Law. We then plot
The red lines are predicted by Heavy Tailed RMT (MP) theory, which works well for Heavy Tailed ESDs with
We can identify finite size matrices W that behave like the Fat Tailed Universality Class of RMT (
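The fitting experiment can be approximated as follows. This is a sketch, not the code from the post's repository: it swaps in the standard continuous maximum-likelihood estimator for a power law exponent, α̂ = 1 + n / Σ ln(xᵢ / x_min), and it picks x_min as the median eigenvalue, which is an arbitrary illustrative choice. The matrix exponent `mu`, the sizes, and the helper names are likewise assumptions.

```python
import numpy as np

def fit_alpha(values, xmin):
    """MLE of alpha for a continuous power law p(x) ~ x^(-alpha), x >= xmin."""
    tail = values[values >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))

def heavy_tailed_esd(N, M, mu, rng):
    """Positive eigenvalues of X = W^T W / N with Pareto (Lomax) entries."""
    W = rng.pareto(mu, size=(N, M))
    evals = np.linalg.eigvalsh(W.T @ W / N)
    return evals[evals > 0]

rng = np.random.default_rng(0)
M, mu = 200, 3.0  # fixed M, illustrative exponent
for Q in (1, 2, 4):
    evals = heavy_tailed_esd(Q * M, M, mu, rng)
    xmin = np.median(evals)  # hypothetical cutoff; a real fit would select xmin
    print(f"Q={Q}: fitted alpha = {fit_alpha(evals, xmin):.2f}")
```

Repeating the loop with N held fixed instead of M, and over a range of `mu` values, gives the kind of finite-size comparison described above. The fitted exponent is sensitive to the choice of x_min, so the numbers from this sketch should be read qualitatively.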
It is amazing that Deep Neural Networks display this Universality in their weight matrices, and this suggests some deeper reason for Why Deep Learning Works.
Self Organized Criticality
In statistical physics, if a system displays power laws, this can be evidence that it is operating near a critical point. It is known that real, spiking neurons display this behavior, called Self Organized Criticality.
It appears that Deep Neural Networks may be operating under similar principles, and in future work, we will examine this relation in more detail.
The code for this post is in this github repo on ImplicitSelfRegularization
For more information, see this recorded talk on this topic: Why Deep Learning Works: Implicit Self-Regularization in Deep Neural Networks
Bio: Dr. Charles Martin is a specialist in Machine Learning,