Neural Redshift: Random Networks are not Random Functions shows that even randomly initialized neural nets tend to implement simple functions (as measured by frequency, polynomial order, and compressibility), and that this bias can be partially attributed to ReLUs. Previous speculation on simplicity biases focused mostly on SGD, but SGD is now clearly not the only contributor.
The authors propose that good generalization occurs when an architecture’s preferred complexity matches the target function’s complexity. We should think about how compatible this is with our projections for how future neural nets might behave. For example: If this proposition were true and a significant decider of generalization ability, would this make mesaoptimization less likely? More likely?
As an aside: research on inductive biases could be very impactful. My impression is that far fewer resources are spent studying inductive biases than interpretability, but inductive bias research could be feasible on small compute budgets, and could tell us a lot about what to expect as we scale neural nets.
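In the spirit of the small-compute point above, the low-frequency bias of random ReLU nets is cheap to check on a laptop. Here's a minimal sketch (my own toy setup, not the paper's exact protocol: widths, He-style init, and the quarter-spectrum comparison are all choices I made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_relu_mlp(x, widths=(1, 64, 64, 1)):
    """Forward pass of a freshly initialised ReLU MLP (He-style init, no training)."""
    h = x.reshape(-1, 1)
    for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        W = rng.normal(0, np.sqrt(2 / n_in), (n_in, n_out))
        h = h @ W
        if i < len(widths) - 2:  # no ReLU on the output layer
            h = np.maximum(h, 0)
    return h.ravel()

# Evaluate the random network on a 1-D grid and look at its power spectrum.
x = np.linspace(-1, 1, 1024)
y = random_relu_mlp(x)
power = np.abs(np.fft.rfft(y - y.mean())) ** 2

low = power[: len(power) // 4].sum()    # lowest quarter of frequencies
high = power[-(len(power) // 4):].sum()  # highest quarter
print(low > high)  # low frequencies dominate the spectrum
```

A random ReLU net computes a piecewise-linear function, so its spectrum decays quickly with frequency; that's the "simpler than random" behaviour the paper quantifies.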
Singular Learning Theory explains/predicts this. If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count. That's simply because most of the loss landscape is taken up by the biggest, and thus simplest, behavioural regions.
You can see this happening if you watch proxies for the effective parameter count while models train. E.g. a modular-addition transformer or an MNIST MLP starts out with very few effective parameters at initialisation, then gains more as the network trains. If the network goes through a grokking transition, you can watch the effective parameter count go down again.
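To make "effective parameter count" concrete, here is a deliberately crude proxy: count the parameters whose individual perturbation actually moves the loss. Real SLT work estimates the local learning coefficient instead; this one-at-a-time finite-difference probe, the toy task, and all the sizes below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny 1-hidden-layer ReLU net on a toy regression task.
X = rng.normal(size=(32, 2))
y = X[:, 0] - X[:, 1]

W1 = rng.normal(0, 0.5, (2, 8))
W2 = rng.normal(0, 0.5, (8, 1))

def loss(W1, W2):
    h = np.maximum(X @ W1, 0)
    pred = (h @ W2).ravel()
    return np.mean((pred - y) ** 2)

def effective_param_count(W1, W2, eps=1e-3, tol=1e-8):
    """Count parameters whose perturbation changes the loss at all.

    Degenerate directions (e.g. weights feeding a dead ReLU unit) leave
    the loss unchanged and so don't get counted.
    """
    base = loss(W1, W2)
    count = 0
    for W in (W1, W2):
        for idx in np.ndindex(W.shape):
            old = W[idx]
            W[idx] = old + eps
            if abs(loss(W1, W2) - base) > tol:
                count += 1
            W[idx] = old  # restore the weight
    return count

total = W1.size + W2.size
print(effective_param_count(W1, W2), "of", total, "parameters are 'effective'")
```

At a typical random init most parameters register as effective under this probe; the SLT quantity is finer-grained, but the flavour is the same: some directions in parameter space just don't change behaviour.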
For example: If this proposition were true and a significant decider of generalization ability, would this make mesaoptimization less likely? More likely?
≈ no change I’d say. We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn’t work. So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don’t generalise out of distribution. SLT just tells us how that works.
One takeaway might be that observations about how biological brains train are more applicable to AI training than one might have previously thought. Previously, you could’ve figured that since AIs use variants of gradient descent as their updating algorithm, while the brain uses we-don’t-even-know-what, their inductive biases could be completely different.
Now, it’s looking like the updating rule you use doesn’t actually matter that much for determining the inductive bias. Anything in a wide class of local optimisation methods might give you pretty similar stuff. Some methods are a lot more efficient than others, but the real pixie fairy dust that makes any of this possible is in the architecture, not the updating rule.
(Obviously, it still matters what loss signal you use. You can’t just expect that an AI will converge to learn the same desires a human brain would, unless the AI’s training signals are similar to those used by the human brain. And we don’t know what most of the brain’s training signals are.)
I think the predictions SLT makes are different from the results in the neural redshift paper. For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this? Maybe you meant that SLT predicts that good generalization occurs when an architecture’s preferred complexity matches the target function’s complexity?
The explanation you give sounds like a different claim however.
If you go to a random point in the loss landscape, you very likely land in a large region implementing the same behaviour, meaning the network has a small effective parameter count
This is true of all neural nets, but the neural redshift paper claims that specific architectural decisions beat picking random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.
We already knew neural network training had a bias towards algorithmic simplicity of some kind, because otherwise it wouldn’t work.
We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.
So we knew general algorithms, like mesa-optimisers, would be preferred over memorised solutions that don’t generalise out of distribution.
The paper doesn’t just show a simplicity bias, it shows a bias for functions of a particular complexity that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
For example, if you use tanh instead of ReLU the simplicity bias is weaker. How does SLT explain/predict this?
It doesn’t. It just has neat language to talk about how the simplicity bias is reflected in the way the loss landscape of ReLU vs. tanh look different. It doesn’t let you predict ahead of checking that the ReLU loss landscape will look better.
Maybe you meant that SLT predicts that good generalization occurs when an architecture’s preferred complexity matches the target function’s complexity?
That is closer to what I meant, but it isn’t quite what SLT says. The architecture doesn’t need to be biased toward the target function’s complexity. It just needs to always prefer simpler fits to more complex ones.
SLT says neural network training works because, in a good NN architecture, simple solutions take up exponentially more space in the loss landscape. So if you can fit the target function on the training data with a fit of complexity 1, that’s the fit you’ll get. If there is no function of complexity 1 that matches the data, you’ll get a fit of complexity 2 instead. If there is no fit like that either, you’ll get complexity 3. And so on.
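A toy analogue of that complexity ladder, using polynomial degree as the stand-in for complexity (the fitting procedure and tolerance are made up for illustration, not anything SLT itself prescribes):

```python
import numpy as np

def simplest_polynomial_fit(x, y, max_degree=10, tol=1e-8):
    """Return the lowest-degree polynomial that matches (x, y) on the data.

    Try complexity 1, then 2, then 3, ... and stop at the first level
    that fits the training data.
    """
    for degree in range(1, max_degree + 1):
        coeffs = np.polyfit(x, y, degree)
        residual = np.max(np.abs(np.polyval(coeffs, x) - y))
        if residual < tol:
            return degree, coeffs
    raise ValueError("no fit up to max_degree matched the data")

x = np.linspace(-1, 1, 20)
degree, _ = simplest_polynomial_fit(x, 3 * x + 1)  # linear data
print(degree)  # -> 1
degree, _ = simplest_polynomial_fit(x, x**3 - x)   # cubic data
print(degree)  # -> 3
```

The point is that the fitter isn't biased toward any particular complexity level; it just always prefers the simplest fit that still matches the data.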
This is true of all neural nets, but the neural redshift paper claims that specific architectural decisions beat picking random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.
Sorry, I don’t understand what you mean here. The paper takes different architectures and compares what functions you get if you pick a point at random from their parameter spaces, right?
If you mean this
But unlike common wisdom, NNs do not have an inherent “simplicity bias”. This property depends on components such as ReLUs, residual connections, and layer normalizations.
then that claim is of course true. Making up architectures with bad inductive biases is easy, and I don’t think common wisdom thinks otherwise.
We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.
Sure, but for the question of whether mesa-optimisers will be selected for, why would it matter if the simplicity bias came from the updating rule instead of the architecture?
The paper doesn’t just show a simplicity bias, it shows a bias for functions of a particular complexity that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What would a ‘simplicity bias’ be other than a bias towards things simpler than random in whatever space we are referring to? ‘Simpler than random’ is what people mean when they talk about simplicity biases.
To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What do you mean by ‘similar complexity to the training set’? The message length of the training set is very likely going to be much longer than the message length of many mesa-optimisers, but that seems like an argument for mesa-optimiser selection if anything.
Though I hasten to add that SLT doesn’t actually say training prefers solutions with low K-complexity. A bias towards low learning coefficients seems to shake out as some sort of mix between a bias towards low K-complexity and a bias towards speed.
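The message-length point is easy to make concrete: a large dataset generated by a short rule has a much longer description than the rule itself, even after compression. A quick sketch (the specific rule and dataset size are arbitrary choices for illustration):

```python
import zlib

# A long "training set" generated by a very short rule.
rule = "bytes((i * i) % 251 for i in range(100_000))"
data = bytes((i * i) % 251 for i in range(100_000))

compressed = zlib.compress(data, level=9)

# The generating program is far shorter than even the compressed data,
# let alone the raw 100,000 bytes.
print(len(rule), len(compressed), len(data))
```

So "program that produces the data" can be vastly shorter than any direct encoding of the data, which is the sense in which a general algorithm can beat a memorised solution on description length.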
That is closer to what I meant, but it isn’t quite what SLT says. The architecture doesn’t need to be biased toward the target function’s complexity. It just needs to always prefer simpler fits to more complex ones.
This is why the neural redshift paper says something different to SLT. It says neural nets that generalize well don’t just have a simplicity bias, they have a bias for functions with similar complexity to the target function. This brings mesaoptimization into question, because although mesaoptimization is favored by a simplicity bias, it is not necessarily favored by a bias toward complexity similar to the target function’s.