For example, if you use tanh instead of ReLU, the simplicity bias is weaker. How does SLT explain or predict this?
It doesn’t. It just has neat language for talking about how the simplicity bias is reflected in the different ways the ReLU and tanh loss landscapes look. It doesn’t let you predict, ahead of checking, that the ReLU loss landscape will look better.
Maybe you meant that SLT predicts that good generalization occurs when an architecture’s preferred complexity matches the target function’s complexity?
That is closer to what I meant, but it isn’t quite what SLT says. The architecture doesn’t need to be biased toward the target function’s complexity. It just needs to always prefer simpler fits to more complex ones.
SLT says neural network training works because, in a good neural network architecture, simple solutions take up exponentially more space in the loss landscape than complex ones. So if you can fit the target function on the training data with a fit of complexity 1, that’s the fit you’ll get. If no function of complexity 1 matches the data, you’ll get a fit of complexity 2 instead. If there is no such fit either, you’ll get complexity 3. And so on.
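To put something more quantitative under ‘exponentially more space’: the dial SLT uses is the learning coefficient λ (the RLCT). A sketch of the standard volume-scaling result, in Watanabe’s notation rather than anything specific to this thread:

```latex
\[
  V(\varepsilon) \;=\; \operatorname{Vol}\{\, w : L(w) - L(w^\ast) < \varepsilon \,\}
  \;\approx\; c\,\varepsilon^{\lambda}\,(-\log\varepsilon)^{m-1}
  \quad \text{as } \varepsilon \to 0,
\]
where $\lambda$ is the learning coefficient of the solution $w^\ast$ and $m$ its
multiplicity. For small $\varepsilon$, a smaller $\lambda$ makes
$\varepsilon^{\lambda}$ larger, so more degenerate (``simpler'') solutions
occupy far more volume at any fixed loss tolerance and dominate the posterior.
```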
This is true of all neural nets, but the neural redshift paper claims that it is specific architectural decisions that determine what you get when you pick random points in the loss landscape. Neural redshift could be true in worlds where the SLT prediction was either true or false.
Sorry, I don’t understand what you mean here. The paper takes different architectures and compares what functions you get if you pick a point at random from their parameter spaces, right?
If you mean this:
But unlike common wisdom, NNs do not have an inherent “simplicity bias”. This property depends on components such as ReLUs, residual connections, and layer normalizations.
then that claim is of course true. Making up architectures with bad inductive biases is easy, and I don’t think common wisdom thinks otherwise.
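For concreteness, here is a minimal toy version of that sampling experiment. This is my own sketch, not the paper’s code; the width, depth, init scales, and the Fourier-based complexity proxy are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mlp(activation, width=64, depth=4):
    """Build a randomly initialised scalar->scalar MLP, returned as a closure."""
    layers = []
    fan_in = 1
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(width, fan_in))
        b = rng.normal(0.0, 0.1, size=(width, 1))
        layers.append((W, b))
        fan_in = width
    W_out = rng.normal(0.0, np.sqrt(1.0 / fan_in), size=(1, fan_in))

    def f(x):
        # x: 1-D array of inputs; returns the network's output at each point.
        h = x[None, :]
        for W, b in layers:
            h = activation(W @ h + b)
        return (W_out @ h).ravel()

    return f

def high_freq_fraction(y):
    # Crude complexity proxy: share of Fourier energy above a fixed cutoff.
    spectrum = np.abs(np.fft.rfft(y - y.mean())) ** 2
    cutoff = len(spectrum) // 8
    return spectrum[cutoff:].sum() / (spectrum.sum() + 1e-12)

x = np.linspace(-2.0, 2.0, 512)
activations = {"relu": lambda z: np.maximum(z, 0.0), "tanh": np.tanh}
for name, act in activations.items():
    scores = [high_freq_fraction(random_mlp(act)(x)) for _ in range(200)]
    print(f"{name}: mean high-frequency fraction = {np.mean(scores):.4f}")
```

Whether the tanh distribution actually sits at higher complexity than ReLU at a given init scale is exactly the kind of thing the paper checks empirically; the sketch just shows what ‘pick a random point in parameter space and look at the function’ means.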
We knew this, but the neural redshift paper claims that the simplicity bias is unrelated to training.
Sure, but for the question of whether mesa-optimisers will be selected for, why would it matter if the simplicity bias came from the updating rule instead of the architecture?
The paper doesn’t just show a simplicity bias; it shows a bias toward functions of a particular complexity, one that is simpler than random. To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What would a ‘simplicity bias’ be other than a bias towards things simpler than random in whatever space we are referring to? ‘Simpler than random’ is what people mean when they talk about simplicity biases.
To me this speaks against the likelihood of mesaoptimization, because it seems unlikely a mesaoptimizer would be similar in complexity to the training set if that training set did not describe an optimizer.
What do you mean by ‘similar complexity to the training set’? The message length of the training set is very likely going to be much longer than the message length of many mesa-optimisers, but that seems like an argument for mesa-optimiser selection if anything.
Though I hasten to add that SLT doesn’t actually say training prefers solutions with low K-complexity. A bias towards low learning coefficients seems to shake out into some sort of mix between a bias towards low K-complexity and a bias towards speed.
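For reference, the quantity doing the work there is λ in the asymptotic Bayes free energy. A sketch in standard SLT notation, my paraphrase rather than anything from this exchange:

```latex
\[
  F_n \;\approx\; n\,L_n(w^\ast) \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; O(1).
\]
```

The posterior trades fit against λ, and λ measures the degeneracy of the solution’s geometry, not its description length, which is why reading it straight as K-complexity doesn’t go through.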
That is closer to what I meant, but it isn’t quite what SLT says. The architecture doesn’t need to be biased toward the target function’s complexity. It just needs to always prefer simpler fits to more complex ones.
This is why the neural redshift paper says something different to SLT. It says neural nets that generalize well don’t just have a simplicity bias; they have a bias for functions with similar complexity to the target function. This calls mesaoptimization into question, because although mesaoptimization is favored by a simplicity bias, it is not necessarily favored by a bias toward the same complexity as the target function.