Does it make sense to say there is no inductive bias at work in modern ML models? It seems clear that literally brute-force searching over all possible algorithms would still be infeasible no matter how much compute you throw at it. Our models are very general, but when we use a diffusion model for images, it exploits (and is biased towards) the kind of local structure we expect of images; when we use a transformer for text, it exploits (and is biased towards) the kind of sequential pair-correlation you see in natural language; and so on.
I agree with the claim that a little inductive bias has to be there, if only because AIXI is an utterly unrealistic model of what future AIs will look like, and even AIXI-tl is very infeasible. But I think the closer claim is that the data matters way more than the architectural bias, and that does turn out to be true.
One example is that attention turns out to be more or less replaceable by MLP mixers (I've heard this from Gwern, but can't verify it), or see the link below:
https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dataset/
This is relevant for AI alignment and AI capabilities.
That sounds more like my intuition, though obviously there still have to be differences, given that we keep using self-attention (quadratic in the sequence length N) instead of MLPs (linear in N).
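A minimal sketch of that asymptotic difference, in plain numpy with toy dimensions of my own choosing: self-attention materialises an N × N score matrix, so its cost grows quadratically with the sequence length, while a position-wise MLP processes each token independently and scales linearly.

```python
# Toy comparison (illustrative only): attention builds an (N, N) matrix,
# a position-wise MLP never does.
import numpy as np

N, d, h = 1024, 64, 256                  # sequence length, width, MLP hidden size
x = np.random.randn(N, d)

# Self-attention (projections omitted for brevity): O(N^2 * d) work and memory.
scores = x @ x.T / np.sqrt(d)            # shape (N, N) -- the quadratic part
scores -= scores.max(axis=-1, keepdims=True)
w = np.exp(scores)
w /= w.sum(axis=-1, keepdims=True)       # row-wise softmax
attn_out = w @ x                         # shape (N, d)

# Position-wise MLP: O(N * d * h) work, linear in N, no (N, N) matrix anywhere.
W1, W2 = np.random.randn(d, h), np.random.randn(h, d)
mlp_out = np.maximum(x @ W1, 0) @ W2     # shape (N, d)

print(attn_out.shape, mlp_out.shape)     # (1024, 64) (1024, 64)
```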
In the limit of infinite scaling, the fact that MLPs are universal function approximators is a guarantee that you can do anything with them. But obviously we still would rather have something that can actually work with less-than-infinite amounts of compute.
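And a toy illustration of the universal-approximation point (my own example, nothing from the thread): with enough hidden units, even a single hidden layer with random, untrained weights can fit a smooth 1-D target like sin(x) once we solve for the output weights.

```python
# Fit sin(x) with one hidden layer of random features; only the output
# weights are solved for, via least squares.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 512)[:, None]
y = np.sin(x)

H = 256                                    # hidden width; more width -> better fit
W, b = rng.normal(size=(1, H)), rng.normal(size=H)
feats = np.tanh(x @ W + b)                 # random hidden layer, never trained

w_out, *_ = np.linalg.lstsq(feats, y, rcond=None)
print("max abs error:", float(np.max(np.abs(feats @ w_out - y))))  # typically very small
```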
I didn’t say there’s no inductive bias at work in models. Merely that trying to impose your own inductive biases on models is probably doomed for reasons best stated in the Bitter Lesson. The effectiveness of pretraining / scaling suggests that inductive biases work best when they are arrived at organically by very expressive models training on very large amounts of data.
Sure, but it's unclear whether these domain-specific inductive biases (the locality baked into a diffusion model for images, the sequence structure baked into a transformer for text) are necessary.

Concrete example: randomly initialised CNN weights can extract sufficiently good features to linearly classify MNIST. Another example: randomly initialised transformers with only the embeddings learnable can do modular addition.

Edit: On reflection, the two examples above actually support the idea that the inductive bias of the architecture is immensely helpful for solving tasks (to the extent that the weights themselves don't really matter). A better example of my point is that MLP-Mixer models can be as good as CNNs on vision tasks despite having much weaker architectural inductive biases towards vision.
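For what it's worth, here is a rough, self-contained sketch of the first concrete example's flavour (not the original experiment): freeze a randomly initialised conv stack, use it only as a feature extractor, and fit a linear classifier on top. I substitute sklearn's small 8×8 digits dataset for MNIST so the script runs without downloads, and the layer sizes are arbitrary.

```python
# Random, frozen CNN as a feature extractor + linear probe on top.
import torch
import torch.nn as nn
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

torch.manual_seed(0)

digits = load_digits()
X = torch.tensor(digits.images, dtype=torch.float32).unsqueeze(1)  # (N, 1, 8, 8)
y = digits.target

# Randomly initialised, never-trained convolutional feature extractor.
features = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(2), nn.Flatten(),
)
for p in features.parameters():
    p.requires_grad_(False)

with torch.no_grad():
    feats = features(X).numpy()

# Only this linear classifier is trained.
X_tr, X_te, y_tr, y_te = train_test_split(feats, y, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print("linear probe accuracy on random-CNN features:", clf.score(X_te, y_te))
```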
My understanding is that they’re necessary even in principle, since there are an unbounded number of functions that fit any finite set of points. Even AIXI has a strong inductive bias, toward programs with the lowest Kolmogorov complexity.
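To state that bias explicitly (standard notation, with U a universal prefix Turing machine, ℓ(p) the length of program p, and U(p) = x* meaning the program's output starts with x): Solomonoff induction, the epistemic core of AIXI, weights each program by 2^{-ℓ(p)}, so a string's prior mass is dominated by its shortest generating program, i.e. by its Kolmogorov complexity.

```latex
% Solomonoff prior used by AIXI: shorter programs get exponentially more weight.
\[
  M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)}
  \;\approx\; 2^{-K(x)},
  \qquad
  K(x) \;=\; \min_{p \,:\, U(p) = x*} \ell(p).
\]
```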
Yeah, I agree that some inductive bias is probably necessary. But not all inductive biases are equal; some are much more general than others, and in particular I claim that "narrow" inductive biases (e.g. specializing architectures to match domains) probably have roughly zero net benefit compared to inductive biases learned from data.
Interesting. But CNNs were originally developed for a reason, and the MLP-Mixer work does describe a rather specific architecture as well as "modern regularization techniques". I'd say all of that counts as baking some inductive biases into the model, though I agree it's a very light touch.
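To make the "light touch" concrete, here is a compressed PyTorch sketch of a Mixer-style block (simplified from the paper; hidden sizes are arbitrary). The remaining image-specific assumptions are essentially the patch embedding and the choice to share one token-mixing MLP across channels, which is much weaker than a CNN's local, translation-equivariant kernels, but still not zero bias.

```python
# Simplified Mixer block: one MLP mixes across patches, one across channels.
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, n_tokens, channels, token_hidden=256, channel_hidden=1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.token_mlp = nn.Sequential(      # mixes across patches (spatial)
            nn.Linear(n_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, n_tokens),
        )
        self.norm2 = nn.LayerNorm(channels)
        self.channel_mlp = nn.Sequential(    # mixes across channels (per patch)
            nn.Linear(channels, channel_hidden), nn.GELU(),
            nn.Linear(channel_hidden, channels),
        )

    def forward(self, x):                    # x: (batch, n_tokens, channels)
        y = self.norm1(x).transpose(1, 2)    # (batch, channels, n_tokens)
        x = x + self.token_mlp(y).transpose(1, 2)
        x = x + self.channel_mlp(self.norm2(x))
        return x

# e.g. 196 patches (14x14) of a 224x224 image, 512 channels per patch
block = MixerBlock(n_tokens=196, channels=512)
print(block(torch.randn(2, 196, 512)).shape)   # torch.Size([2, 196, 512])
```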