Adversarial examples suggest to me that by default ML systems don’t necessarily learn what we want them to learn:
They put too much emphasis on high frequency features, suggesting a different inductive bias from humans.
They don’t handle contradictory evidence in a reasonable way, i.e., giving a confident answer when high frequency features (pixel-level details) and low frequency features (overall shape) point to different answers.
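To make the high/low frequency split concrete, here is a minimal sketch of the kind of decomposition being talked about: blur the image to keep the coarse shape, and treat the residual as the pixel-level texture. The blur width and the placeholder image are my own illustrative choices, not anything from the discussion.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_frequencies(image, sigma=4.0):
    """Split a grayscale image into a low-frequency part (rough shape)
    and a high-frequency residual (texture / pixel-level detail)."""
    image = np.asarray(image, dtype=np.float64)
    low = gaussian_filter(image, sigma=sigma)   # blur keeps the coarse shape
    high = image - low                          # residual keeps the fine detail
    return low, high

# A texture-biased classifier leans mostly on `high`; a shape-biased one
# (closer to humans) leans mostly on `low`.
low, high = split_frequencies(np.random.rand(224, 224))  # placeholder image
```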
The evidence from adversarial training (AT) suggests to me that AT is merely patching symptoms (e.g., making the ML system de-emphasize certain specific features) and not fixing the underlying problem. At least this is my impression from watching this video on Adversarial Robustness, specifically the chapters on Adversarial Arms Race and Unforeseen Adversaries.
Aside from this, it’s also unclear how to apply AT to your original motivation:
> A function that tells your AI system whether an action looks good and is right virtually all of the time on natural inputs isn’t safe if you use it to drive an enormous search for unnatural (highly optimized) inputs on which it might behave very differently.
because in order to apply AT we need a model of what “attacks” the adversary is allowed to do (in this case the “attacker” is a superintelligence trying to optimize the universe, so we have to model it as being allowed to do anything?) and also ground-truth training labels.
For this purpose, I don’t think we can use the standard AT practice of assuming that any data point within a certain distance of a human-labeled instance, according to some metric, has the same label as that instance. If we instead let the training process query humans directly for training labels (i.e., how good some situation is) on arbitrary data points, that’s slow/costly if the process isn’t very sample efficient (which modern ML isn’t), and also scary given that human implementations of human values may already have adversarial examples. (The “perceptual wormholes” work and other evidence suggest that humans also aren’t 100% adversarially robust.)
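For concreteness, the standard practice referred to above is usually an L_inf-ball assumption trained against projected gradient ascent: every point within eps of a labeled example is assumed to keep its label, and the model is trained on the worst-case point in that ball. A minimal sketch with placeholder hyperparameters, not anything specific from the linked video:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    """Find a worst-case perturbation of x within an L_inf ball of radius eps,
    under the assumption that the label y holds everywhere in that ball."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step * grad.sign()             # ascend the loss
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project back into the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)                           # stay in valid pixel range
    return x_adv.detach()

# Adversarial training then minimizes the loss on these worst-case points:
#   loss = F.cross_entropy(model(pgd_attack(model, x, y)), y)
```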
My own thinking is that we probably need to go beyond adversarial training for this, along the lines of solving metaphilosophy and then using that solution to find/fix existing adversarial examples and correctly generalize human values out of distribution.
> They put too much emphasis on high frequency features, suggesting a different inductive bias from humans.
This was found not to be true at scale! It doesn’t even feel that true with weaker vision transformers; it seems specific to convnets.
I bet smaller animal brains have similar problems.
Do you know if it is happening naturally from increased scale, or only correlated with scale (people are intentionally trying to correct the “misalignment” between ML and humans of shape vs texture bias by changing aspects of the ML system like its training and architecture, while simultaneously increasing scale)? I somewhat suspect the latter due to the existence of a benchmark that the paper seems to target (“humans are at 96% shape / 4% texture bias and ViT-22B-384 achieves a previously unseen 87% shape bias / 13% texture bias”).
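For reference, my understanding of how those shape-bias percentages are computed on the cue-conflict benchmark: take images whose shape indicates one class and whose texture indicates another, keep only the model responses that match either cue, and report the shape fraction. A minimal sketch with made-up prediction arrays:

```python
import numpy as np

def shape_bias(pred, shape_label, texture_label):
    """Fraction of cue-conflict decisions that follow shape rather than texture.
    Only responses matching either the shape or the texture class count."""
    pred, shape_label, texture_label = map(np.asarray, (pred, shape_label, texture_label))
    shape_hits = (pred == shape_label)
    texture_hits = (pred == texture_label)
    decided = shape_hits | texture_hits
    return shape_hits.sum() / decided.sum()

# Toy example: 3 of the 4 "decided" responses follow shape -> 75% shape bias.
print(shape_bias(pred=[0, 0, 1, 2, 5],
                 shape_label=[0, 0, 1, 3, 3],
                 texture_label=[1, 2, 2, 2, 4]))
```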
In either case, it seems kind of bad that it has taken a decade or two to get to this point from when adversarial examples were first noticed, and it’s unclear whether other adversarial examples or “misalignment” remain in the vision transformer. If the first transformative AIs don’t quite learn the right values due to having a different inductive bias from humans, it may not matter much that 10 years later the problem would be solved.
Adversarial examples definitely still exist, but they’ll look less weird to you because of the shape bias.
Anyway, this is a random visual model: raw perception without any kind of reflective error-correction loop. I’m not sure what you expect it to do differently, or what conclusion you’re trying to draw from how it does behave. The inductive bias doesn’t precisely match human vision, so it makes different mistakes, but as you scale both architectures they become more similar. That’s exactly what you’d expect for any approximately Bayesian setup.
The shape bias increasing with scale was definitely conjectured long before it was tested. ML scaling is very recent though, and this experiment was quite expensive. Remember when GPT-2 came out and everyone thought that was a big model? This is an image classifier which is over 10x larger than that. They needed a giant image classification dataset which I don’t think even existed 5 years ago.
> The inductive bias doesn’t precisely match human vision, so it makes different mistakes, but as you scale both architectures they become more similar. That’s exactly what you’d expect for any approximately Bayesian setup.
I can certainly understand that as you scale both architectures, they both make fewer mistakes on-distribution. But do they also generalize out of the training distribution more similarly? If so, why? Can you explain this more? (I’m not getting your point from just “approximately Bayesian setup”.)
> They needed a giant image classification dataset which I don’t think even existed 5 years ago.
This is also confusing/concerning for me. Why would it be necessary or helpful to have such a large dataset to align the shape/texture bias with humans?
> But do they also generalize out of the training distribution more similarly? If so, why?
Neither of them is going to generalize very well out of distribution, and to the extent they do, it will be via looking for features that were present in-distribution. The old adage applies: “to imagine 10-dimensional space, first imagine 3-space, then say 10 really hard”.
My guess is that basically every learning system which tractably approximates Bayesian updating on noisy high-dimensional data is going to end up with roughly Gaussian OOD behavior. There have been some experiments where (non-adversarially-chosen) OOD samples quickly degrade to the uniform prior, but I don’t think that’s been super robustly studied.
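One cheap way to probe that “degrade to the uniform prior” claim, given some trained classifier, is to compare predictive entropy on in-distribution images vs. pure-noise inputs; near-uniform predictions push the entropy toward log(num_classes). A minimal sketch where `model` and the image tensors are placeholders, not anything from this thread:

```python
import torch
import torch.nn.functional as F

def mean_predictive_entropy(model, images):
    """Average entropy of the model's softmax outputs over a batch."""
    with torch.no_grad():
        probs = F.softmax(model(images), dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()

# Placeholder comparison: real validation images vs. uniform-noise "OOD" inputs.
# If predictions on noise collapse toward the uniform prior, the second number
# should approach log(num_classes).
# h_in  = mean_predictive_entropy(model, val_images)
# h_ood = mean_predictive_entropy(model, torch.rand_like(val_images))
```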
The way humans generalize OOD is not that our visual systems are natively equipped to generalize to contexts they have no way of knowing about (that would be a true violation of no-free-lunch theorems), but that through linguistic reflection & deliberate experimentation some of us can sometimes get a handle on the new domain, and then we use language to communicate that handle to others, who come up with things we didn’t, etc. OOD generalization is a process at the (sub)cultural & whole-nervous-system level, not something that individual chunks of the brain can do well on their own.
> This is also confusing/concerning for me. Why would it be necessary or helpful to have such a large dataset to align the shape/texture bias with humans?
Well it might not be, but you need large datasets to motivate studying large models, as their performance on small datasets like ImageNet is often only marginally better.
A 20B-param ViT trained on 10M images at 224x224x3 is approximately 1 param for every 75 subpixels, and 2,000 params for every image. Classification is an easy enough objective that it very likely just overfits unless you regularize it a ton, at which point it might still have the expected shape bias, at great expense. Training a 20B-param model is expensive; I don’t think anyone has ever spent that much on a mere ImageNet classifier, and public datasets >10x the size of ImageNet with any kind of labels only started getting collected in 2021.
To motivate this a bit: humans don’t see in frames, but let’s pretend we do. At 60 fps for 12 h/day for 10 years, that’s nearly 9.5 billion frames. ImageNet is 10 million images. Our visual cortex contains somewhere around 5 billion neurons, which is around 50 trillion parameters (at 1 param/synapse & 10k synapses/neuron, which is a number I remember being reasonable for the whole brain, but vision might be 1 or 2 OOM special in either direction).
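Spelling out the back-of-the-envelope arithmetic from the last two paragraphs (all inputs are the figures quoted above):

```python
# Parameters vs. data for a ~20B-param ViT on a 10M-image dataset.
params    = 20e9
images    = 10e6
subpixels = 224 * 224 * 3                       # per image
print(images * subpixels / params)              # ~75 subpixels per parameter
print(params / images)                          # 2000 parameters per image

# Rough human-vision comparison.
frames = 60 * 3600 * 12 * 365 * 10              # 60 fps, 12 h/day, 10 years
print(frames / 1e9)                             # ~9.5 billion frames
synapse_params = 5e9 * 10_000                   # 5B neurons x ~10k synapses each
print(synapse_params / 1e12)                    # ~50 trillion parameters
```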
Because of the strange loopy nature of concepts/language/self/different problems, metaphilosophy seems unsolvable?
Asking “What is good?” already implies that there are the concepts “good”, “what”, “being”, and that there are answers and questions … Now we could ask what concepts or questions to use instead …
Similarly:
> “What are all the things we can do with the things we have and what decision-making process will we use and why use that process if the character of the different processes is the production of different ends; don’t we have to know which end is desired in order to choose the decision-making process that also arrives at that result?”
> Which leads back to desire and knowing what you want without needing a system to tell you what you want.
It’s all empty in the Buddhist sense. It all depends on which concepts or Turing machines or which physical laws you start with.
> They put too much emphasis on high frequency features, suggesting a different inductive bias from humans.
Could you fix this part by adding high frequency noise to the images prior to training? Maybe lots of copies of each image with different noise patterns?
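If it helps to make the proposal concrete, here is a minimal sketch of that kind of augmentation: several copies of each training image, each with an independent Gaussian noise pattern, so that pixel-level detail stops being a stable cue. The copy count and noise scale are arbitrary placeholders, and whether this actually removes the texture bias is exactly the open question.

```python
import numpy as np

def noisy_copies(image, n_copies=8, sigma=0.05, rng=None):
    """Return n_copies of `image` (floats in [0, 1]), each perturbed with an
    independent Gaussian noise pattern of standard deviation `sigma`."""
    rng = rng if rng is not None else np.random.default_rng(0)
    noise = rng.normal(0.0, sigma, size=(n_copies, *np.shape(image)))
    return np.clip(np.asarray(image)[None] + noise, 0.0, 1.0)

# Usage: augmented = noisy_copies(img)  # img scaled to [0, 1], e.g. HxWx3
```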