deep learning is not unusually susceptible to adversarial examples
FWIW, this claim doesn’t match my intuition, and googling around, I wasn’t able to quickly find any papers or blog posts supporting it. This 2015 blog post discusses how deep learning models are susceptible due to linearity, which makes intuitive sense to me; the dot product is a relatively bad measure of similarity between vectors. It proposes a strategy for finding adversarial examples for a random forest and says it hasn’t yet empirically been confirmed that random forests are unsafe. This empirical confirmation seems pretty important to me, because adversarial examples are only a thing because the wrong decision boundary has been learned. If the only way to create an “adversarial” example for a random forest is to perturb an input until it genuinely appears to be a member of a different class, that doesn’t seem like a flaw. (I don’t expect that random forests always learn the correct decision boundaries, but my offhand guess would be that they are still less susceptible to adversarial examples than traditional deep models.)
I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).
From my perspective, a lot of AI safety challenges get vastly easier if you have the ability to train well-calibrated models for complex, unstructured data. If you have this ability, the AI’s model of human values doesn’t need to be perfect—since the model is well-calibrated, it knows what it does/does not know and can ask for clarification as necessary.
Calibration could also provide a very general solution for corrigibility: If the AI has a well-calibrated model of which actions are/are not corrigible, and just how bad various incorrigible actions are, then it can ask for clarification as needed on that too. Corrigibility learning allows for notions of corrigibility that are very fine-grained: you can tell the AI that preventing a bad guy from flipping its off switch is OK, but preventing a good guy from flipping its off switch is not OK. By training a model, you don’t have to spend a lot of time hand-engineering, and the model will hopefully generalize to incorrigible actions that the designers didn’t anticipate. (Per the calibration assumption, the model will usually either generalize correctly, or the AI will realize that it doesn’t know whether a novel plan would qualify as corrigible, and it can ask for clarification.)
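To make the “ask for clarification when unsure” loop concrete, here’s a minimal sketch; everything in it (the model interface, the `ask_human` channel, the 0.95 threshold) is a hypothetical illustration rather than a concrete proposal:

```python
# Minimal sketch of the "ask for clarification when unsure" loop. The model
# interface, the 0.95 threshold, and ask_human are all hypothetical.

def act_or_ask(model, action, threshold=0.95):
    """Decide on `action`, deferring to a human when the model is unsure.

    `model.predict_proba(action)` is assumed to return a calibrated
    probability that the action is acceptable/corrigible.
    """
    p_ok = model.predict_proba(action)
    if p_ok >= threshold:
        return "proceed"
    if p_ok <= 1 - threshold:
        return "refuse"
    # Calibrated confidence is in the uncertain middle band: ask for help.
    return ask_human(action)

def ask_human(action):
    # Placeholder for whatever clarification channel the designers provide.
    answer = input(f"Is {action!r} acceptable? [y/n] ")
    return "proceed" if answer.strip().lower().startswith("y") else "refuse"
```

The loop itself is the easy part, of course; the hard research problem is getting `predict_proba` to actually be well-calibrated on novel, out-of-distribution actions.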
That’s why I currently think improving ML models is the action with the highest leverage.
FWIW, this claim doesn’t match my intuition, and googling around, I wasn’t able to quickly find any papers or blog posts supporting it.
“Explaining and Harnessing Adversarial Examples” (Goodfellow et al. 2014) is the original demonstration that “Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples”.
I’ll emphasize that high dimensionality is a crucial piece of the puzzle, which I haven’t seen you bring up yet. You may already be aware of this, but it bears repeating: the usual intuitions do not even remotely apply in high-dimensional spaces. Check out Counterintuitive Properties of High Dimensional Space.
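As a toy illustration of the linearity point (my own sketch, not something from the paper): for a linear score w·x, the perturbation x + ε·sign(w) moves every coordinate by only ε, yet shifts the score by ε·‖w‖₁, which grows with the dimension:

```python
import numpy as np

# Toy illustration of the linearity argument (my own sketch, not from the
# paper): for a linear score w.x, the perturbation x + eps*sign(w) moves each
# coordinate by only eps, but shifts the score by eps * ||w||_1, which grows
# with the dimension.
rng = np.random.default_rng(0)
eps = 0.01
for d in [10, 1_000, 100_000]:
    w = rng.normal(size=d)            # weights of a linear model
    x = rng.normal(size=d)            # some input
    x_adv = x + eps * np.sign(w)      # tiny, worst-case per-coordinate change
    print(f"d={d:>7}  score shift = {w @ x_adv - w @ x:.2f}")
```

With ε = 0.01 the change to any single coordinate is imperceptible, but the score shift should come out around 0.008·d in this toy setup, which is enormous for image-sized inputs.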
adversarial examples are only a thing because the wrong decision boundary has been learned
In my opinion, this is spot-on, and not just your claim that there would be no adversarial examples if the decision boundary were perfect: a group of researchers are beginning to think that in a broader sense “adversarial vulnerability” and “amount of test set error” are inextricably linked in a deep and foundational way—that they may not even be two separate problems. Here are a few citations that point at some pieces of this case:
“Adversarial Spheres” (Gilmer et al. 2017) - “For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d).” (emphasis mine)
I think this paper is truly fantastic in many respects.
The central argument can be understood from the intuitions presented in Counterintuitive Properties of High Dimensional Space, in the section titled Concentration of Measure (Figure 9). Where it says “As the dimension increases, the width of the band necessary to capture 99% of the surface area decreases rapidly,” you can just replace that with “As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere”. “Small distance from the center of the sphere to that hyperplane” is what gives rise to “Small epsilon at which you can find an adversarial example”.
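To put rough numbers on that (a quick sketch of my own, not from either paper): if you sample points uniformly from the unit sphere in d dimensions and ask how far from the equator a hyperplane must sit so that only 1% of the surface area lies above it, the offset shrinks like 1/√d:

```python
import numpy as np

# Quick numerical check of the concentration-of-measure picture (my own
# sketch, not from either paper): for points drawn uniformly from the unit
# sphere in d dimensions, find the offset t at which the hyperplane x_1 = t
# has only 1% of the surface area above it.
rng = np.random.default_rng(0)
n = 1_000_000
for d in [10, 100, 1_000, 10_000]:
    # The first coordinate of a uniform point on the sphere is distributed as
    # g_1 / ||g|| for g ~ N(0, I_d); sample it without building full vectors.
    g1 = rng.normal(size=n)
    rest = rng.chisquare(d - 1, size=n)   # ||g||^2 contributed by the other d-1 coords
    coord = g1 / np.sqrt(g1**2 + rest)
    t = np.quantile(coord, 0.99)          # 1% of the surface lies above x_1 = t
    print(f"d={d:>6}  t = {t:.4f}   t*sqrt(d) = {t * np.sqrt(d):.2f}")
```

The printed t should shrink roughly in proportion to 1/√d (t·√d hovering around 2.3 for large d), matching the O(1/√d) scaling in the quote above.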
“Intriguing Properties of Adversarial Examples” (Cubuk et al. 2017) - “While adversarial accuracy is strongly correlated with clean accuracy, it is only weakly correlated with model size”
I haven’t read this paper, but I’ve heard good things about it.
To summarize, my belief is that any model that is trying to learn a decision boundary in a high-dimensional space, and is basically built out of linear units with some nonlinearities, will be susceptible to small-perturbation adversarial examples so long as it makes any errors at all.
(As a note—not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for “adversarial examples linearity” in an incognito tab)
As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere
What does the center of the sphere represent in this case?
(I’m imagining the training and test sets as consisting of points in a high-dimensional space, and the classifier as drawing a hyperplane to mostly separate them from each other. But I’m not sure what point in this space would correspond to the “center”, or what sphere we’d be talking about.)
“Adversarial Spheres” (Gilmer et al. 2017) - “For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d).” (emphasis mine)
Slightly off-topic, but quick terminology question. When I first read the abstract of this paper, I was very confused about what it was saying and had to re-read it several times, because of the way the word “tradeoff” was used.
I usually think of a tradeoff as an inverse relationship between two good things that you want both of. But in this case they use “tradeoff” to refer to the inverse relationship between “test error” and “average distance to nearest error”. Which is odd, because the first of those is bad and the second is good, no?
Is there something I’m missing that causes this to sound like a more natural way of describing things to others’ ears?
Thanks for the links! (That goes for Wei and Paul too.)
a group of researchers are beginning to think that in a broader sense “adversarial vulnerability” and “amount of test set error” are inextricably linked in a deep and foundational way—that they may not even be two separate problems.
I’d expect this to be true or false depending on the shape of the misclassified region. If you think of the input space as a white sheet, and the misclassified region as red polka dots, then we measure test error by throwing a dart at the sheet and checking if it hits a polka dot. To measure adversarial vulnerability, we take a dart that landed on a white part of the sheet and check the distance to the nearest red polka dot. If the sheet is covered in tiny red polka dots, this distance will be small on average. If the sheet has just a few big red polka dots, this will be larger on average, even if the total amount of red is the same.
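Here’s a 1-D toy version of that sheet (my own sketch, with arbitrary numbers) to make the point quantitative: the “red” region always has total measure 0.1, but arranging it as many tiny intervals versus a few large ones gives the same test error and very different average distances to the nearest error.

```python
import numpy as np

# 1-D toy version of the polka-dot sheet (my own sketch). The "sheet" is the
# unit interval and the "red" (misclassified) region always has total measure
# 0.1, arranged either as a few large intervals or as many tiny ones. Test
# error comes out the same; the average distance from a correctly classified
# point to the nearest error does not.
rng = np.random.default_rng(0)

def error_and_avg_distance(num_dots, total_red=0.1, n_samples=20_000):
    width = total_red / num_dots
    centers = (np.arange(num_dots) + 0.5) / num_dots          # evenly spaced dots
    starts, ends = centers - width / 2, centers + width / 2
    x = rng.uniform(size=n_samples)
    in_red = np.any((x[:, None] >= starts) & (x[:, None] <= ends), axis=1)
    white = x[~in_red]
    # Distance from a point to the interval [s, e] is max(s - x, x - e, 0).
    dist = np.maximum(starts[None, :] - white[:, None],
                      white[:, None] - ends[None, :])
    dist_to_nearest = np.maximum(dist, 0.0).min(axis=1)
    return in_red.mean(), dist_to_nearest.mean()

for num_dots in [2, 200]:
    err, avg_dist = error_and_avg_distance(num_dots)
    print(f"{num_dots:>3} dots: test error = {err:.3f}, "
          f"avg distance to nearest error = {avg_dist:.4f}")
```

Both settings should misclassify about 10% of points, but the average distance to the nearest error should come out around 0.11 with 2 dots versus roughly 0.001 with 200 dots.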
As a concrete example, suppose we trained a 1-nearest-neighbor classifier for 2-dimensional RGB images. Then the sheet is mostly red (because this is a terrible model), but there are splotches of white associated with each image in our training set. So this is a model that has lots of test error despite containing many spheres (around the training images) with 0% misclassification.
To measure the size of the polka dots, you could invert the typical adversarial perturbation procedure: Start with a misclassified input and find the minimal perturbation necessary to make it correctly classified.
(It’s possible that this sheet analogy is misleading due to the nature of high-dimensional spaces.)
Anyway, this relates back to the original topic of conversation: the extent to which capabilities research and safety research are separate. If “adversarial vulnerability” and “amount of test set error” are inextricably linked, that suggests that reducing test set error (“capabilities” research) improves safety, and addressing adversarial vulnerability (“safety” research) advances capabilities. The extreme version of this position is that software advances are all good and hardware advances are all bad.
(As a note—not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for “adversarial examples linearity” in an incognito tab)
Thanks. I’d seen both papers, but I don’t like linking to things I haven’t fully read.
Thanks. I’d seen both papers, but I don’t like linking to things I haven’t fully read.
I might just be confused, but this sentence seems like a non sequitur to me. I understood catherio to be responding to your comment about googling and not finding “papers or blog posts supporting [the claim that deep learning is not unusually susceptible to adversarial examples]”.
If that was already clear to you, then never mind. I was just confused about why you were talking about linking to things, when before the question seemed to be about what could be found by googling.
There doesn’t seem to be a lot of work on adversarial examples for random forests. This paper was the only one I found, but it says:
On a digit recognition task, we demonstrate that both gradient boosted trees and random forests are extremely susceptible to evasions.
Also if you look at Figure 3 and Figure 4 in the paper, it appears that the RF classifier is much more susceptible to adversarial examples than the NN classifier.
googling around, I wasn’t able to quickly find any papers or blog posts supporting it
I think it’s a little bit tricky because decision trees don’t work that well for the tasks where people usually study adversarial examples. And this isn’t my research area so I don’t know much about it.
That said, in addition to the paper Wei Dai linked, there is also this, showing that adversarial examples for neural nets transfer pretty well to decision trees (though I haven’t looked at that paper in any detail).
Thanks for this link, that is a handy reference!
Oh, that makes sense.