I like the idea of optimizing for career growth & AI safety separately. However, I’m not sure the difference between “capabilities research” and “safety research” is as clear-cut as Critch makes it sound.
Consider the problem of making ML more data-efficient. Superficially, this is “capabilities research”: I don’t think it appears on any AI safety research agenda, and it’s an established mainstream research area.
However, in order to do value learning, I think we’ll want ML to become much more data-efficient than it is currently. If ML is not data-efficient, then assembling a dataset for our values will be time-consuming, which might tempt arms race participants to cut corners.
And if we could make ML really data-efficient, that gets us closer to “do what I mean” systems where you give it a few examples of things to do/not do and it’s able to correctly infer your intent.
So does that mean the AI safety community should work on making ML more data-efficient? I’m not sure. I can think of arguments on both sides.
But my personal view is that answering these kinds of “differential capabilities research” questions is higher-impact than a lot of the AI safety work that is being done. As far as I can tell, most existing AI safety work either
(a) Treats safety as an applications problem, where we try to use existing AI techniques to prototype what safe systems might look like. But I expect such prototypes will be thrown away as the state of the art advances. Arguably, you hit the point of diminishing returns with this approach as soon as you finish your architecture diagram (since that’s the part that’s least likely to change as the field advances).
(b) Treats safety as a security problem, where we try to think of flaws AI systems might have and how we might guard against them. But flaws only exist in the context of particular systems. The C programming language has a lot of security issues due to the fact that strings are null-terminated. There’s a massive cottage industry built around exploiting and guarding against C-specific issues. But this is all historically contingent: We only care about this because C is a popular programming language. If C was not popular, this cottage industry wouldn’t exist.
Instead I would suggest a third approach:
(c) Treat safety as a differential technological development problem. Try to figure out which capabilities are on the critical path for FAI but not on the critical path for UFAI. Try to evaluate competing AI paradigms and forecast which could most easily evolve into a secure system, then try to improve benchmarks for that platform so it can win the standards war. If none of the existing paradigms seem likely to be adequate, maybe devise a new paradigm de novo. Don’t forget about sociological factors.
Note that approach (c) looks a lot more like “capabilities research” than “safety research”. It requires careful judgement calls by domain experts. Work of types (a) and (b) will likely be useful to inform those judgement calls. But (c) is the way to go in the long run, IMO. If you were an effective altruist living during the 1980s trying to ensure that computers of the future would be secure, I think promoting the adoption of a non-C programming language would likely be the highest-leverage thing to do.
[This ended up being a pretty long tangent, maybe I should make this comment into a toplevel post? Perhaps people could tell me if/why they disagree first.]
Can you explain (c) a bit more? What specifically should someone be doing now, if they want to do (c)?
Well, here is a list of paradigms that might overtake deep learning. This list could probably be expanded, e.g. by researching various attempts to integrate deep learning with Bayesian reasoning, create more interpretable models, etc.
Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc. Additionally, there are pragmatic considerations related to whether a particular paradigm has a serious hope of widespread adoption. How competitive is it? Does it address researcher complaints about deep learning?
Then you could create a 2D matrix with paradigms on one axis and desiderata on the other. For each paradigm/desideratum combo, figure out whether that paradigm satisfies, or could be improved to satisfy, that desideratum. As you do this you’d probably get ideas for new rows/columns in your matrix.
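For concreteness, here’s a minimal sketch of what such a matrix might look like in code. Every entry is an illustrative placeholder, not a researched verdict, and the shortlisting rule is just one possible way to read the matrix:

```python
# Hypothetical paradigm/desiderata matrix. Entries are placeholders:
# "yes" / "no" / "maybe" = satisfies / doesn't / could with more research.
paradigms = ["deep learning", "gradient boosted trees", "probabilistic programs"]
desiderata = ["adversarial robustness", "interpretability", "calibration"]

matrix = {
    ("deep learning", "adversarial robustness"): "no",
    ("deep learning", "interpretability"): "maybe",
    ("deep learning", "calibration"): "maybe",
    ("gradient boosted trees", "adversarial robustness"): "maybe",
    ("gradient boosted trees", "interpretability"): "yes",
    ("gradient boosted trees", "calibration"): "maybe",
    ("probabilistic programs", "adversarial robustness"): "maybe",
    ("probabilistic programs", "interpretability"): "yes",
    ("probabilistic programs", "calibration"): "yes",
}

def promising(paradigm):
    """One possible reading: a paradigm is worth a closer look
    if no desideratum is a hard 'no'."""
    return all(matrix[(paradigm, d)] != "no" for d in desiderata)

shortlist = [p for p in paradigms if promising(p)]
print(shortlist)
```

The point of the exercise isn’t the final labels; it’s that filling in each cell forces a concrete research question ("could paradigm X be improved to satisfy desideratum Y?").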
Then you could look at your matrix and try to figure out which paradigms are most promising for FAI—or if none seem good enough, invent a new one. Do technology evangelism for the chosen paradigm(s). Try to improve the paradigm’s resume of accomplishments. Rally the AI safety community.
Computer science, and AI in particular, have always been hype-driven. The market in paradigms doesn’t seem efficient or driven purely by questions of technical merit. And there can be a lot of path-dependence. As AI safety concerns gain mindshare, I think we stand a solid chance of influencing which paradigms gain traction.
Another approach to differential capabilities development is to try to identify an application of AI which shares a lot of features with the AI safety problem and demonstrate its commercial viability. For example, self-driving cars are safety-critical in nature, which seems good. But they also must make real-time decisions, whereas it is probably desirable for an FAI to spend time pondering the nature of our values, ask us clarifying questions, etc.
Fun fact: Silicon Valley’s new behemoth investor is a believer in the technological singularity. It’s too bad the Singularity Summit is not still a thing or he could be invited to speak.
Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc.
For most of these examples, the current research in safety is more like “Try to find any approach that has a hope of satisfying that desideratum while being competitive.”
So your matrix just ends up being a lot of “no” or “maybe if we did more research.”
It seems correct that people are trying to “find some approach that might work” before they try “rally the community around an approach that might work.”
Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested. So an extreme level of pessimism doesn’t seem especially well-justified to me. It seems relatively common for a task to be hard in one framework while being easy in another.
Standard CFAR advice: Instead of assuming a problem is unsolvable, sit down and try to think of a solution for a timed 5 minutes. Has anyone spent a timed 5 minutes trying to figure out, say, how vulnerable gcForest is likely to be to adversarial examples? You don’t necessarily have to solve all the problems yourself, either: 5 minutes of research is enough to determine that creating models which “correctly capture uncertainty” seems to be one of Uber’s design goals with Pyro (which seems related to calibration/robustness to distributional shift).
BTW, I’ve spent a fair amount of time thinking about & reading about creativity, and I don’t think extreme pessimism is at all conducive to generating ideas. If your evidence for a problem being hard is “I couldn’t think of any good approaches”, and you were pretty sure there weren’t any good approaches before you started thinking, I don’t find that evidence super compelling.
It seems correct that people are trying to “find some approach that might work” before they try “rally the community around an approach that might work.”
I agree. That’s why I suggested going breadth-first initially.
Even if pessimism is justified, I think a breadth-first approach is sensible if it’s possible to estimate the difficulty of overcoming various problems in the context of various frameworks in advance. If making any progress at all is expected to be hard, all the more reason to choose targets strategically.
Has anyone spent a timed 5 minutes trying to figure out, say, how vulnerable gcForest is likely to be to adversarial examples?
Yes. (Answer: deep learning is not unusually susceptible to adversarial examples.)
5 minutes of research is enough to determine that creating models which “correctly capture uncertainty” seems to be one of Uber’s design goals with Pyro (which seems related to calibration/robustness to distributional shift)
In fact there is a (vast) literature on this topic.
Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested.
Go for it.
It seems relatively common for a task to be hard in one framework while being easy in another.
I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).
deep learning is not unusually susceptible to adversarial examples
FWIW, this claim doesn’t match my intuition, and googling around, I wasn’t able to quickly find any papers or blog posts supporting it. This 2015 blog post discusses how deep learning models are susceptible due to linearity, which makes intuitive sense to me; the dot product is a relatively bad measure of similarity between vectors. It proposes a strategy for finding adversarial examples for a random forest and says it hasn’t yet empirically been confirmed that random forests are unsafe. This empirical confirmation seems pretty important to me, because adversarial examples are only a thing because the wrong decision boundary has been learned. If the only way to create an “adversarial” example for a random forest is to permute an input until it genuinely appears to be a member of a different class, that doesn’t seem like a flaw. (I don’t expect that random forests always learn the correct decision boundaries, but my offhand guess would be that they are still less susceptible to adversarial examples than traditional deep models.)
I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).
From my perspective, a lot of AI safety challenges get vastly easier if you have the ability to train well-calibrated models for complex, unstructured data. If you have this ability, the AI’s model of human values doesn’t need to be perfect—since the model is well-calibrated, it knows what it does/does not know and can ask for clarification as necessary.
Calibration could also provide a very general solution for corrigibility: If the AI has a well-calibrated model of which actions are/are not corrigible, and just how bad various incorrigible actions are, then it can ask for clarification as needed on that too. Corrigibility learning allows for notions of corrigibility that are very fine-grained: you can tell the AI that preventing a bad guy from flipping its off switch is OK, but preventing a good guy from flipping its off switch is not OK. By training a model, you don’t have to spend a lot of time hand-engineering, and the model will hopefully generalize to incorrigible actions that the designers didn’t anticipate. (Per the calibration assumption, the model will usually either generalize correctly, or the AI will realize that it doesn’t know whether a novel plan would qualify as corrigible, and it can ask for clarification.)
That’s why I currently think improving ML models is the action with the highest leverage.
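As a sketch of the “ask for clarification” policy this would enable, here is a toy decision loop. The model, action names, probabilities, and threshold are all invented stand-ins; a real version would require an actual trained, verified-calibrated classifier:

```python
# Stand-in for a well-calibrated model of P(action is corrigible).
# A real model would be trained on labeled examples; here the
# probabilities are made up for illustration.
def calibrated_corrigibility_model(action):
    known = {"shut down when asked": 0.99, "disable the off switch": 0.01}
    return known.get(action, 0.5)  # unfamiliar actions get maximum uncertainty

def decide(action, threshold=0.9):
    """Proceed or refuse only when confident; otherwise ask a human."""
    p = calibrated_corrigibility_model(action)
    if p >= threshold:
        return "proceed"
    if p <= 1 - threshold:
        return "refuse"
    return "ask for clarification"

print(decide("shut down when asked"))    # proceed
print(decide("disable the off switch"))  # refuse
print(decide("replace the off switch"))  # ask for clarification
```

The entire safety load here rests on the calibration assumption: if the model is overconfident on novel incorrigible actions, the “ask” branch never fires when it matters.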
FWIW, this claim doesn’t match my intuition, and googling around, I wasn’t able to quickly find any papers or blog posts supporting it.
“Explaining and Harnessing Adversarial Examples” (Goodfellow et al. 2014) is the original demonstration that “Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples”.
I’ll emphasize that high-dimensionality is a crucial piece of the puzzle, which I haven’t seen you bring up yet. You may already be aware of this, but I’ll emphasize it anyway: the usual intuitions do not even remotely apply in high-dimensional spaces. Check out Counterintuitive Properties of High Dimensional Space.
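To see why linearity plus high dimensionality matters, here’s a toy numpy calculation in the spirit of Goodfellow et al.’s fast-gradient-sign construction: a perturbation of fixed, tiny size per coordinate changes a linear score w·x by ε·‖w‖₁, which grows with the dimension even though each coordinate barely moves:

```python
import numpy as np

# For a linear score w.x, perturbing x by eps*sign(w) changes the score
# by eps*||w||_1. With unit-magnitude weights this equals eps*d, so the
# same per-coordinate eps does far more damage in high dimensions.
rng = np.random.default_rng(0)
changes = {}
for d in [10, 1000, 100000]:
    w = rng.choice([-1.0, 1.0], size=d)   # unit-magnitude weights
    x = rng.normal(size=d)
    eps = 0.01                            # tiny per-coordinate perturbation
    delta = eps * np.sign(w)              # fast-gradient-sign direction
    changes[d] = w @ (x + delta) - w @ x  # equals eps * d for these weights
    print(d, changes[d])
```

At d = 100,000 an imperceptible 0.01 per-coordinate nudge shifts the score by about 1000, which is the whole linearity story in miniature.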
adversarial examples are only a thing because the wrong decision boundary has been learned
In my opinion, this is spot-on—not only your claim that there would be no adversarial examples if the decision boundary were perfect, but in fact a group of researchers are beginning to think that in a broader sense “adversarial vulnerability” and “amount of test set error” are inextricably linked in a deep and foundational way—that they may not even be two separate problems. Here are a few citations that point at some pieces of this case:
“Adversarial Spheres” (Gilmer et al. 2017) - “For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d).” (emphasis mine)
I think this paper is truly fantastic in many respects.
The central argument can be understood from the intuitions presented in Counterintuitive Properties of High Dimensional Space in the section titled Concentration of Measure (Figure 9). Where it says “As the dimension increases, the width of the band necessary to capture 99% of the surface area decreases rapidly.”, you can replace that with “As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere”. “Small distance from the center of the sphere” is what gives rise to “small epsilon at which you can find an adversarial example”.
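This concentration effect is easy to check numerically. The sketch below samples uniform points on the unit sphere and estimates the height of a cap containing 1% of the surface area; that height shrinks roughly like 1/√d, matching the O(1/√d) perturbation size in the Gilmer et al. result:

```python
import numpy as np

# Uniform points on the unit sphere in d dimensions: normalize Gaussians.
# The 99th percentile of the first coordinate is the height of a cap
# holding the top 1% of surface area; it shrinks like ~1/sqrt(d).
rng = np.random.default_rng(0)
heights = {}
for d in [10, 100, 1000]:
    x = rng.normal(size=(10000, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # project onto the sphere
    heights[d] = np.quantile(x[:, 0], 0.99)        # cap height for top 1%
    print(d, heights[d], heights[d] * np.sqrt(d))
```

Multiplying the cap height by √d gives a roughly constant value (near 2.3 for large d, since each coordinate is approximately N(0, 1/d)), so a hyperplane that misclassifies 1% of the sphere sits within O(1/√d) of almost everything.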
“Intriguing Properties of Adversarial Examples” (Cubuk et al. 2017) - “While adversarial accuracy is strongly correlated with clean accuracy, it is only weakly correlated with model size”
I haven’t read this paper, but I’ve heard good things about it.
To summarize, my belief is that any model that is trying to learn a decision boundary in a high-dimensional space, and is basically built out of linear units with some nonlinearities, will be susceptible to small-perturbation adversarial examples so long as it makes any errors at all.
(As a note—not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for “adversarial examples linearity” in an incognito tab)
As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere
What does the center of the sphere represent in this case?
(I’m imagining the training and test sets consisting of points in a high-dimensional space, and the classifier as drawing a hyperplane that mostly separates them from each other. But I’m not sure what point in this space would correspond to the “center”, or what sphere we’d be talking about.)
“Adversarial Spheres” (Gilmer et al. 2017) - “For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d).” (emphasis mine)
Slightly off-topic, but quick terminology question. When I first read the abstract of this paper, I was very confused about what it was saying and had to re-read it several times, because of the way the word “tradeoff” was used.
I usually think of a tradeoff as an inverse relationship between two good things that you want both of. But in this case they use “tradeoff” to refer to the inverse relationship between “test error” and “average distance to nearest error”. Which is odd, because the first of those is bad and the second is good, no?
Is there something I’m missing that causes this to sound like a more natural way of describing things to others’ ears?
Thanks for the links! (That goes for Wei and Paul too.)
a group of researchers are beginning to think that in a broader sense “adversarial vulnerability” and “amount of test set error” are inextricably linked in a deep and foundational way—that they may not even be two separate problems.
I’d expect this to be true or false depending on the shape of the misclassified region. If you think of the input space as a white sheet, and the misclassified region as red polka dots, then we measure test error by throwing a dart at the sheet and checking if it hits a polka dot. To measure adversarial vulnerability, we take a dart that landed on a white part of the sheet and check the distance to the nearest red polka dot. If the sheet is covered in tiny red polka dots, this distance will be small on average. If the sheet has just a few big red polka dots, this will be larger on average, even if the total amount of red is the same.
As a concrete example, suppose we trained a 1-nearest-neighbor classifier for 2-dimensional RGB images. Then the sheet is mostly red (because this is a terrible model), but there are splotches of white associated with each image in our training set. So this is a model that has lots of test error despite many spheres with 0% misclassifications.
To measure the size of the polka dots, you could invert the typical adversarial perturbation procedure: Start with a misclassified input and find the minimal perturbation necessary to make it correctly classified.
(It’s possible that this sheet analogy is misleading due to the nature of high-dimensional spaces.)
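The polka-dot intuition is easy to simulate in 2D (with the caveat above that low-dimensional pictures can mislead). In the sketch below, both “sheets” have the same total error area of about 1%, but one spreads it over 100 tiny dots and the other over a single big dot; the distance from a typical correct point to the nearest error differs dramatically:

```python
import numpy as np

# Two unit-square "sheets" with equal total red (error) area:
# 100 dots of radius r vs 1 dot of radius 10r. Same test error,
# very different average distance to the nearest error.
rng = np.random.default_rng(0)

def mean_distance_to_error(centers, radius, n_probes=2000):
    probes = rng.random((n_probes, 2))  # random points on the sheet
    d = np.linalg.norm(probes[:, None, :] - centers[None, :, :], axis=2)
    nearest = d.min(axis=1) - radius    # distance to the nearest dot's edge
    return nearest[nearest > 0].mean()  # average over correctly classified points

r = 0.005
many_small = rng.random((100, 2))  # 100 small dot centers
one_big = rng.random((1, 2))       # 1 big dot center

avg_many_small = mean_distance_to_error(many_small, r)
avg_one_big = mean_distance_to_error(one_big, 10 * r)
print(avg_many_small, avg_one_big)
```

With many small dots, a typical correct point is only a short hop from an error (high adversarial vulnerability); with one big dot, the same total error leaves most points far from any mistake.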
Anyway, this relates back to the original topic of conversation: the extent to which capabilities research and safety research are separate. If “adversarial vulnerability” and “amount of test set error” are inextricably linked, that suggests that reducing test set error (“capabilities” research) improves safety, and addressing adversarial vulnerability (“safety” research) advances capabilities. The extreme version of this position is that software advances are all good and hardware advances are all bad.
(As a note—not trying to be snarky, just trying to be genuinely helpful, Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for “adversarial examples linearity” in an incognito tab)
Thanks. I’d seen both papers, but I don’t like linking to things I haven’t fully read.
Thanks. I’d seen both papers, but I don’t like linking to things I haven’t fully read.
I might just be confused, but this sentence seems like a non sequitur to me. I understood catherio to be responding to your comment about googling and not finding “papers or blog posts supporting [the claim that deep learning is not unusually susceptible to adversarial examples]”.
If that was already clear to you then, never mind. I was just confused why you were talking about linking to things, when before the question seemed to be about what could be found by googling.
There doesn’t seem to be a lot of work on adversarial examples for random forests. This paper was the only one I found, but it says:
On a digit recognition task, we demonstrate that both gradient boosted trees and random forests are extremely susceptible to evasions.
Also if you look at Figure 3 and Figure 4 in the paper, it appears that the RF classifier is much more susceptible to adversarial examples than the NN classifier.
googling around, I wasn’t able to quickly find any papers or blog posts supporting it
I think it’s a little bit tricky because decision trees don’t work that well for the tasks where people usually study adversarial examples. And this isn’t my research area so I don’t know much about it.
That said, in addition to the paper Wei Dai linked, there is also this, showing that adversarial examples for neural nets transfer pretty well to decision trees (though I haven’t looked at that paper in any detail).
Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested.
I think blog posts are potentially weird measures of effort here. I also think that this is something people are interested in doing—I think it’s a component of MIRI’s strategic sketch here (as part 8)—but it isn’t the sort of thing where we have anything particularly worthwhile to show for it yet.
Perhaps it makes sense to sketch an argument for why none of the standard paradigms satisfy some desideratum? This is kind of what AI Safety Gridworlds did. But you run into the problem where, say, gradient boosted random forests have more of the ‘transparency’ property in a particular, legalistic way (it’s easier to assign blame for any particular classification than it would be with a neural net) but not in the way that we actually care about (being able to look at a gradient boosted random forest and figure out whether it’s thinking about things in the way we want it to), which might actually be easier with a neural net (because we could look at what the neuron activations correspond to).
I like the idea of optimizing for career growth & AI safety separately. However, I’m not sure the difference between “capabilities research” and “safety research” is as clear-cut as Critch makes it sound.
Consider the problem of making ML more data-efficient. Superficially, this is “capabilities research”: I don’t think it appears on any AI safety research agenda, and it’s an established mainstream research area.
However, in order to do value learning, I think we’ll want ML to become much more data-efficient than it is currently. If ML is not data-efficient, then assembling a dataset for our values will be time-consuming, which might tempt arms race participants to cut corners.
And if we could make ML really data-efficient, that gets us closer to “do what I mean” systems where you give it a few examples of things to do/not do and it’s able to correctly infer your intent.
So does that mean the AI safety community should work on making ML more data-efficient? I’m not sure. I can think of arguments on both sides.
But my personal view is that answering these kind of “differential capabilities research” questions is higher-impact than a lot of the AI safety work that is being done. As far as I can tell, most existing AI safety work either
(a) Treats safety as a applications problem, where we try to use existing AI techniques to prototype what safe systems might look like. But I expect such prototypes will be thrown away as the state of the art advances. Arguably, you hit the point of diminishing returns with this approach as soon as you finish your architecture diagram (since that’s the part that’s least likely to change as the field advances).
(b) Treats safety as a security problem, where we try to think of flaws AI systems might have and how we might guard against them. But flaws only exist in the context of particular systems. The C programming language has a lot of security issues due to the fact that strings are null-terminated. There’s a massive cottage industry built around exploiting and guarding against C-specific issues. But this is all historically contingent: We only care about this because C is a popular programming language. If C was not popular, this cottage industry wouldn’t exist.
Instead I would suggest a third approach:
(c) Treat safety as a differential technological development problem. Try to figure out which capabilities are on the critical path for FAI but not on the critical path for UFAI. Try to evaluate competing AI paradigms and forecast which could most easily evolve into a secure system, then try to improve benchmarks for that platform so it can win the standards war. If none of the existing paradigms seem likely to be adequate, maybe devise a new paradigm de novo. Don’t forget about sociological factors.
Note that approach (c) looks a lot more like “capabilities research” than “safety research”. It requires careful judgement calls by domain experts. Work of types (a) and (b) will likely be useful to inform those judgement calls. But (c) is the way to go in the long run, IMO. If you were an effective altruist living during the 1980s trying to ensure that computers of the future would be secure, I think promoting the adoption of a non-C programming language would likely be the highest-leverage thing to do.
[This ended up being a pretty long tangent, maybe I should make this comment into a toplevel post? Perhaps people could tell me if/why they disagree first.]
Can you explain (c) a bit more? What specifically should someone be doing now, if they want to do (c)?
Well, here is a list of paradigms that might overtake deep learning. This list could probably be expanded, e.g. by researching various attempts to integrate deep learning with Bayesian reasoning, create more interpretable models, etc.
Then you could come up with a list of desiderata we seek in a paradigm: resistance to adversarial examples, robustness to distributional shift, interpretability, conservative concepts, calibration, etc. Additionally, there are pragmatic considerations related to whether a particular paradigm has a serious hope of widespread adoption. How competitive is it? Does it address researcher complaints about deep learning?
Then you could create a 2d matrix with paradigms on one axis and desiderata on another axis. For each paradigm/desiderata combo, figure out if that paradigm satisfies, or could be improved to satisfy, that desiderata. As you do this you’d probably get ideas for new rows/columns in your matrix.
Then you could look at your matrix and try to figure out which paradigms are most promising for FAI—or if none seem good enough, invent a new one. Do technology evangelism for the chosen paradigm(s). Try to improve the paradigm’s resume of accomplishments. Rally the AI safety community.
Computer science, and AI in particular, have always been hype-driven. The market in paradigms doesn’t seem efficient or driven purely by questions of technical merit. And there can be a lot of path-dependence. As AI safety concerns gain mindshare, I think we stand a solid chance of influencing which paradigms gain traction.
Another approach to differential capabilities development is to try to identify an application of AI which shares a lot of features with the AI safety problem and demonstrate its commercial viability. For example, self-driving cars are safety-critical in nature, which seems good. But they also must make real-time decisions, whereas it is probably desirable for an FAI to spend time pondering the nature of our values, ask us clarifying questions, etc.
Fun fact: Silicon Valley’s new behemoth investor is a believer in the technological singularity. It’s too bad the Singularity Summit is not still a thing or he could be invited to speak.
For most of these examples, the current research in safety is more like “Try to find any approach that has a hope of satisfying that desideratum while being competitive.”
So your matrix just ends up being a lot of “no” or “maybe if we did more research.”
It seems correct that people are trying to “find some approach that might work” before they try “rally the community around an approach that might work.”
Well, I haven’t seen even a blog post’s worth of effort put into doing something like what I suggested. So an extreme level of pessimism doesn’t seem especially well-justified to me. It seems relatively common for a task to be hard in one framework while being easy in another.
Standard CFAR advice: Instead of assuming a problem is unsolvable, sit down and try to think of a solution for a timed 5 minutes. Has anyone spent a timed 5 minutes trying to figure out, say, how vulnerable gcForest is likely to be to adversarial examples? You don’t necessarily have to solve all the problems yourself, either: 5 minutes of research is enough to determine that creating models which “correctly capture uncertainty” seems to be one of Uber’s design goals with Pyro (which seems related to calibration/robustness to distributional shift).
BTW, I’ve spent a fair amount of time thinking about & reading about creativity, and I don’t think extreme pessimism is at all conducive to generating ideas. If your evidence for a problem being hard is “I couldn’t think of any good approaches”, and you were pretty sure there weren’t any good approaches before you started thinking, I don’t find that evidence super compelling.
I agree. That’s why I suggested going breadth-first initially.
Even if pessimism is justified, I think a breadth-first approach is sensible if it’s possible to estimate the difficulty of overcoming various problems in the context of various frameworks in advance. If making any progress at all is expected to be hard, all the more reason to choose targets strategically.
Yes. (Answer: deep learning is not unusually susceptible to adversarial examples.)
In fact there is a (vast) literature on this topic.
Go for it.
I agree. But note that from the perspective of the AI safety research that I do, none of the frameworks in the post you link change the basic picture, except maybe for hierarchical temporal memory (which seems like a non-starter).
FWIW, this claim doesn’t match my intuition, and googling around, I wasn’t able to quickly find any papers or blog posts supporting it. This 2015 blog post discusses how deep learning models are susceptible due to linearity, which makes intuitive sense to me; the dot product is a relatively bad measure of similarity between vectors. It proposes a strategy for finding adversarial examples for a random forest and says it hasn’t yet empirically been confirmed that random forests are unsafe. This empirical confirmation seems pretty important to me, because adversarial examples are only a thing because the wrong decision boundary has been learned. If the only way to create an “adversarial” example for a random forest is to permute an input until it genuinely appears to be a member of a different class, that doesn’t seem like a flaw. (I don’t expect that random forests always learn the correct decision boundaries, but my offhand guess would be that they are still less susceptible to adversarial examples than traditional deep models.)
From my perspective, a lot of AI safety challenges get vastly easier if you have the ability to train well-calibrated models for complex, unstructured data. If you have this ability, the AI’s model of human values doesn’t need to be perfect—since the model is well-calibrated, it knows what it does/does not know and can ask for clarification as necessary.
Calibration could also provide a very general solution for corrigibility: If the AI has a well-calibrated model of which actions are/are not corrigible, and just how bad various incorrigible actions are, then it can ask for clarification as needed on that too. Corrigibility learning allows for notions of corrigibility that are very fine-grained: you can tell the AI that preventing a bad guy from flipping its off switch is OK, but preventing a good guy from flipping its off switch is not OK. By training a model, you don’t have to spend a lot of time hand-engineering, and the model will hopefully generalize to incorrigible actions that the designers didn’t anticipate. (Per the calibration assumption, the model will usually either generalize correctly, or the AI will realize that it doesn’t know whether a novel plan would qualify as corrigible, and it can ask for clarification.)
That’s why I currently think improving ML models is the action with the highest leverage.
“Explaining and Harnessing Adversarial Examples” (Goodfellow et al. 2014) is the original demonstration that “Linear behavior in high-dimensional spaces is sufficient to cause adversarial examples”.
I’ll emphasize that high-dimensionality is a crucial piece of the puzzle, which I haven’t seen you bring up yet. You may already be aware of this, but I’ll emphasize it anyway: the usual intuitions do not even remotely apply in high-dimensional spaces. Check out Counterintuitive Properties of High Dimensional Space.
In my opinion, this is spot-on. And it goes beyond your claim that there would be no adversarial examples if the decision boundary were perfect: a group of researchers is beginning to think that, in a broader sense, “adversarial vulnerability” and “amount of test set error” are inextricably linked in a deep and foundational way, and that they may not even be two separate problems. Here are a few citations that point at pieces of this case:
“Adversarial Spheres” (Gilmer et al. 2017) - “For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size O(1/√d).” (emphasis mine)
I think this paper is truly fantastic in many respects.
The central argument can be understood from the intuitions presented in Counterintuitive Properties of High Dimensional Space in the section titled Concentration of Measure (Figure 9). Where it says “As the dimension increases, the width of the band necessary to capture 99% of the surface area decreases rapidly,” you can just replace that with “As the dimension increases, a decision-boundary hyperplane that has 1% test error rapidly gets extremely close to the equator of the sphere.” “Small distance from the center of the sphere” is what gives rise to “small epsilon at which you can find an adversarial example.”
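You can see this concentration-of-measure effect directly with a quick Monte Carlo sketch (my own toy demo): sample uniform points on the unit sphere in R^d and measure how many fall within a thin band around an equator. The band’s half-width stays fixed, but the fraction of surface area it captures rushes toward 1 as d grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
band = 0.1  # half-width of the band around the equator, along one axis

for d in [2, 10, 100, 1000]:
    # Uniform points on the unit sphere in R^d: normalize Gaussian samples.
    pts = rng.normal(size=(n, d))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    # Fraction of points whose first coordinate lies within the band.
    frac = np.mean(np.abs(pts[:, 0]) < band)
    print(d, round(float(frac), 3))
```

In d = 2 the band captures only a few percent of the circle; by d = 1000 it captures essentially everything. That’s the same phenomenon driving the hyperplane-near-the-equator picture above.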
“Intriguing Properties of Adversarial Examples” (Cubuk et al. 2017) - “While adversarial accuracy is strongly correlated with clean accuracy, it is only weakly correlated with model size”
I haven’t read this paper, but I’ve heard good things about it.
To summarize, my belief is that any model that is trying to learn a decision boundary in a high-dimensional space, and is basically built out of linear units with some nonlinearities, will be susceptible to small-perturbation adversarial examples so long as it makes any errors at all.
(As a note, not trying to be snarky, just trying to be genuinely helpful: Cubuk et al. 2017 and Goodfellow et al. 2014 are my top two hits for “adversarial examples linearity” in an incognito tab.)
What does the center of the sphere represent in this case?
(I’m imagining the training and test sets as consisting of points in a high-dimensional space, and the classifier as drawing a hyperplane that mostly separates them from each other. But I’m not sure what point in this space would correspond to the “center”, or what sphere we’d be talking about.)
Thanks for this link, that is a handy reference!
Slightly off-topic, but quick terminology question. When I first read the abstract of this paper, I was very confused about what it was saying and had to re-read it several times, because of the way the word “tradeoff” was used.
I usually think of a tradeoff as an inverse relationship between two good things that you want both of. But in this case they use “tradeoff” to refer to the inverse relationship between “test error” and “average distance to nearest error”. Which is odd, because the first of those is bad and the second is good, no?
Is there something I’m missing that causes this to sound like a more natural way of describing things to others’ ears?
Thanks for the links! (That goes for Wei and Paul too.)
I’d expect this to be true or false depending on the shape of the misclassified region. If you think of the input space as a white sheet, and the misclassified region as red polka dots, then we measure test error by throwing a dart at the sheet and checking if it hits a polka dot. To measure adversarial vulnerability, we take a dart that landed on a white part of the sheet and check the distance to the nearest red polka dot. If the sheet is covered in tiny red polka dots, this distance will be small on average. If the sheet has just a few big red polka dots, this will be larger on average, even if the total amount of red is the same.
As a concrete example, suppose we trained a 1-nearest-neighbor classifier for 2-dimensional RGB images. Then the sheet is mostly red (because this is a terrible model), but there are splotches of white associated with each image in our training set. So this is a model that has lots of test error despite many spheres with 0% misclassifications.
To measure the size of the polka dots, you could invert the typical adversarial perturbation procedure: Start with a misclassified input and find the minimal perturbation necessary to make it correctly classified.
(It’s possible that this sheet analogy is misleading due to the nature of high-dimensional spaces.)
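The polka-dot picture is easy to simulate in 2-D (purely illustrative, and subject to the caveat above that high-dimensional behavior may differ). Holding the total red area fixed, many small dots yield a much smaller average distance from a white point to the nearest error than a few big dots:

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_distance_to_red(centers, radius, n_samples=5_000):
    """Average distance from random white points on the unit square to the
    edge of the nearest red disc ("distance to nearest error")."""
    pts = rng.uniform(size=(n_samples, 2))
    # Distance from each sample to the nearest disc center.
    d = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2).min(axis=1)
    white = d > radius  # discard samples that landed inside a red dot
    return float(np.mean(d[white] - radius))

# Equal total red area: 4 * pi * 0.05^2 == 400 * pi * 0.005^2
few_big = avg_distance_to_red(rng.uniform(size=(4, 2)), radius=0.05)
many_small = avg_distance_to_red(rng.uniform(size=(400, 2)), radius=0.005)
print(few_big, many_small)  # many small dots -> much smaller average distance
```

So “test error” (red fraction) and “average distance to nearest error” really can come apart depending on the shape of the misclassified region, at least in low dimensions.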
Anyway, this relates back to the original topic of conversation: the extent to which capabilities research and safety research are separate. If “adversarial vulnerability” and “amount of test set error” are inextricably linked, that suggests that reducing test set error (“capabilities” research) improves safety, and addressing adversarial vulnerability (“safety” research) advances capabilities. The extreme version of this position is that software advances are all good and hardware advances are all bad.
Thanks. I’d seen both papers, but I don’t like linking to things I haven’t fully read.
I might just be confused, but this sentence seems like a non sequitur to me. I understood catherio to be responding to your comment about googling and not finding “papers or blog posts supporting [the claim that deep learning is not unusually susceptible to adversarial examples]”.
If that was already clear to you then, never mind. I was just confused why you were talking about linking to things, when before the question seemed to be about what could be found by googling.
Oh, that makes sense.
There doesn’t seem to be a lot of work on adversarial examples for random forests. This paper was the only one I found, but it says:
Also if you look at Figure 3 and Figure 4 in the paper, it appears that the RF classifier is much more susceptible to adversarial examples than the NN classifier.
I think it’s a little bit tricky because decision trees don’t work that well for the tasks where people usually study adversarial examples. And this isn’t my research area so I don’t know much about it.
That said, in addition to the paper Wei Dai linked, there is also this, showing that adversarial examples for neural nets transfer pretty well to decision trees (though I haven’t looked at that paper in any detail).
I think blog posts are potentially weird measures of effort here. I also think that this is something people are interested in doing (it’s a component of MIRI’s strategic sketch here, as part 8), but it isn’t the sort of thing where we have anything particularly worthwhile to show for it yet.
Perhaps it makes sense to sketch an argument for why none of the standard paradigms satisfy some desideratum? This is kind of what AI Safety Gridworlds did. But it’s more the thing where, say, gradient boosted random forests have more of the ‘transparency’ property in a particular, legalistic way (it’s easier to figure out blame for any particular classification than it would be with a neural net), but not in the way that we actually care about (being able to look at the model and figure out whether it’s thinking about things in the way that we want it to be), which might actually be easier with a neural net (because we could look at what the neuron activations correspond to).