Here we apparently need a transparent, formally specified distance function if we have any hope of absolutely proving the absence of adversarial examples.
Well, a classifier that is 100% accurate would also do the job ;) (I’m not sure a 100% accurate classifier is feasible per se, but a classifier which can be made arbitrarily accurate given enough data/compute/life-long learning experience seems potentially feasible.)
Also, small perturbations aren’t necessarily the only way to construct adversarial examples. Suppose I want to attack a model M1, which I have access to, and I also have a more accurate model M2. Then I could execute an automated search for cases where M1 and M2 disagree. (Maybe I use gradient descent on the input space, maximizing an objective function corresponding to the level of disagreement between M1 and M2.) Then I hire people on Mechanical Turk to look through the disagreements and flag the ones where M1 is wrong. (Since M2 is more accurate, M1 will “usually” be wrong.)
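Here's a minimal sketch of that search, assuming PyTorch and two pretrained classifiers `m1` (the model under attack) and `m2` (the more accurate reference); the KL-divergence objective and the optimizer settings are just illustrative choices on my part, not something I've tested.

```python
# Minimal sketch (untested): gradient ascent on the input to find points
# where two classifiers disagree. m1 is the target model, m2 the reference.
import torch
import torch.nn.functional as F

def find_disagreement(m1, m2, x_init, steps=100, lr=0.01):
    """Maximize disagreement between m1 and m2 by optimizing the input."""
    x = x_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        log_p1 = F.log_softmax(m1(x), dim=-1)
        p2 = F.softmax(m2(x), dim=-1)
        # KL(m2 || m1) is large when the models assign very different probabilities.
        disagreement = F.kl_div(log_p1, p2, reduction="batchmean")
        opt.zero_grad()
        (-disagreement).backward()  # minimize the negative = maximize disagreement
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)  # keep the input in a valid pixel range
    return x.detach()

# Inputs where m1 and m2 end up predicting different labels become the
# candidates that get sent to human raters (the Mechanical Turk step).
```

The loop only produces candidates; the human-review step is still what decides which disagreements are genuine errors in M1.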
This is actually one way to look at what’s going on with traditional small-perturbation adversarial examples. M1 is a deep learning model and M2 is a 1-nearest-neighbor model—not very good in general, but quite accurate in the immediate neighborhood of data points with known labels. The problem is that deep learning models don’t have a very strong inductive bias towards mapping nearby inputs to nearby outputs (sometimes called “Lipschitzness”). L2 regularization actually makes deep learning models more Lipschitz: smaller coefficients mean smaller singular values for the weight matrices, which means less capacity to stretch nearby inputs away from each other in output space. I think maybe that’s part of why L2 regularization works.
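To make the Lipschitzness point concrete, here's a toy sketch (again standard PyTorch, untested): for a ReLU network, the product of the spectral norms (largest singular values) of the weight matrices upper-bounds the Lipschitz constant, so uniformly shrinking the weights, which is roughly what L2 regularization does, shrinks that bound multiplicatively.

```python
# Toy sketch: an upper bound on the Lipschitz constant of a ReLU MLP is the
# product of the spectral norms of its weight matrices (ReLU is 1-Lipschitz).
import torch
import torch.nn as nn

def lipschitz_upper_bound(mlp: nn.Sequential) -> float:
    bound = 1.0
    for layer in mlp:
        if isinstance(layer, nn.Linear):
            # ord=2 gives the spectral norm, i.e. the largest singular value.
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return bound

mlp = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
print(lipschitz_upper_bound(mlp))

# Uniformly shrinking the weights (roughly the effect of L2 weight decay
# accumulated over training) shrinks the bound multiplicatively.
with torch.no_grad():
    for layer in mlp:
        if isinstance(layer, nn.Linear):
            layer.weight.mul_(0.9)
print(lipschitz_upper_bound(mlp))  # ~0.81x the previous bound
```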
Hoping to expand the two paragraphs above (on disagreement search and Lipschitzness) into a paper with Matthew Barnett before too long—if anyone wants to help us get it published, please send me a PM (neither of us has ever published a paper before).