I’ve given a lightning talk twice now about “why should we care about adversarial examples?”, highlighting the distinction (which ML experts are usually aware of, but media and policy folks often miss) between an unsolved research problem and a real-world threat model.
I’ve now published a version of this as a general-audience Medium post.
For a LessWrong / Alignment Forum audience, I’ll highlight the fact that I’m directly and consciously drawing on the “intellectual puzzle” vs. “instrumental strategy” framing from Abram Demski’s Embedded Curiosities post (at the conclusion of the Embedded Agency series).
I’m mostly making an observation at the level of “how a research field works”, and find it interesting that agent foundations and adversarial examples research are both hitting similar confusions when justifying the importance of their work.
I also think a lot more could be said and explored on the object-level about which angles on “adversarial examples” are likely to yield valuable insights in the long run, and which are dead ends. But this post isn’t really about that.
I don’t know to what extent researchers themselves agree with this point. A lot of adversarial examples research looks at the imperceptible perturbation case, and many papers introduce new types of adversarial examples, without really explaining why or giving a motivation that is about unsolved research problems rather than real-world settings. It’s possible that researchers do think of it as a research problem and not a real-world problem, but present their papers differently because they think that’s necessary in order to be accepted.
The distinction between research problems and real-world threat models seems to parallel the distinction between theoretical or conceptual research and engineering in AI safety. (I would include not just Agent Foundations in the former, but also a lot of CHAI’s work, e.g. CIRL, so I don’t know if this is the same thing you’re pointing at.) The former typically asks questions of the form “how could we do this in principle, making simplifying assumptions X, Y and Z”, even though X, Y and Z are known not to hold in the real world, for the sake of having greater conceptual clarity that can later be leveraged as a solution to a real-world problem. Engineering work, on the other hand, is typically trying to scale an approach to a more complex environment (with the eventual goal of getting to a real-world problem).
Is there a post or paper which talks about this in more detail?
I understand optimizing for an imperfect measurement, but it’s not clear to me whether or how this is linked to small-perturbation adversarial examples, beyond general handwaving about the deficiencies of machine learning.