Yes, that’s a pretty fair interpretation! The macroscopic/folk psychology notion of “surprise” of course doesn’t map super cleanly onto the information-theoretic notion. But I tend to think of it as: there is a certain “expected surprise” about what future possible states might look like if everything evolves “as usual”, $I_p([x_1, \dots, x_N])$. And then there is the (usually larger) “additional surprise” about the states that the AI might steer us into, $I_\xi([x_1, \dots, x_N])$. The delta between those two is the “excess surprise” that the AI needs to be able to bring about.
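To spell that out in symbols (just restating the delta; I’m assuming the standard surprisal convention $I_q(x) = -\ln q(x)$ and reading $\xi$ as the distribution over states the AI might steer us into, which may not exactly match the post’s definitions):

$$\Delta I([x_1, \dots, x_N]) \;=\; I_\xi([x_1, \dots, x_N]) - I_p([x_1, \dots, x_N]) \;=\; \ln \frac{p([x_1, \dots, x_N])}{\xi([x_1, \dots, x_N])},$$

i.e. a log-likelihood ratio between the “as usual” dynamics $p$ and the AI-steered dynamics $\xi$, evaluated on the trajectory in question.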
It’s tricky to come up with a straightforward setting where the actions of the AI can be measured in nats, but perhaps the following works as an intuition pump: “If we give the AI full, unrestricted access to a control panel that controls the universe, how many operations does it have to perform to bring about the catastrophic event?” That’s clearly still not well defined (there is no obvious/privileged way the panel should look), but it shows 1) that the “excess surprise” is a lower bound (we wouldn’t usually give the AI unrestricted access to that panel) and 2) that the minimum number of operations required to bring about a catastrophic event is probably still larger than 1.
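One toy way to make 2) a bit more quantitative, under the (strong, made-up) assumption that “everything evolves as usual” corresponds to the $m$ relevant panel inputs being chosen uniformly at random from $b$ options each: if the AI instead fixes those $m$ inputs and thereby guarantees some event $E$, then

$$p(E) \;\ge\; \Pr[\text{the random inputs happen to match the AI's choice}] \cdot 1 \;=\; b^{-m}, \qquad\text{so}\qquad m \;\ge\; \frac{-\ln p(E)}{\ln b}.$$

So in that toy setting the number of panel operations is lower-bounded by the surprisal of $E$ under the default dynamics, at a rate of at most $\ln b$ nats per operation.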
Thanks for clarifying!
Maybe the ‘actions → nats’ mapping can be sharpened if it’s not an AI but a very naive search process?
Say the controller can sample k outcomes at random before choosing one to actually achieve. I think that lets it get ~ln(k) extra nats of surprise, right? Then you can talk about the AI’s ability to control things in terms of ‘the number of random samples you’d need to draw to achieve this much improvement’.
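A quick simulation to sanity-check the ln(k) figure, modelling the controller as literally taking the best of k i.i.d. samples and measuring “surprise” as the surprisal, under the base distribution, of the quantile it achieves (the setup and names here are mine, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_excess_surprise(k, trials=10_000):
    """Average surprisal (in nats) of the quantile achieved by
    best-of-k sampling, measured under the base distribution."""
    # Only the *rank* of the controller's score matters, so each candidate's
    # base-distribution quantile can be modelled as a Uniform(0, 1) draw.
    u = rng.uniform(size=(trials, k))
    best = u.max(axis=1)
    # The chance of doing at least this well with a single random draw
    # is (1 - best); its surprisal is -ln(1 - best).
    return -np.log1p(-best).mean()

for k in (1, 10, 100, 1000):
    print(f"k={k:5d}  simulated={mean_excess_surprise(k):.2f}  ln(k)={np.log(k):.2f}")
```

The exact expectation is the k-th harmonic number, ≈ ln(k) + 0.577, so “~ln(k)” is right up to a small constant.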
This sounds right to me! In particular, I just (re-)discovered this old post by Yudkowsky and this newer post by Alex Flint, which both go a lot deeper on the topic. I think the optimal control perspective is a nice complement to those posts, and if I find the time to look more into this, then that line of work is probably the right direction.