ryan_greenblatt comments on Benchmarks for Detecting Measurement Tampering [Redwood Research]

ryan_greenblatt 6 Sep 2023 17:15 UTC
LW: 2 AF: 1
0
AF
It’s important to note that we’re discussing low-stakes oversight failures here.

The failure mode could manifest as the model taking numerous actions over a significant period, culminating in it taking over the world. Our hope would be that these actions have measurable effects on a relatively small part of the world—the portion the AI can influence before taking over. This will depend on the AI’s action space, but it seems plausible for a power plant AI.

This assumes that the AI has not already escaped in a few actions. We would classify such an escape in a small number of actions as a high-stakes failure. If the AI gradually escaped through a large number of actions over a considerable period, we would be back to a low-stakes failure scenario, and the argument from the previous paragraph could apply.
- lberglund 8 Sep 2023 14:35 UTC
  LW: 1 AF: 1
  0
  AF Parent
  Thanks for clarifying. Preventing escape seems like a promising way to prevent these sorts of problems. On the other hand, I’m having trouble imagining ways in which we could have sensors that can pick up on whether a model has escaped. (Maybe info-sec people have thought more about this?)