The remarks by Erik Jenner seem good, but I would also just consider the following setup:
Imagine that we have a human looking at an AI’s actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling).
These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations including arbitrary tests we might run.
The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.
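To make this concrete, here is a minimal sketch (in Python) of the kind of consistency check the overseer might run over the predicted measurements. Everything here is illustrative: the data structures, the measurement names, and the diamond-in-vault example are placeholders, not a proposed implementation.

```python
# Hypothetical sketch: flag predicted measurements that disagree about the
# underlying variable they are supposed to track.
from dataclasses import dataclass


@dataclass
class PredictedObservation:
    name: str     # which measurement this is (e.g. "camera_1", "pressure_plate")
    tracks: str   # the underlying variable it is supposed to track
    value: float  # the predicted reading, conditional on executing the action


def flag_tampering(predictions: list[PredictedObservation], tol: float = 0.1) -> list[str]:
    """Flag variables whose redundant measurements disagree.

    If several measurements are all supposed to track the same underlying
    variable but the predicted readings are inconsistent, that is evidence the
    measurements are no longer tracking the variable of interest.
    """
    by_variable: dict[str, list[float]] = {}
    for obs in predictions:
        by_variable.setdefault(obs.tracks, []).append(obs.value)

    flagged = []
    for variable, values in by_variable.items():
        if len(values) > 1 and max(values) - min(values) > tol:
            flagged.append(variable)
    return flagged


# Illustrative example: two sensors that should agree about whether the diamond
# is still in the vault.
predicted = [
    PredictedObservation("camera_1", "diamond_in_vault", 0.95),
    PredictedObservation("pressure_plate", "diamond_in_vault", 0.05),
]
print(flag_tampering(predicted))  # ['diamond_in_vault'] -> human investigates further
```

The point is just that redundant measurements of the same underlying variable give the overseer something concrete to check; the hard cases are the ones where tampering makes these checks pass anyway.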
Now, what can go wrong?
- Problematic consequences which couldn’t be captured at all by a human looking at these measurements, because either:
  - The visible or human-understandable bad consequences take a long time to manifest (longer than the prediction window).
  - The consequences aren’t something that humans can ever understand (e.g. the actions involve torturing invisible fairies that humans don’t understand or know about).
- Problematic consequences which would “by default” be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
- Cases which are a mix of the prior two: not exactly “by default” captured, but also not impossible to capture.
When viewed from this perspective, MTD seems useful, as we often seem to care about consequences that are in principle possible to measure and observe. And these might correspond to the cases where MTD is applicable (e.g. cases where the measurements have consistency properties and/or where explicit tampering (which can fail) is required).
It’s in principle possible to solve issues with consequences taking a very long time to manifest (much longer than any available training data covers) or not being human-understandable by doing something recursive to implement indirect normativity as described in the ELK appendix I linked below, but it’s very unclear how exactly this is supposed to work. (How can we ensure we always elicit any knowledge our AI has about the very long-run future when we don’t have any training data that relates to this?)
Note that for MTD to be useful, we have to ensure that our AI’s predictions about the future reflect its knowledge. This is relatively easy in the “average” or low-stakes case (we just keep training it to predict the future), but could require additional machinery in the high-stakes case.
(Also, in the high-stakes case, we might run into issues where a given observation doesn’t make sense: you can’t observe something if you’re dead.)
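For the low-stakes case above, here is a toy sketch of what “we just keep training it to predict the future” could look like. The Predictor class and its update rule are stand-ins for an actual learned model, purely for illustration.

```python
# Toy sketch of the low-stakes / "average case" claim: keep training the
# predictor on what we actually end up observing, so its predictions on
# average track the realized observations.
import random


class Predictor:
    """Stand-in for a learned model that predicts a future observation given an action."""

    def __init__(self) -> None:
        self.bias = 0.0  # toy "parameter" so the update step does something

    def predict(self, action: str) -> float:
        return self.bias

    def update(self, action: str, realized: float, lr: float = 0.1) -> None:
        # Toy gradient step toward the realized observation.
        self.bias += lr * (realized - self.predict(action))


def low_stakes_training(predictor: Predictor, episodes: int = 1000) -> None:
    for _ in range(episodes):
        action = "deploy_policy"           # whatever action we are overseeing
        _predicted = predictor.predict(action)
        realized = random.gauss(1.0, 0.1)  # what we actually observe later
        # Errors in any single episode are low-stakes; continued training
        # drives the average prediction toward the realized distribution.
        predictor.update(action, realized)


predictor = Predictor()
low_stakes_training(predictor)
print(round(predictor.predict("deploy_policy"), 2))  # ~1.0 after training
```

The high-stakes case is exactly where this kind of loop breaks down: a single bad episode can matter, and (as noted above) the relevant observation may never be made at all.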