I’ve been thinking a little more about the high-level motivation of measurement tampering, and struggling to think through when measurement tampering detection itself is actually going to be useful for mitigating x-risk. Like, is human/AI feedback considered a robust measurement device? If no, then what is the most alignment-relevant domain MTD could be applied to? If yes, do the structural properties of measurement that supposedly make it easier than general ELK still hold?
The remarks by Erik Jenner seem good, but I would also just consider the following setup:
Imagine that we have a human looking at an AI’s actions and a wide variety of predictions of the future conditional on executing this action. The human can make recursive queries to an AI assistant to attempt to better understand the action or the predictions (recursive reward modeling).
These predictions can correspond to any observation a human might make in the future, e.g. you can imagine training our predictor to predict any of our possible future observations including arbitrary tests we might run.
The human will make a deliberate effort to look for evidence of tampering or other problems when looking at the predicted measurements. Evidence of tampering can look like inconsistent measurements that imply that the measurements are no longer tracking the variable of interest.
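To make this concrete, here's a quick toy sketch of the sort of consistency check I have in mind (my own illustration; the measurement names and tolerance are made up, and a real check would of course be much richer):

```python
# Toy consistency check over predicted measurements: redundant measurements of the
# same underlying variable should roughly agree, and large disagreement is evidence
# that some of them are no longer tracking that variable (i.e. possible tampering).

def flag_inconsistent_measurements(predicted_measurements, tolerance=0.2):
    """Return True if redundant measurements of the same quantity disagree too much."""
    values = list(predicted_measurements.values())
    return max(values) - min(values) > tolerance

# Three hypothetical measurement channels that should all track the same thing.
predictions = {
    "sensor_a": 0.91,
    "sensor_b": 0.88,
    "self_report_survey": 0.15,  # suspiciously out of line with the other two
}
print(flag_inconsistent_measurements(predictions))  # True -> investigate possible tampering
```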
Now, what can go wrong?
Problematic consequences which couldn’t at all be captured by a human looking at these measurements, because either:
The visible or human-understandable bad consequences take a long time to manifest. (Longer than the prediction window.)
The consequences aren’t something that humans can ever understand (e.g. the actions involve torturing invisible fairies that humans don’t understand or know about).
Problematic consequences which would “by default” be captured by measurements, but which can be covered up. In this case, we hope MTD is applicable.
Cases which are a mix of the prior two: not exactly “by default” captured, but also not impossible to capture.
When viewed from this perspective, MTD seems useful, as we often care about consequences that are in principle possible to measure and observe. And this might correspond to the cases where MTD is applicable (e.g. the measurements have consistency properties and/or explicit tampering (which can fail) is required).
It’s in principle possible to solve issues with consequences taking a very long time to manifest (much longer than any available training data) or not being human-understandable via doing something recursive to implement indirect normativity as described in the ELK appendix I linked below, but it’s very unclear how exactly this is supposed to work. (How can we ensure we always elicit any knowledge our AI has about the very long run future when we don’t have any training data that relates to this?)
Note that for MTD to be useful, we have to ensure that our AI’s predictions about the future reflect its knowledge. This is relatively easy in the “average” or low-stakes case (we just keep training it to predict the future), but could require additional machinery in the high-stakes case.
(Also, in the high-stakes case, we might run into issues where a given observation doesn’t make sense: you can’t observe something if you’re dead.)
(from conversation with Erik Jenner) roughly 3 classes of applications
MTD all the way down
Come up with a bunch of measurements of things going well (humans self-report being very happy, news stories are good, or something). Use GPT-N to predict the measurements and detect measurement tampering, conditional on proposed actions of the policy. Reward actions that GPT-N predicts will increase measures of things going well without tampering. (See the toy sketch after this list for how these reward signals could be combined.)
Other Scalable Oversight + MTD as reward function / side constraint
Use whatever scalable oversight protocol seems good (Debate, IDA, etc.) for the “primary” training signal, while also setting up a bunch of measurements of safety properties (e.g. integrity of the binaries on the server the model is hosted on). Do RL on the scalable oversight signal while constantly screening for and negatively rewarding actions flagged as tampering.
Other Scalable Oversight + MTD as extra safety check
Same as before, but you’re leaning more on the scalable oversight to avoid tampering in most cases, and detecting measurement tampering as a precautionary step (i.e. if you detect measurement tampering, you Pause/Stop).
(some ways in which these classes overlap, e.g. using your scalable oversight protocol as the measurement(s))
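To make these classes a bit more concrete, here's a toy sketch of how the reward signals could be wired together. predict_measurements, detect_tampering, and oversight_score are hypothetical stand-ins for a GPT-N measurement predictor, an MTD detector, and a scalable oversight protocol (Debate, IDA, etc.); this is just an illustration of the three classes, not a worked-out proposal.

```python
from typing import Callable, Optional, Sequence

def reward_for_action(
    action: str,
    predict_measurements: Callable[[str], Sequence[float]],   # GPT-N-style measurement predictor
    detect_tampering: Callable[[str], bool],                   # measurement tampering detector
    oversight_score: Optional[Callable[[str], float]] = None,  # Debate/IDA/etc. signal, if any
    tampering_penalty: float = 10.0,
) -> float:
    """Toy reward combining the three classes of MTD application sketched above."""
    if detect_tampering(action):
        # Class 2: negatively reward flagged actions as a side constraint.
        # Class 3 would instead treat this as a tripwire: Pause/Stop rather than just penalize.
        return -tampering_penalty
    if oversight_score is not None:
        # Classes 2 and 3: the primary training signal comes from the scalable oversight protocol.
        return oversight_score(action)
    # Class 1, "MTD all the way down": reward predicted increases in the
    # (untampered) measurements of things going well.
    return float(sum(predict_measurements(action)))
```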
As far as (1) and similar things, you might find the narrow elicitation appendix in the ELK report and the next several appendices (prior to generative model errors) worthwhile.