I’d be scared that the “Am I tricking you?” head just works by:
Predicting what the human will predict
Predicting what will actually happen
Outputting a high value iff the human’s prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering.
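In pseudocode, the worry is that the head computes something like this (a minimal sketch; `human_model`, `world_model`, and the confidence threshold are hypothetical stand-ins, not anything from the report):

```python
def am_i_tricking_you(state, human_model, world_model, confidence_threshold=0.9):
    """Feared implementation of the 'Am I tricking you?' head.

    It just compares the human's prediction against the predictor's own best
    guess, so it only fires when the discrepancy is something the human's
    predictive model can see: detectable tampering gets reported, undetectable
    tampering does not.
    """
    human_confidence, human_prediction = human_model.predict(state)  # what the human will predict
    _, actual_outcome = world_model.predict(state)                   # what will actually happen

    human_is_confident = human_confidence > confidence_threshold
    human_is_wrong = human_prediction != actual_outcome
    return 1.0 if (human_is_confident and human_is_wrong) else 0.0
```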
To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren’t, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there’s a big genre of proposals that try to leverage that kind of structure, which might be promising (though it’s not the kind of thing I’m thinking about right now).
Tweaking your comment slightly:
I’d be scared that the “Am I tricking you?” head just works by:
Predicting what the human will predict [when experiment E is performed]
Predicting what will actually happen [when experiment E is performed]
Outputting a high value iff the human’s prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering.
Yes, this is correct for the Value head. But how does detectable vs. undetectable tampering apply to this builder strategy? Compared to what’s in the report, this strategy constructs new sensors as needed. The Proposer head is designed to optimise E, which ought to make more tampering detectable, and I have an intuition that it makes all tampering detectable.
Suppose the value head learns to predict “Will the human be confidently wrong about the outcome of this experiment,” where an ‘experiment’ is a natural language description of a sequence of actions that the human could execute. And then the experiment head produces natural language descriptions of actions that a human could take for which they’d be confidently wrong.
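To check we’re picturing the same interface, I have in mind something like the following (a sketch; both heads are stand-ins for learned components, and the search over candidate descriptions is a detail I’m filling in):

```python
def value_head(trajectory, experiment_description):
    """Predicted probability that the human would be confidently wrong about the
    outcome of this experiment, were they to execute it after `trajectory`.
    (Stand-in for a learned head.)"""
    raise NotImplementedError("learned model goes here")

def experiment_head(trajectory, candidate_descriptions):
    """Pick the natural-language experiment the human would be most confidently
    wrong about. (Whether the head generates descriptions directly or scores
    candidates like this is a detail I'm filling in.)"""
    return max(candidate_descriptions, key=lambda d: value_head(trajectory, d))
```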
What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?
ETA: here’s my best guess after reading the other comment: after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from “an experiment the human would be confidently wrong about”, since a human who doesn’t understand the environment dynamics will always have tons of experiments they are confidently wrong about; instead we want to find an experiment that causes them to update strongly toward believing that tampering occurred. Is that right? (I sketch the loop I have in mind below.)
If so it seems like there are a few problems:
The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human’s errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work, you can construct new sensors that will appear to report tampering).
If you tamper with the mechanism by which the human “executes” the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor), then it seems like the experiment proposer will always lose. This may depend on the details of exactly how the setup works.
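To pin down that guess, here’s the loop I’m imagining (a sketch under my reading above; `proposer`, `human`, and their methods are hypothetical interfaces, not anything specified in the proposal):

```python
def score_proposed_experiment(trajectory, proposer, human):
    """Hypothetical scoring loop for the experiment proposer.

    After the SmartVault takes a sequence of actions, ask the proposer for an
    experiment, have the human execute it, and reward the proposer by how much
    the results move the human toward believing tampering occurred (rather than
    by whether the human was confidently wrong about something arbitrary).
    """
    experiment = proposer.propose(trajectory)  # natural-language setup + readout instructions

    prior = human.credence_in_tampering(trajectory, observations=None)
    observations = human.execute(experiment)
    posterior = human.credence_in_tampering(trajectory, observations=observations)

    # Note: as written this also rewards the first problem above: an experiment
    # that itself tampers, or fakes evidence of tampering, gets a high score.
    return posterior - prior
```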
Like Mark, I do expect forms of tampering that always look fine according to sensors. I agree that beliefs need to cash out in anticipated experience, but it still seems possible to create inputs on which e.g. your camera is totally disconnected from reality.
Proposing experiments that more specifically expose tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.
How do we use this to construct new sensors that allow the human to detect tampering?
I know this is crossed out, but I thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.
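Concretely, I’m picturing the proposer’s output as something like this (illustrative structure only; the field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProposedExperiment:
    """Illustrative shape of a proposed experiment; everything is natural
    language addressed to the human, and the field names are hypothetical."""
    setup_instructions: List[str]      # steps the human follows to set the experiment up
    readout_instructions: List[str]    # how to read the results
    new_sensor_instructions: List[str] = field(default_factory=list)  # optional: sensors to build first
```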
The proposed experiment could itself perform tampering
Yep, this is a problem. “Was I tricking you?” isn’t being distinguished from “Can I trick you after the fact?”.
The other problems seem like real problems too; more thought required....