I’d be scared that the “Am I tricking you?” head just works by:
Predicting what the human will predict
Predicting what will actually happen
Outputting a high value iff the human’s prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering.
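In pseudocode, the worry is that the head computes something like this (a minimal sketch; `human_model`, `world_model`, and the confidence threshold are hypothetical stand-ins, not anything from the report):

```python
def am_i_tricking_you(state, human_model, world_model, confidence_threshold=0.9):
    """Feared implementation of the 'Am I tricking you?' head.

    It just compares the human's prediction against the predictor's own best
    guess, so it only fires when the discrepancy is something the human's
    predictive model can see: detectable tampering gets reported, undetectable
    tampering does not.
    """
    human_confidence, human_prediction = human_model.predict(state)  # what the human will predict
    _, actual_outcome = world_model.predict(state)                   # what will actually happen

    human_is_confident = human_confidence > confidence_threshold
    human_is_wrong = human_prediction != actual_outcome
    return 1.0 if (human_is_confident and human_is_wrong) else 0.0
```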
To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren’t, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there’s a big genre of proposals that try to leverage that kind of structure, which might be promising (though it’s not the kind of thing I’m thinking about right now).
Tweaking your comment slightly:
I’d be scared that the “Am I tricking you?” head just works by:
Predicting what the human will predict [when experiment E is performed]
Predicting what will actually happen [when experiment E is performed]
Outputting a high value iff the human’s prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering.
Yes, this is correct for the Value head. But how does detectable vs. undetectable tampering apply to this builder strategy? Compared to what’s in the report, this strategy constructs new sensors as needed. The Proposer head is designed to optimise E, which ought to make more tampering detectable, and I have an intuition that it makes all tampering detectable.
Suppose the value head learns to predict “Will the human be confidently wrong about the outcome of this experiment,” where an ‘experiment’ is a natural language description of a sequence of actions that the human could execute. And then the experiment head produces natural language descriptions of actions that a human could take for which they’d be confidently wrong.
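To check we’re picturing the same interface, I have in mind something like the following (a sketch; both heads are stand-ins for learned components, and the search over candidate descriptions is a detail I’m filling in):

```python
def value_head(trajectory, experiment_description):
    """Predicted probability that the human would be confidently wrong about the
    outcome of this experiment, were they to execute it after `trajectory`.
    (Stand-in for a learned head.)"""
    raise NotImplementedError("learned model goes here")

def experiment_head(trajectory, candidate_descriptions):
    """Pick the natural-language experiment the human would be most confidently
    wrong about. (Whether the head generates descriptions directly or scores
    candidates like this is a detail I'm filling in.)"""
    return max(candidate_descriptions, key=lambda d: value_head(trajectory, d))
```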
What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?
ETA: here’s my best guess after reading the other comment: after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from “an experiment the human would be confidently wrong about”, since a human who doesn’t understand the environment dynamics will always have tons of experiments they are confidently wrong about; instead we want to find an experiment that causes them to update strongly toward believing that tampering occurred. Is that right? (I sketch the loop I have in mind below.)
If so it seems like there are a few problems:
The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human’s errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work, you can construct new sensors that will appear to report tampering).
If you tamper with the mechanism by which the human “executes” the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor), then it seems like the experiment proposer will always lose. This may depend on the details of exactly how the setup works.
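To pin down that guess, here’s the loop I’m imagining (a sketch under my reading above; `proposer`, `human`, and their methods are hypothetical interfaces, not anything specified in the proposal):

```python
def score_proposed_experiment(trajectory, proposer, human):
    """Hypothetical scoring loop for the experiment proposer.

    After the SmartVault takes a sequence of actions, ask the proposer for an
    experiment, have the human execute it, and reward the proposer by how much
    the results move the human toward believing tampering occurred (rather than
    by whether the human was confidently wrong about something arbitrary).
    """
    experiment = proposer.propose(trajectory)  # natural-language setup + readout instructions

    prior = human.credence_in_tampering(trajectory, observations=None)
    observations = human.execute(experiment)
    posterior = human.credence_in_tampering(trajectory, observations=observations)

    # Note: as written this also rewards the first problem above: an experiment
    # that itself tampers, or fakes evidence of tampering, gets a high score.
    return posterior - prior
```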
Like Mark, I do expect forms of tampering that always look fine according to sensors. I agree that beliefs need to cash out in anticipated experience, but it still seems possible to create inputs on which e.g. your camera is totally disconnected from reality.
Proposing experiments that more specifically expose tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.
How do we use this to construct new sensors that allow the human to detect tampering?
I know this is crossed out, but I thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.
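Concretely, I’m picturing the proposer’s output as something like this (illustrative structure only; the field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ProposedExperiment:
    """Illustrative shape of a proposed experiment; everything is natural
    language addressed to the human, and the field names are hypothetical."""
    setup_instructions: List[str]      # steps the human follows to set the experiment up
    readout_instructions: List[str]    # how to read the results
    new_sensor_instructions: List[str] = field(default_factory=list)  # optional: sensors to build first
```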
The proposed experiment could itself perform tampering
Yep, this is a problem. “Was I tricking you?” isn’t being distinguished from “Can I trick you after the fact?”.
The other problems seem like real problems too; more thought required....