Ethan Perez comments on Externalized reasoning oversight: a research direction for language model alignment

Ethan Perez 4 Aug 2022 21:09 UTC
LW: 10 AF: 5
0
AF
Thanks, this is a helpful comment—can you elaborate on why “ask the AI which questions we should ask” would fail (or point to relevant discussion)? I’m thinking that we would use many different models (not just the model doing the reasoning), including smaller ones, and trained or prompted in different ways, to catch generated text that would cause harmful side effects. We could have all of these models use externalized reasoning as well, to help aid in the supervision/oversight. This obviously doesn’t eliminate all of the risk, since all of the models can coordinate to cause catastrophic side effects; that said, I’d guess that coordination is much harder when you’re using many different models in different ways, and these models can only communicate via externalized reasoning
- johnswentworth 5 Aug 2022 0:26 UTC
  LW: 5 AF: 3
  0
  AF Parent
  “Ask the AI which questions we should ask” is strategically isomorphic to “ask the AI what we should do”. It’s mainly a values question, and only works insofar as we’ve already thoroughly solved the alignment problem for the system at hand. Like, getting a good answer to “ask the AI what we should do” (or “ask the AI which questions we should ask”) is basically a framing of the whole alignment problem.
  Do you already have cached understanding for why “ask the AI what we should do” doesn’t work? If not, and assuming you have already brainstormed on your own for at least an hour, List of Lethalities has multiple. (Indeed, IIRC the large majority of the failure modes on List of Lethalities apply.)
  - Ethan Perez 5 Aug 2022 1:15 UTC
    LW: 3 AF: 2
    1
    AF Parent
    Yes, I have brainstormed about this question for quite a while, and I have read the List of Lethalities. I was mainly wondering if you had a specific concern with the proposal I outlined and/or additional concerns beyond that List.
    The most relevant points from that list in this context I think are (34) and (35), that the agents competing against each other can coordinate in a way that doesn’t involve or benefit humanity. My main uncertainty in this context is how the different agents and supervisors could learn to coordinate, when they can only communicate via human-readable text. They could:
    Use steganography to communicate. Here, I’m unsure on how they’d agree on a protocol for communication if the agent and supervisor don’t have shared weights. I’m sure there’s some loophole I’m missing due to lack of imagination
    Use the text channel itself to communicate to other agents, risking getting caught by human overseers, but relying on the fact that there’s too much communication for us to directly monitor. Here, I can see this risk being mitigated by us using smaller models that are less likely to be power-seekers in overseeing the text communication between model for signs of coordination. The smaller models will be less capable but could e.g. be amplified with decomposition/debate
    Curious if/why you think these strategies are likely to fail!
    - johnswentworth 5 Aug 2022 1:58 UTC
      LW: 5 AF: 5
      1
      AF Parent
      I don’t think those are the most relevant points (although I can see why they’re the most salient). I’d say sections B.2-B.3 are the real dealbreaker for most proposals. In particular:
      18: There’s no reliable Cartesian-sensory ground truth about how good the AI’s answers are. There just isn’t a reliable training signal.
      20: Human raters make systematic errors—regular, compactly describable, predictable errors. So we can’t use human ratings to figure out which outputs are good without selecting for outputs which exploit human biases.
      27: Optimizing against an interpreted thought optimizes against interpretability. So whatever oversight we’re using, we can’t optimize against it; if the oversight shows the AI doing something bad, we can’t just give negative reward to those interpretable-bad-thoughts.
      29: Human beings cannot inspect an AGI’s output to determine whether the consequences will be good. This is similar to 18 & 20; it means we don’t have a ground truth training signal for “good answers”. (My comment at the top of this thread essentially argues that this same problem also applies to the AI’s internal thoughts, not just its outputs.)
      32, which talks about text-based oversight in particular.
      The bottom line of all of these is: even setting aside inner alignment failures, we just do not have an outer training signal for reliably good answers.
      - johnswentworth 5 Aug 2022 1:58 UTC
        LW: 3 AF: 3
        1
        AF Parent
        Now, the usual way to get around that problem is to have the system mimic humans. Then, the argument goes, the system’s answers will be about as good as the humans would have figured out anyway, maybe given a bit more time. (Not a lot more time, because extrapolating too far into the future is itself failure-prone, since we push further and further out of the training distribution as we ask for hypothetical text from further and further into the future.) This is a fine plan, but inherently limited in terms of how much it buys us. It mostly works in worlds where we were already close to solving the problem ourselves, and the further we were from solving it, the less likely that the human-mimicking AI will close the gap. E.g. if I already have most of the pieces, then GPT-6 is much more likely to generate a workable solution when prompted to give “John Wentworth’s Solution To The Alignment Problem” from the year 2035, whereas if I don’t already have most of the pieces then that is much less likely to work. And that’s true even if I will eventually figure out the pieces; the problem is that the pieces I haven’t figured out yet aren’t in the training distribution. So mostly, the way to increase the chances of that strategy working is by getting closer to solving the problem ourselves.