johnswentworth comments on Externalized reasoning oversight: a research direction for language model alignment

johnswentworth 5 Aug 2022 1:58 UTC
I don’t think those are the most relevant points (although I can see why they’re the most salient). I’d say sections B.2-B.3 are the real dealbreaker for most proposals. In particular:
- 18: There’s no reliable Cartesian-sensory ground truth about how good the AI’s answers are. There just isn’t a reliable training signal.
- 20: Human raters make systematic errors—regular, compactly describable, predictable errors. So we can’t use human ratings to figure out which outputs are good without selecting for outputs which exploit human biases.
- 27: Optimizing against an interpreted thought optimizes against interpretability. So whatever oversight we’re using, we can’t optimize against it; if the oversight shows the AI doing something bad, we can’t just give negative reward to those interpretable-bad-thoughts.
- 29: Human beings cannot inspect an AGI’s output to determine whether the consequences will be good. This is similar to 18 & 20; it means we don’t have a ground-truth training signal for “good answers”. (My comment at the top of this thread essentially argues that this same problem also applies to the AI’s internal thoughts, not just its outputs.)
- 32, which talks about text-based oversight in particular.
The bottom line of all of these is: even setting aside inner alignment failures, we just do not have an outer training signal for reliably good answers.
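As a toy illustration of point 20 (a sketch under made-up assumptions, not a model of any real rating process): suppose each candidate output has a true quality and, independently, some degree to which it exploits a systematic rater bias, and the human rating is the sum of the two. The names `simulate`, `bias_weight`, and the Gaussian distributions are all hypothetical choices for illustration. Selecting the top-rated outputs then selects for bias exploitation, and gives lower true quality than selecting on quality directly would.

```python
import random
from statistics import mean


def simulate(n=10_000, k=100, bias_weight=1.0, seed=0):
    """Toy model: select the top-k outputs by a biased human rating.

    Assumption (hypothetical): rating = quality + bias_weight * exploit,
    with quality and exploit drawn independently from a standard normal.
    """
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        quality = rng.gauss(0, 1)   # how good the output really is
        exploit = rng.gauss(0, 1)   # how much it plays to a rater bias
        rating = quality + bias_weight * exploit
        candidates.append((rating, quality, exploit))

    # What we can actually do: pick the outputs humans rate highest.
    top_by_rating = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    # What we would do with a ground-truth signal (unavailable in practice).
    top_by_quality = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]

    return {
        "population_exploit": mean(c[2] for c in candidates),
        "selected_exploit": mean(c[2] for c in top_by_rating),
        "selected_quality": mean(c[1] for c in top_by_rating),
        "best_possible_quality": mean(c[1] for c in top_by_quality),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.2f}")
```

In this toy setup the selected outputs score far above the population average on bias exploitation, and the gap to the best achievable quality disappears only when `bias_weight` is zero, i.e. when the raters have no systematic error.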
- johnswentworth 5 Aug 2022 1:58 UTC
  Now, the usual way to get around that problem is to have the system mimic humans. Then, the argument goes, the system’s answers will be about as good as what the humans would have figured out anyway, maybe given a bit more time. (Not a lot more time: extrapolating too far into the future is itself failure-prone, since we push further and further out of the training distribution as we ask for hypothetical text from further and further into the future.) This is a fine plan, but inherently limited in how much it buys us. It mostly works in worlds where we were already close to solving the problem ourselves; the further we were from solving it, the less likely it is that the human-mimicking AI will close the gap. E.g. if I already have most of the pieces, then GPT-6 is much more likely to generate a workable solution when prompted to give “John Wentworth’s Solution To The Alignment Problem” from the year 2035, whereas if I don’t already have most of the pieces, that is much less likely to work. And that’s true even if I will eventually figure out the pieces; the problem is that the pieces I haven’t figured out yet aren’t in the training distribution. So mostly, the way to increase the chances of that strategy working is by getting closer to solving the problem ourselves.