johnswentworth comments on [Link] A minimal viable product for alignment

johnswentworth 9 Apr 2022 23:18 UTC
LW: 4 AF: 3
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI’s thought process would be very different from ours, such that it would be easier to fake a human-legible path than to truly follow one from the start.
> (I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
I think “how smart the human is” is not a key consideration.
- paulfchristiano 10 Apr 2022 4:50 UTC
  LW: 9 AF: 7
  I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me. If that’s a good summary of the disagreement I’m happy to just leave it there.
  - johnswentworth 10 Apr 2022 16:09 UTC
    LW: 6 AF: 6
    > A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me.
    Yup, that sounds like a crux. Bookmarked for later.