I think that argument applies just as easily to a human as to a model, doesn’t it?
So it seems like you are making an equally strong claim: “if a human tries to write down something that looks like good alignment work, almost all of it will be persuasive but bad.” And I think that’s kind of true and kind of not true. In general, I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than by trying to make a counting argument.
(I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience that knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not carry over nearly as well to an AI: an AI’s thought process is far more likely to be very different from ours, such that it would be easier for the AI to fake a human-legible reasoning path than to genuinely follow one from the start.
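To put a rough number on “a ton of bits” (the figures here are invented purely for illustration): suppose only one in $2^{10}$ persuasive-looking proposals is actually good. An evaluator who sees only the output starts from prior odds of about $2^{-10}$, and by Bayes

$$
\frac{P(\text{good}\mid E)}{P(\text{bad}\mid E)} \;=\; \frac{P(\text{good})}{P(\text{bad})}\cdot\frac{P(E\mid\text{good})}{P(E\mid\text{bad})} \;\approx\; 2^{-10}\cdot 2^{b},
$$

so they need evidence $E$ worth $b \gtrsim 10$ bits before the posterior favors “good.” Introspective access to the generating process is a source of exactly such bits: it is evidence about how the proposal was produced, not just about what it says.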
> (I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
I think “how smart the human is” is not a key consideration.
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me. If that’s a good summary of the disagreement I’m happy to just leave it there.
> A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me.
Yup, that sounds like a crux. Bookmarked for later.
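(As a toy gesture at the kind of asymmetry this crux is about, not a claim about alignment itself: subset-sum is a domain where evaluating a candidate answer is trivially cheap even though generating one naively requires exponential search. The function names below are just illustrative.)

```python
from itertools import combinations

def evaluate(nums, target, subset):
    """Cheap: check a proposed answer in time linear in its size."""
    return set(subset) <= set(nums) and sum(subset) == target

def generate(nums, target):
    """Expensive: brute-force search over all 2^n subsets."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

nums, target = [3, 34, 4, 12, 5, 2], 9
solution = generate(nums, target)                  # exponential-time search
print(solution, evaluate(nums, target, solution))  # linear-time check: [4, 5] True
```

The toy only shows that “evaluation is never easier than generation” is false in at least some domains; whether alignment work behaves more like this, or more like a domain where persuasive-but-bad outputs pass the check, is exactly the crux being bookmarked.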