Is your story:
1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
2. Actually, if a human was trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.
It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations “fool us” is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
Number of proposals which look good to the human
Number of proposals which look good to the human AND are actually good
Now, the key heuristic: in a high-dimensional space, adding any non-simple constraint will exponentially shrink the search space. “Number of proposals which look good to the human AND are actually good” has one more complicated constraint than “Number of proposals which look good to the human”, and will therefore be exponentially smaller.
So in “it would be much easier to trick us than to write down a good proposal”, the relevant operationalization of “easier” for this argument is “the number of proposals which both look good and are good is exponentially smaller than the number which look good”.
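To make the counting heuristic concrete, here is a toy sketch (the dimensions, constraints, and counts below are made up purely for illustration, not estimates about real proposals): documents are vectors of independent binary choices, "looks good to the human" fixes some of those choices, and "is actually good" fixes additional ones, so each extra independent constraint bit roughly halves the surviving set.

```python
import random

random.seed(0)

# Toy model of the counting heuristic. All numbers are hypothetical:
# a "document" is DIM independent binary choices, "looks good to the human"
# fixes LOOKS_GOOD_BITS of them, and "is actually good" fixes IS_GOOD_BITS more.
DIM = 30
LOOKS_GOOD_BITS = 8
IS_GOOD_BITS = 8

looks_good = {i: random.randint(0, 1)
              for i in random.sample(range(DIM), LOOKS_GOOD_BITS)}
remaining = [i for i in range(DIM) if i not in looks_good]
is_good = {i: random.randint(0, 1)
           for i in random.sample(remaining, IS_GOOD_BITS)}

def satisfies(doc, constraint):
    return all(doc[i] == v for i, v in constraint.items())

# Sample documents uniformly and apply the two filters in sequence.
docs = [[random.randint(0, 1) for _ in range(DIM)] for _ in range(200_000)]
pass_looks_good = [d for d in docs if satisfies(d, looks_good)]
pass_both = [d for d in pass_looks_good if satisfies(d, is_good)]

# The second count is smaller by roughly a factor of 2**IS_GOOD_BITS:
# each extra independent constraint bit halves the search space.
print(len(pass_looks_good), len(pass_both))
```

Real proposals are of course not independent bit-constraints; the sketch only shows why adding a non-trivial extra filter tends to shrink the count multiplicatively rather than additively.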
I think that argument applies just as easily to a human as to a model, doesn’t it?
So it seems like you are making an equally strong claim that “if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad.” And I think that’s kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument.
(I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI’s thought process would be very different from ours, such that it would be easier for it to fake a human-legible path than to truly follow one from the start.
> (I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
I think “how smart the human is” is not a key consideration.
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me. If that’s a good summary of the disagreement I’m happy to just leave it there.
> A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me.
Yup, that sounds like a crux. Bookmarked for later.
I strongly agree with you that it’ll eventually be very difficult for humans to tell AI-generated alignment proposals that look good but aren’t good apart from ones that look good and are actually good.
There is a much stronger version of the claim “alignment proposals are easier to evaluate than to generate” that I think we’re discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones, or at least not accept any bad ones (precision matters much more than recall here, since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimal viable product. Personally I think that this strong version of the claim is unlikely to be true, but I’m not certain that it will be false for the first systems that can do useful alignment research.
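To illustrate why precision matters more than recall here, a minimal sketch (the acceptance rates below are invented numbers, not estimates): if the evaluator rarely endorses bad proposals, then a very low acceptance rate on good ones can be compensated by generating and screening more candidates, whereas a high false-positive rate cannot be fixed by more sampling.

```python
# Toy precision-vs-recall model with made-up numbers: the generator produces a
# good proposal with probability p_good; the evaluator accepts good proposals
# with probability `recall` and bad proposals with probability `fp_rate`.
def precision_of_accepted(p_good: float, recall: float, fp_rate: float) -> float:
    accepted_good = p_good * recall
    accepted_bad = (1 - p_good) * fp_rate
    return accepted_good / (accepted_good + accepted_bad)

# High precision, low recall: almost everything accepted is actually good, so
# low recall just means sampling more proposals (i.e. spending more compute).
print(precision_of_accepted(p_good=1e-3, recall=0.05, fp_rate=1e-6))  # ~0.98

# Low precision: extra samples don't help; the accepted pool stays mostly bad.
print(precision_of_accepted(p_good=1e-3, recall=0.9, fp_rate=0.05))   # ~0.02
```

The point is just that a precise-but-insensitive filter degrades gracefully as you spend more compute on generation, while an imprecise one does not.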
As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard for unassisted humans to find. This is a weaker version of the claim, because you’re just claiming that humans + AI assistance are better at evaluating alignment proposals than humans + AI assistance are at generating them. Generally I’m pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I’ve written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback