If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we’re not sure about those alignment proposals. But then you’re still susceptible to the same kinds of problems.
You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs. [proposals generated by a somewhat superhuman model trying to get a maximal score].

Yeah, that's a fair point.
I think we need to unpack “sufficiently aligned”; here’s my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a “sufficiently aligned” AI that, conditional on a proposal looking promising, is likely to be actually correct.
A system that produces a random 10000-bit string that looks promising to a human reviewer is not “sufficiently aligned”.
A system that follows the process that the most truthful possible humans use to do alignment research is sufficiently aligned (or if not, we’re doomed anyway). Truth-seeking humans doing alignment research access only a tiny part of the space of 2^200 persuasive ideas, and most of that part lies within the subset of 2^100 truthful ideas.
If the system is selecting for appearance, it needs to also have 100 bits of selection towards truth to be sufficiently aligned.
We can’t get those 100 bits through further selection for appearance. It seems plausible that we can get them somehow, though.
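A minimal sketch of the arithmetic, assuming the illustrative counts above (B = 2^200 promising-looking proposals, C = 2^100 actually-correct ones):

```python
# Illustrative counts from the comment above (log2 scale).
bits_promising = 200   # proposals that look promising to a human reviewer
bits_correct = 100     # proposals that are actually correct

# Selecting only on "looks promising" is effectively a uniform draw from the
# 2^200 promising-looking strings, so:
p_correct_given_promising = 2.0 ** (bits_correct - bits_promising)
print(p_correct_given_promising)       # 2^-100 -- vanishingly small

# Extra bits of selection pressure toward the truthful subset needed to make
# "looks promising" imply "probably correct":
print(bits_promising - bits_correct)   # ~100 bits
```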
Is your story:
1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
2. Actually, if a human were trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.
It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations “fool us” is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
Number of proposals which look good to the human
Number of proposals which look good to the human AND are actually good
Now, the key heuristic: in a high-dimensional space, adding any non-simple constraint will exponentially shrink the search space. “Number of proposals which look good to the human AND are actually good” has one more complicated constraint than “Number of proposals which look good to the human”, and will therefore be exponentially smaller.
So in “it would be much easier to trick us than to write down a good proposal”, the relevant operationalization of “easier” for this argument is “the number of proposals which both look good and are good is exponentially smaller than the number which look good”.
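A toy Monte Carlo illustrating the same heuristic, with made-up parity-check constraints standing in for “looks good” and “is good”: each independent constraint cuts the satisfying set by roughly a constant factor, so stacking constraints shrinks it exponentially.

```python
import random

random.seed(0)
DIM = 30          # bits per toy "document"
N = 200_000       # random documents sampled
K_LOOKS = 6       # constraints standing in for "looks good to the evaluator"
K_GOOD = 6        # additional constraints standing in for "actually good"

def random_checks(k):
    # each constraint: the XOR of 5 random bit positions must be 0
    # (a random document passes each one with probability ~1/2)
    return [random.sample(range(DIM), 5) for _ in range(k)]

def satisfies(doc, checks):
    return all(sum(doc[i] for i in c) % 2 == 0 for c in checks)

looks_checks, good_checks = random_checks(K_LOOKS), random_checks(K_GOOD)
docs = [[random.getrandbits(1) for _ in range(DIM)] for _ in range(N)]

looks_good = [d for d in docs if satisfies(d, looks_checks)]
also_good = [d for d in looks_good if satisfies(d, good_checks)]

print(len(looks_good) / N)                        # roughly 2^-6 of everything
print(len(also_good) / max(1, len(looks_good)))   # roughly 2^-6 of what looks good
```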
I think that argument applies just as easily to a human as to a model, doesn’t it?
So it seems like you are making an equally strong claim that “if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad.” And I think that’s kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument.
(I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI’s thought process would be very different from ours, such that it would be easier to fake a human-legible path than to truly follow one from the start.
(I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
I think “how smart the human is” is not a key consideration.
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me. If that’s a good summary of the disagreement I’m happy to just leave it there.
A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me.
Yup, that sounds like a crux. Bookmarked for later.
I strongly agree with you that it’ll eventually be very difficult for humans to tell AI-generated alignment proposals that look good but aren’t apart from ones that look good and actually are.
There is a much stronger version of the claim “alignment proposals are easier to evaluate than to generate” that I think we’re discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones, or at least not accept any bad ones (precision matters much more than recall here, since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimal viable product. Personally I think this strong version of the claim is unlikely to be true, but I’m not certain that it will be false for the first systems that can do useful alignment research.
As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard to find for unassisted humans. This is a weaker version of the claim, because you’re just claiming that humans + AI assistance are better at evaluating alignment proposals than humans + AI assistance are at generating them. Generally I’m pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I’ve written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback
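A minimal sketch of the precision-versus-recall point above, with made-up acceptance rates: low recall just costs more samples, while a high false-positive rate poisons whatever gets accepted.

```python
def accepted_pool(q, recall, fp_rate, n_samples):
    """q: probability a generated proposal is actually good.
    Returns (expected proposals accepted, fraction of accepted that are good)."""
    good_accepted = n_samples * q * recall
    bad_accepted = n_samples * (1 - q) * fp_rate
    total = good_accepted + bad_accepted
    return total, good_accepted / total

# Low recall: 90% of good proposals get thrown away, but sampling more recovers some.
print(accepted_pool(q=1e-3, recall=0.1, fp_rate=1e-6, n_samples=10**7))
# Low precision: most of what gets accepted is persuasive-but-bad, and no amount
# of extra sampling fixes the accepted pool.
print(accepted_pool(q=1e-3, recall=0.9, fp_rate=1e-2, n_samples=10**7))
```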
Is the claim here that the 2^200 “persuasive ideas” would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
Assuming humans can always find the truth eventually, the number of persuasive ideas probably shrinks as humans have more time: maybe 2^300 in a training loop, 2^250 for Paul thinking for a day, 2^200 for Paul thinking for a month, 2^150 for Paul thinking for 5 years… I think the core point still applies.

I endorse this explanation.
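Putting rough numbers on that shrinkage (illustrative only, reusing the 2^100 truthful subset from above): even the longest evaluation leaves a sizable gap that selection for persuasiveness alone can’t close.

```python
truthful_bits = 100   # log2 of the actually-correct subset, as above
persuasive_bits = {"training loop": 300, "Paul for a day": 250,
                   "Paul for a month": 200, "Paul for 5 years": 150}

for setting, bits in persuasive_bits.items():
    # extra selection toward truth still needed, beyond surviving this much scrutiny
    print(f"{setting}: ~{bits - truthful_bits} bits of truth-selection still needed")
```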
If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
But this is pretty likely to be the case, isn’t it? Actually, I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn’t help much, because you’re still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this.
But suppose we’re able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws, and so on). Well, how do we know when it’s safe to end this process and actually hit the run button?
It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them. And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them.
This is all true notwithstanding the fact that we often make mistakes. (Though as we’ve discussed before, I think that a lot of the examples you point to in cryptography are cases where there were pretty obvious gaps in formalisms or possible improvements in systems, and those would have motivated a search for better alternatives if doing so was cheap with AI labor.)
The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:
It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them.
Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former?
Then consider that we don’t actually know that AES is secure, because we don’t know all the possible attacks and we don’t know how to prove it secure; i.e., we don’t know how to recognize a good cryptosystem. Suppose one day we figure that out: wouldn’t finding an actually good cryptosystem be trivial at that point, compared to all the previous effort?
Some of your other points are valid, I think, but cryptography is just easier than alignment (don’t have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.
we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
Proposals generated by humans might contain honest mistakes, but they’re not very likely to be adversarially selected to look secure while actually not being secure.
We’re implicitly relying on the alignment of the human in our evaluation of human-generated alignment proposals, even if we couldn’t tell just from the proposals themselves which ones are safe.