This seems to completely ignore the main problem with approaches which try to outsource alignment research to AGI: optimizing for alignment strategies which look promising to a human reviewer will also automatically incentivize strategies which fool the human reviewer. Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
I think it’s very unclear how big a problem Goodhart is for alignment research—it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don’t have formal theorem statements. There are also domains where it’s not much easier, where the whole thing rests on complicated judgments where the search for clever arguments just isn’t doing much work.
It looks to me like alignment is somewhere in the middle, though it’s not at all clear—right now there are different strands of alignment progress, which seem to have very different properties with respect to the ease of evaluation.
The kind of Goodhart we are usually concerned about is stuff like “it’s easier to hijack the reward signal than to actually perform a challenging task,” and I don’t think that’s very tightly correlated with the question about alignment. So this feels like the rhetoric here involves a bit of an equivocation.
I think it’s very unclear how big a problem Goodhart is for alignment research—it seems like a question about a particular technical domain.
Just a couple weeks ago I had this post talking about how, in some technical areas, we’ve been able to find very robust formulations of particular concepts (i.e. “True Names”). The domains where evaluation is much easier—math, physics, CS—are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we’re in a domain where we don’t have a robust mathematical formulation of the phenomena of interest.
The other point of that post is that we do not currently have such formulations for the phenomena of interest in alignment, and (one way of framing) the point of foundational agency research is to find them.
So I agree that the difficulty of evaluation varies by domain, but I don’t think it’s some mysterious hard-to-predict thing. The places where robust evaluation is easy all build on qualitatively-similar foundational pieces, and alignment does not yet have those sorts of building blocks.
The kind of Goodhart we are usually concerned about is stuff like “it’s easier to hijack the reward signal than to actually perform a challenging task,” and I don’t think that’s very tightly correlated with the question about alignment. So this feels like the rhetoric here involves a bit of an equivocation.
Go take a look at that other post, it has two good examples of how Goodhart shows up as a central barrier to alignment.
I don’t buy the empirical claim about when recognition is easier than generation. As an example, I think that you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general I think “recognition is not trivial” is different from “recognition is as hard as generation.”
I found this comment pretty convincing. Alignment has been compared to philosophy, which seems to be at the opposite end of “the fuzziness spectrum” from math and physics. And it does seem like concept fuzziness would make evaluation harder.
I’ll note though that ARC’s approach to alignment seems more math-problem-flavored than yours, which might be a source of disagreement between you two (since maybe you conceptualize what it means to work on alignment differently).
If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we’re not sure about those alignment proposals. But then you’re still susceptible to the same kinds of problems.
You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score].
yeah that’s a fair point
I think we need to unpack “sufficiently aligned”; here’s my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI. The thesis of the post requires that we can make a “sufficiently aligned” AI that, conditional on a proposal looking promising, is likely to be actually correct.
A system that produces a random 10000-bit string that looks promising to a human reviewer is not “sufficiently aligned”
A system that follows the process that the most truthful possible humans use to do alignment research is sufficiently aligned (or if not, we’re doomed anyway). Truth-seeking humans doing alignment research are only accessing a tiny part of the space of 2^200 persuasive ideas, and most of it lies in the subset of 2^100 truthful ideas.
If the system is selecting for appearance, it needs to also have 100 bits of selection towards truth to be sufficiently aligned.
We can’t get those 100 bits through further selection for appearance. It seems plausible that we can get them somehow, though.
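The bit-counting above can be made explicit with a toy sketch, working in log2 space. All the exponents below are purely illustrative placeholders taken from the comment, not measured quantities:

```python
# Illustrative exponents from the comment above (log2 of set sizes):
LOG2_COHERENT = 1000      # coherent English 10000-bit strings
LOG2_LOOKS_GOOD = 200     # B: proposals that look promising to a human reviewer
LOG2_ACTUALLY_GOOD = 100  # C: proposals that are actually correct

# Selecting for appearance gets you from "coherent" down to "looks good":
bits_from_appearance = LOG2_COHERENT - LOG2_LOOKS_GOOD    # 800 bits

# ...but 100 further bits of selection-for-truth are still needed:
bits_still_needed = LOG2_LOOKS_GOOD - LOG2_ACTUALLY_GOOD  # 100 bits

# Under uniform sampling of promising-looking proposals,
# P(actually good | looks promising) = C / B = 2^-100:
p_good_given_promising = 2.0 ** (LOG2_ACTUALLY_GOOD - LOG2_LOOKS_GOOD)

print(bits_from_appearance, bits_still_needed, p_good_given_promising)
```

The point of the log2 framing is that “100 bits of selection towards truth” is exactly the gap between the exponents of B and C, however big those sets actually are.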
Is your story:
1. AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
2. Actually if a human was trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.
It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations “fool us” is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
Number of proposals which look good to the human
Number of proposals which look good to the human AND are actually good
Now, the key heuristic: in a high-dimensional space, adding any non-simple constraint will exponentially shrink the search space. “Number of proposals which look good to the human AND are actually good” has one more complicated constraint than “Number of proposals which look good to the human”, and will therefore be exponentially smaller.
So in “it would be much easier to trick us than to write down a good proposal”, the relevant operationalization of “easier” for this argument is “the number of proposals which both look good and are good is exponentially smaller than the number which look good”.
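The shrinkage heuristic is easy to see in a small toy model. In the sketch below (purely illustrative: the “constraints” are arbitrary fixed bit-patterns standing in for “looks good” and “is good,” chosen on disjoint bits so they are independent), adding a second 8-bit constraint cuts the surviving set by another factor of 2^8:

```python
# Toy model: each "constraint" fixes 8 specific bits of a 20-bit string,
# so each keeps a 2^-8 fraction of the space.
DIM = 20

def fixes_bits(x, lo, hi, pattern):
    """True iff bits [lo, hi) of x equal the given pattern."""
    width = hi - lo
    return (x >> lo) & ((1 << width) - 1) == pattern

looks_good = lambda x: fixes_bits(x, 0, 8, 0b10110011)   # arbitrary 8-bit pattern
is_good    = lambda x: fixes_bits(x, 8, 16, 0b01011100)  # another, on disjoint bits

n_looks_good = sum(looks_good(x) for x in range(2 ** DIM))
n_both       = sum(looks_good(x) and is_good(x) for x in range(2 ** DIM))

print(n_looks_good)  # 2^(20-8)  = 4096
print(n_both)        # 2^(20-16) = 16
```

In a real high-dimensional search space the same multiplication happens, just with much larger exponents; the heuristic is that any genuinely new, non-simple constraint costs its own exponential factor.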
I think that argument applies just as easily to a human as to a model, doesn’t it?
So it seems like you are making an equally strong claim that “if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad.” And I think that’s kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument.
(I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole thought process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI’s thought-process would be very different from ours, such that it would be easier to fake a human-legible path than to truly follow one from the start.
(I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
I think “how smart the human is” is not a key consideration.
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me. If that’s a good summary of the disagreement I’m happy to just leave it there.
A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me.
Yup, that sounds like a crux. Bookmarked for later.
I strongly agree with you that it’ll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren’t good from ones that look good and are actually good.
There is a much stronger version of the claim “alignment proposals are easier to evaluate than to generate” that I think we’re discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones, or at least not accept any bad ones (precision matters much more than recall here, since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimum viable product. Personally I think that this strong version of the claim is unlikely to be true, but I’m not certain that it will be false for the first systems that can do useful alignment research.
As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard to find for unassisted humans. This is a weaker version of the claim, because you’re just claiming that humans + AI assistance are better at evaluating alignment proposals than humans + AI assistance are at generating them. Generally I’m pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I’ve written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback
Is the claim here that the 2^200 “persuasive ideas” would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
Assuming humans can always find the truth eventually, the number of persuasive ideas probably shrinks as humans have more time—maybe 2^300 in a training loop, 2^250 for Paul thinking for a day, 2^200 for Paul thinking for a month, 2^150 for Paul thinking for 5 years… I think the core point still applies.
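A toy way to lay out those guesses: every number below is made up, just extending the comment’s own hypothetical exponents, and the “bits still to go” column is the remaining gap to the 2^100 truthful subset:

```python
TRUTHFUL_BITS = 100  # log2(# proposals that are actually good), as above

# log2(# ideas that would survive each level of scrutiny) -- pure guesses:
survivors = [
    ("training-loop evaluation",  300),
    ("Paul thinking for a day",   250),
    ("Paul thinking for a month", 200),
    ("Paul thinking for 5 years", 150),
]

for regime, bits in survivors:
    # bits of selection-for-truth still missing after this much scrutiny
    print(f"{regime}: 2^{bits} survivors, {bits - TRUTHFUL_BITS} bits still to go")
```

On this framing, more evaluation effort buys down the gap but (by hypothesis) never closes it, which is the sense in which the core point still applies.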
If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
But this is pretty likely the case though, isn’t it? Actually I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn’t help much because you’re still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this.
But suppose we’re able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws, and so on). Well, how do we know when it’s safe to end this process and actually hit the run button?
It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them. And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them.
This is all true notwithstanding the fact that we often make mistakes. (Though as we’ve discussed before, I think that a lot of the examples you point to in cryptography are cases where there were pretty obvious gaps in formalisms or possible improvements in systems, and those would have motivated a search for better alternatives if doing so was cheap with AI labor.)
The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:
It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them.
Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former?
Then consider that we don’t actually know that AES is secure, because we don’t know all the possible attacks and we don’t know how to prove it secure, i.e., we don’t know how to recognize a good cryptosystem. Suppose one day we figure that out, wouldn’t finding an actually good cryptosystem be trivial at that point compared to all the previous effort?
Some of your other points are valid, I think, but cryptography is just easier than alignment (don’t have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.
we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
Proposals generated by humans might contain honest mistakes, but they’re not very likely to be adversarially selected to look secure while actually not being secure.
We’re implicitly relying on the alignment of the human in our evaluation of human-generated alignment proposals, even if we couldn’t tell safe proposals apart from unsafe ones just by inspecting them.
I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don’t think you get this problem, at least anymore than we have it now—and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you’re being fooled). It’s still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
What are people’s timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?
Today’s language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising pre-requisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we’ll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I’m very curious if others disagree here though
I personally have pretty broad error bars; I think it’s plausible enough that AI won’t help with automating alignment that it’s still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed confidence that AI systems smart enough to help with alignment will also be smart enough that they’ll inevitably be trying to kill you. I don’t think he’s really explained this view, and I’ve never found it particularly compelling. I think a lot of folks around LW have absorbed a similar view; I’m not totally sure how much it comes from Eliezer, but I’d guess that’s a lot of it.
I think part of Eliezer’s views of this come from a view of intelligence and recursive self-improvement that imply that explosive recursive self-improvement begins before high object-level competence on other research tasks. I think this view is most likely mistaken, but my guess is that it’s tied up with Eliezer’s views about how to build AGI closely enough that Eliezer won’t want to defend his position here.
(My position is the very naive one, that recursive self-improvement will become critical at roughly the same time that AI systems are better than humans at contributing to further AI progress, which has roughly a 50-50 shot of happening before alignment progress.)
Beyond that, Eliezer has not said very much about where these intuitions are coming from. What he has said does not seem (to me) to have fared particularly well over the last few years. For example:
Similar remarks apply to interpreting and answering “What will be its effect on _?” It turns out that getting an AI to understand human language is a very hard problem, and it may very well be that even though talking doesn’t feel like having a utility function, our brains are using consequential reasoning to do it. Certainly, when I write language, that feels like I’m being deliberate. It’s also worth noting that “What is the effect on X?” really means “What are the effects I care about on X?” and that there’s a large understanding-the-human’s-utility-function problem here.
In fact it does not seem hard to get AI systems to understand the relevant parts of human language (relative to being able to easily kill all humans or to inevitably be trying to kill all humans). And it does not seem hard to get an AI to predict which things you will judge to be relevant, well enough that this is a very bad way of explaining why Holden’s proposal would fail.
Of course getting an AI to tell you what it’s really thinking may be hard (and indeed I think it’s hard enough that I think there’s a significant probability that we will all die because we failed to solve it). And I think Eliezer even has a fair model of why it’s hard (or at least I’ve often defended him based on a more charitable reading of his overall views).
But my point is that to the extent Eliezer has explained why he thinks AI won’t be helpful until it’s too late, so far it doesn’t seem like adjacent intuitions have stood the test of time well.
Today’s language models are on track to become quite useful, without showing signs of deceptive misalignment...
facepalm
It won’t show signs of deceptive alignment. The entire point of “deception” is not showing signs. Unless it’s just really incompetent at deception, there won’t be signs; the lack of signs is not significant evidence of a lack of deception.
There may be other reasons to think our models are not yet deceptive to any significant extent (I certainly don’t think they are), but the lack of signs of deception is not one of them.
I understand that deceptive models won’t show signs of deception :) That’s why I made the remark about models not showing signs of the prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or any of its prerequisites, in any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)
Not sure I buy this – I have a model of how hard it is to be deceptive, and how competent our current ML systems are, and it looks like it’s more like “as competent as a deceptive four-year old” (my parents totally caught me when I told my first lie), than “as competent as a silver-tongued sociopath playing a long game.”
I do expect there to be signs of deceptive alignment, in a noticeable fashion before we get so-deceptive-we-don’t-notice deception.
That falls squarely under the “other reasons to think our models are not yet deceptive”—i.e. we have priors that we’ll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
A model which is just predicting the next word isn’t optimizing for strategies which look good to a human reviewer; it’s optimizing for truth itself (as contained in its training data). If you begin re-feeding its outputs as training inputs then there could be a feedback loop leading to such incentives, but if the model is general and sufficiently intelligent, you don’t need to do that. You can train it in a different domain and it will generalize to your domain of interest.
Even if you do that, you can try to make the new data grounded in reality in some way, like including experiment results. And the model won’t just absorb the new data as truth; it will incorporate it into its world model to make better predictions. If it’s fed a bunch of new alignment forum posts that are bad ideas which look good to humans, it will just predict that the alignment forum produces that kind of post, but that doesn’t mean there isn’t some prompt that can make it output what it actually thinks is correct.
This seems to completely ignore the main problem with approaches which try to outsource alignment research to AGI: optimizing for alignment strategies which look promising to a human reviewer will also automatically incentivize strategies which fool the human reviewer. Evaluation is not actually easier than generation, when Goodhart is the main problem to begin with.
I think it’s very unclear how big a problem Goodhart is for alignment research—it seems like a question about a particular technical domain. There are domains where evaluation is much easier; most obviously mathematics, but also in e.g. physics or computer science, there are massive gaps between recognition and generation even if you don’t have formal theorem statements. There are also domains where it’s not much easier, where the whole thing rests on complicated judgments where the search for clever arguments just isn’t doing much work.
It looks to me like alignment is somewhere in the middle, though it’s not at all clear—right now there are different strands of alignment progress, which seem to have very different properties with respect to the ease of evaluation.
The kind of Goodhart we are usually concerned about is stuff like “it’s easier to hijack the reward signal than to actually perform a challenging task,” and I don’t think that’s very tightly correlated with the question about alignment. So this feels like the rhetoric here involves a bit of an equivocation.
Just a couple weeks ago I had this post talking about how, in some technical areas, we’ve been able to find very robust formulations of particular concepts (i.e. “True Names”). The domains where evaluation is much easier—math, physics, CS—are the domains where we have those robust formulations. Even within e.g. physics, evaluation stops being easy when we’re in a domain where we don’t have a robust mathematical formulation of the phenomena of interest.
The other point of that post is that we do not currently have such formulations for the phenomena of interest in alignment, and (one way of framing) the point of foundational agency research is to find them.
So I agree that the difficulty of evaluation varies by domain, but I don’t think it’s some mysterious hard-to-predict thing. The places where robust evaluation is easy all build on qualitatively-similar foundational pieces, and alignment does not yet have those sorts of building blocks.
Go take a look at that other post, it has two good examples of how Goodhart shows up as a central barrier to alignment.
I don’t buy the empirical claim about when recognition is easier than generation. As an example, I think that you can recognize robust formulations much more easily than you can generate them in math, computer science, and physics. In general I think “recognition is not trivial” is different from “recognition is as hard as generation.”
I found this comment pretty convincing. Alignment has been compared to philosophy, which seems at the opposite end of “the fuzziness spectrum” as math and physics. And it does seem like concept fuzziness would make evaluation harder.
I’ll note though that ARC’s approach to alignment seems more math-problem-flavored than yours, which might be a source of disagreement between you two (since maybe you conceptualize what it means to work on alignment differently).
If it turns out that evaluation of alignment proposals is not easier than generation, we’re in pretty big trouble because we’ll struggle to convince others that any good alignment proposals humans come up with are worth implementing.
You could still argue by generalization that we should use alignment proposals produced by humans who had a lot of good proposals on other problems even if we’re not sure about those alignment proposals. But then you’re still susceptible to the same kinds of problems.
You might think that humans are more robust on the distribution of [proposals generated by humans trying to solve alignment] vs [proposals generated by a somewhat superhuman model trying to get a maximal score]
yeah that’s a fair point
I think we need to unpack “sufficiently aligned”; here’s my attempt. There are A=2^10000 10000-bit strings. Maybe 2^1000 of them are coherent English text, and B=2^200 of these are alignment proposals that look promising to a human reviewer, and C=2^100 of them are actually correct and will result in aligned AI.The thesis of the post requires that we can make a “sufficiently aligned” AI that, conditional on a proposal looking promising, is likely to be actually correct.
A system that produces a random 10000-bit string that looks promising to a human reviewer is not “sufficiently aligned”
A system that follows the process that the most truthful possible humans use to do alignment research is sufficiently aligned (or if not, we’re doomed anyway). Truth-seeking humans doing alignment research are only accessing a tiny part of the space of 2^200 persuasive ideas, and most of this is in the subset of 2^100 truthful ideas
If the system is selecting for appearance, it needs to also have 100 bits of selection towards truth to be sufficiently aligned.
We can’t get those 100 bits through further selection for appearance. It seems plausible that we can get them somehow, though.
Is your story:
AI systems are likely to be much better at persuasion than humans, relative to how good they are at alignment.
Actually if a human was trying to write down a convincing alignment proposal, it would be much easier to trick us than to write down a good proposal.
It sounds like you are thinking of 2. But I think we have reasonably good intuitions about that. I think for short evaluations “fool us” is obviously easier. For long evaluations (including similarly-informed critics pointing out holes etc.) I think that it rapidly becomes easier to just do good work (though it clearly depends on the kind of work).
Consider the space of 10-page google docs. Within this space, we pick out all the google docs which some human evaluator would consider a good alignment proposal. (You can imagine the human is assisted in some way if you want, it makes little difference to this particular argument.) Then the question is, what fraction of these will actually be good alignment proposals? So, we have two relevant numbers:
Number of proposals which look good to the human
Number of proposals which look good to the human AND are actually good
Now, the key heuristic: in a high-dimensional space, adding any non-simple constraint will exponentially shrink the search space. “Number of proposals which look good to the human AND are actually good” has one more complicated constraint than “Number of proposals which look good to the human”, and will therefore be exponentially smaller.
So in “it would be much easier to trick us than to write down a good proposal”, the relevant operationalization of “easier” for this argument is “the number of proposals which both look good and are good is exponentially smaller than the number which look good”.
I think that argument applies just as easily to a human as to a model, doesn’t it?
So it seems like you are making an equally strong claim that “if a human tries to write down something that looks like good alignment work almost all of it will be persuasive but bad.” And I think that’s kind of true and kind of not true. In general I think you can get much better estimates by thinking about delegating to sociopathic humans (or to humans with slightly different comparative advantages) than trying to make a counting argument.
(I think the fact that “how smart the human is” doesn’t matter mostly just proves that the counting argument is untethered from the key considerations.)
A human writing their own alignment proposal has introspective access to the process-which-generates-the-proposal, and can get a ton of bits from that. They can trust the process, rather than just the output.
A human who is good at making their own thinking process legible to others, coupled with an audience who knows to look for that, could get similar benefits in a more distributed manner. Faking a whole though process is more difficult, for a human, than simply faking an output. That does not apply nearly as well to an AI; it is far more likely that the AI’s thought-process would be very different from ours, such that it would be easier to fake a human-legible path than to truly follow one from the start.
I think “how smart the human is” is not a key consideration.
I think how well we can evaluate claims and arguments about AI alignment absolutely determines whether delegating alignment to machines is easier than doing alignment ourselves. A heuristic argument that says “evaluation isn’t easier than generation, and that claim is true regardless of how good you are at evaluation until you get basically perfect at it” seems obviously wrong to me. If that’s a good summary of the disagreement I’m happy to just leave it there.
Yup, that sounds like a crux. Bookmarked for later.
I strongly agree with you that it’ll eventually be very difficult for humans to tell apart AI-generated alignment proposals that look good and aren’t good from ones that look good and are actually good.
There is a much stronger version of the claim "alignment proposals are easier to evaluate than to generate" that I think we're discussing in this thread, where you claim that humans will be able to tell all good alignment proposals apart from bad ones, or at least not accept any bad ones (precision matters much more than recall here, since you can compensate for bad recall with compute). If this strong claim is true, then conceptually RLHF/reward modeling should be sufficient as an alignment technique for the minimal viable product. Personally I think that this strong version of the claim is unlikely to be true, but I'm not certain that it will be false for the first systems that can do useful alignment research.
As William points out below, if we get AI-assisted human evaluation to work well, then we can uncover flaws in alignment proposals that are too hard to find for unassisted humans. This is a weaker version of the claim, because you're just claiming that humans + AI assistance are better at evaluating alignment proposals than humans + AI assistance are at generating them. Generally I'm pretty optimistic about that level of supervision actually allowing us to supervise superhuman alignment research; I've written more about this here: https://aligned.substack.com/p/ai-assisted-human-feedback
Is the claim here that the 2^200 “persuasive ideas” would actually pass the scrutiny of top human researchers (for example, Paul Christiano studies one of them for a week and concludes that it is probably a full solution)? Or do you just mean that they would look promising in a shorter evaluation done for training purposes?
Assuming humans can always find the truth eventually, the number of persuasive ideas probably shrinks as humans have more time—maybe 2^300 in a training loop, 2^250 for Paul thinking for a day, 2^200 for Paul thinking for a month, 2^150 for Paul thinking for 5 years… I think the core point still applies.
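Purely to make the counting argument above concrete (the 2^N figures are the comment's own hypothetical numbers, and the 2^50 count of genuinely good proposals is an invented assumption for illustration), the point is that even the most heavily scrutinized pool of persuasive ideas remains astronomically larger than the pool of good ones:

```python
# Hypothetical counts of "persuasive" proposals surviving each level of
# scrutiny, taken from the comment above. The "good" count is an invented
# stand-in; the conclusion is insensitive to its exact value.
persuasive = {
    "training loop":      2**300,
    "Paul, one day":      2**250,
    "Paul, one month":    2**200,
    "Paul, five years":   2**150,
}
good = 2**50  # assumed number of proposals that are persuasive AND good

for label, n in persuasive.items():
    # Fraction of survivors that are actually good, as a power of two.
    exponent = n.bit_length() - good.bit_length()
    print(f"{label}: ~2^-{exponent} of survivors are good")
```

Under these (made-up) numbers, even five years of scrutiny leaves a good-to-persuasive ratio of about 2^-100, so a uniformly sampled survivor is still almost certainly bad; more scrutiny shrinks the pool without changing that qualitative conclusion.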
I endorse this explanation.
But this is pretty likely the case, isn't it? Actually, I think by default the situation will be the opposite: it will be too easy to convince others that some alignment proposal is worth implementing, because humans are in general too easily convinced by informal arguments that look good but contain hidden flaws (and formalizing the arguments is both very difficult and doesn't help much, because you're still depending on informal arguments for why the formalized theoretical concepts correspond well enough to the pre-theoretical concepts that we actually care about). Look at the history of philosophy, or cryptography, if you doubt this.
But suppose we're able to convince people to distrust their intuitive sense of how good an argument is, and to keep looking for hidden flaws and counterarguments (which might have their own hidden flaws, and so on). Well, how do we know when it's safe to end this process and actually hit the run button?
It feels to me like there’s basically no question that recognizing good cryptosystems is easier than generating them. And recognizing attacks on cryptosystems is easier than coming up with attacks (even if they work by exploiting holes in the formalisms). And recognizing good abstract arguments for why formalisms are inadequate is easier than generating them. And recognizing good formalisms is easier than generating them.
This is all true notwithstanding the fact that we often make mistakes. (Though as we’ve discussed before, I think that a lot of the examples you point to in cryptography are cases where there were pretty obvious gaps in formalisms or possible improvements in systems, and those would have motivated a search for better alternatives if doing so was cheap with AI labor.)
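The recognition/generation asymmetry claimed here has a crisp toy analogue (not the commenter's example, just a sketch): checking a claimed break of a system is often trivial, while finding one is expensive. Here integer factoring stands in for "attack generation" and a single multiplication for "attack recognition":

```python
import math

def verify_factorization(n, p, q):
    # Recognition: checking a claimed "attack" is one multiplication.
    return 1 < p < n and 1 < q < n and p * q == n

def find_factorization(n):
    # Generation: naive trial division, exponential in the bit-length of n.
    for p in range(2, math.isqrt(n) + 1):
        if n % p == 0:
            return p, n // p
    return None  # n is prime

# A small semiprime standing in for an RSA-style modulus.
n = 999983 * 1000003
p, q = find_factorization(n)   # the slow direction: ~10^6 trial divisions
assert verify_factorization(n, p, q)  # the fast direction: instant
```

The gap only widens at realistic key sizes, which is the sense in which recognizing attacks (and, with more caveats, recognizing good cryptosystems) is structurally easier than generating them.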
The example of cryptography was mainly intended to make the point that humans are by default too credulous when it comes to informal arguments. But consider your statement:
Consider some cryptosystem widely considered to be secure, like AES. How much time did humanity spend on learning / figuring out how to recognize good cryptosystems (e.g. finding all the attacks one has to worry about, like differential cryptanalysis), versus specifically generating AES with the background knowledge in mind? Maybe the latter is on the order of 10% of the former?
Then consider that we don't actually know that AES is secure, because we don't know all the possible attacks and we don't know how to prove it secure; in other words, we don't know how to recognize a good cryptosystem. Suppose one day we figure that out; wouldn't finding an actually good cryptosystem be trivial at that point, compared to all the previous effort?
Some of your other points are valid, I think, but cryptography is just easier than alignment (don’t have time to say more as my flight is about to take off), and philosophy is perhaps a better analogy for the more general point.
Proposals generated by humans might contain honest mistakes, but they’re not very likely to be adversarially selected to look secure while actually not being secure.
We're implicitly relying on the alignment of the human in our evaluation of human-generated alignment proposals, even if we couldn't tell the difference between proposals that are safe and ones that merely look safe.
I think this concern is only relevant if your strategy is to do RL on human evaluations of alignment research. If instead you just imitate the distribution of current alignment research, I don’t think you get this problem, at least anymore than we have it now—and I think you can still substantially accelerate alignment research with just imitation. Of course, you still have inner alignment issues, but from an outer alignment perspective I think imitation of human alignment research is a pretty good thing to try.
Evaluation assistance as mentioned in the post on AI-assisted human feedback could help people avoid being fooled (e.g. in debate where the opponent can point out how you’re being fooled). It’s still an open question how well that will work in practice and how quickly it will Goodhart (these techniques should fail on some things, as discussed in the ELK report), but it seems possible that models will be helpful enough on alignment before they Goodhart.
What are people’s timelines for deceptive alignment failures arising in models, relative to AI-based alignment research being useful?
Today's language models are on track to become quite useful, without showing signs of deceptive misalignment or its eyebrow-raising prerequisites (e.g., awareness of the training procedure), afaik. So my current best guess is that we'll be able to get useful alignment work from superhuman sub-deception agents for 5-10+ years or so. I'm very curious whether others disagree here, though.
I personally have pretty broad error bars; I think it’s plausible enough that AI won’t help with automating alignment that it’s still valuable for us to work on alignment, and plausible enough that AI will help with automating alignment that it significantly increases our chances of survival and is worth preparing for making use of. I also tend to think that current progress in language modeling seems to suggest that models will reach the point of being extremely helpful with alignment way before they become super scary.
Eliezer has consistently expressed confidence that AI systems smart enough to help with alignment will also be smart enough that they'll inevitably be trying to kill you. I don't think he's really explained this view, and I've never found it particularly compelling. I think a lot of folks around LW have absorbed a similar view; I'm not totally sure how much it comes from Eliezer, but I'd guess that's a lot of it.
I think part of Eliezer’s views of this come from a view of intelligence and recursive self-improvement that imply that explosive recursive self-improvement begins before high object-level competence on other research tasks. I think this view is most likely mistaken, but my guess is that it’s tied up with Eliezer’s views about how to build AGI closely enough that Eliezer won’t want to defend his position here.
(My position is the very naive one, that recursive self-improvement will become critical at roughly the same time that AI systems are better than humans at contributing to further AI progress, which has roughly a 50-50 shot of happening before alignment progress.)
Beyond that, Eliezer has not said very much about where these intuitions are coming from. What he has said does not seem (to me) to have fared particularly well over the last few years. For example:
In fact it does not seem hard to get AI systems to understand the relevant parts of human language (relative to being able to easily kill all humans or to inevitably be trying to kill all humans). And it does not seem hard to get an AI to predict which things you will judge to be relevant, well enough that this is a very bad way of explaining why Holden’s proposal would fail.
Of course getting an AI to tell you what it’s really thinking may be hard (and indeed I think it’s hard enough that I think there’s a significant probability that we will all die because we failed to solve it). And I think Eliezer even has a fair model of why it’s hard (or at least I’ve often defended him based on a more charitable reading of his overall views).
But my point is that to the extent Eliezer has explained why he thinks AI won’t be helpful until it’s too late, so far it doesn’t seem like adjacent intuitions have stood the test of time well.
Your link redirects back to this page. The quote is from one of Eliezer’s comments in Reply to Holden on Tool AI.
Thanks, fixed.
facepalm
It won’t show signs of deceptive alignment. The entire point of “deception” is not showing signs. Unless it’s just really incompetent at deception, there won’t be signs; the lack of signs is not significant evidence of a lack of deception.
There may be other reasons to think our models are not yet deceptive to any significant extent (I certainly don’t think they are), but the lack of signs of deception is not one of them.
I understand that deceptive models won’t show signs of deception :) That’s why I made the remark of models not showing signs of prerequisites to scary kinds of deception. Unless you think there are going to be no signs of deception or any prerequisites, for any models before we get deceptive ones?
It also seems at least plausible that models will be imperfectly deceptive before they are perfectly deceptive, in which case we will see signs (e.g., in smaller models)
Not sure I buy this – I have a model of how hard it is to be deceptive, and how competent our current ML systems are, and it looks like it’s more like “as competent as a deceptive four-year old” (my parents totally caught me when I told my first lie), than “as competent as a silver-tongued sociopath playing a long game.”
I do expect there to be signs of deceptive alignment, in a noticeable fashion before we get so-deceptive-we-don’t-notice deception.
That falls squarely under the “other reasons to think our models are not yet deceptive”—i.e. we have priors that we’ll see models which are bad at deception before models become good at deception. The important evidential work there is being done by the prior.
A model which is just predicting the next word isn't optimizing for strategies which look good to a human reviewer; it's optimizing for truth itself (as contained in its training data). If you begin re-feeding its outputs as training inputs, then there could be a feedback loop leading to such incentives, but if the model is general and sufficiently intelligent, you don't need to do that. You can train it in a different domain and it will generalize to your domain of interest.
Even if you do that, you can try to make the new data grounded in reality in some way, like including experiment results. And the model won't just absorb the new data as truth; it will incorporate it into its world model to make better predictions. If it's fed a bunch of new Alignment Forum posts that are bad ideas which look good to humans, it will just predict that the Alignment Forum produces that kind of post, but that doesn't mean there isn't some prompt that can make it output what it actually thinks is correct.