Eliezer appears to expect AI systems to perform extremely fast recursive self-improvement before those systems are able to make superhuman contributions to other domains (including alignment research), but I think this expectation is mostly unjustified. If Eliezer doesn’t believe this, then his arguments about the alignment problem that humans need to solve appear to be wrong.
My understanding of Eliezer’s view is that some domains are much harder to do aligned cognition in than others, and alignment is among the hardest.
(I’m not sure I clearly understand why. Maybe because it entails reasoning about humans and reasoning about building super powerful systems, so if your R&D SEAI is even a little bit unaligned, it will have ample leverage for seizing power?)
It’s not so much that AIs will be able to do recursive self-improvement before they’re able to solve alignment. It’s that making alignment progress is itself heavily alignment-loaded, in a way that “recursively self-improve” (without regard for alignment) isn’t.
I agree that Eliezer holds that view, and I also disagree with it (I think this is the consensus view around LW but haven’t seen anything I found persuasive as a defense). I don’t think that’s his whole view, since he frequently talks about AI doing explosive improvement before other big scientific changes, and generally seems to be living in a world where this is an obvious and unstated assumption behind many of the other things he says.
I think this is the consensus view around LW [that AI can’t help with alignment research] but haven’t seen anything I found persuasive as a defense
I thought it was an argument from inaccessible information: we know how to specify rewards for “Win a Go game”, “Predict the next token”, or “A human approved this output”; we don’t know how to specify rewards for “Actually good alignment research”.
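To make that contrast concrete, here’s a toy sketch (my own illustration; all of the function names are made up): the first three reward signals can be written down directly as code, while the last one has no ground-truth evaluator that anyone knows how to implement.

```python
# Toy illustration (not from this thread) of which rewards we can and can't specify.

def go_reward(game_result: str) -> float:
    """'Win a Go game': the environment reports the outcome directly."""
    return 1.0 if game_result == "win" else 0.0

def next_token_reward(logprob_of_observed_token: float) -> float:
    """'Predict the next token': the log-likelihood of the observed token
    is a fully specified training signal."""
    return logprob_of_observed_token

def human_approval_reward(human_clicked_approve: bool) -> float:
    """'A human approved this output': cheap to collect, but only as reliable
    as the human's judgment."""
    return 1.0 if human_clicked_approve else 0.0

def alignment_research_reward(research_writeup: str) -> float:
    """'Actually good alignment research': no known procedure maps an output
    to its true value; this is the inaccessible-information worry."""
    raise NotImplementedError("no known ground-truth evaluator for this task")
```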
I’m imagining that the counterargument is that earlier, weaker alignment techniques (and the generation/verification gap) might be enough to bootstrap later, more automated alignment techniques?
Yeah, I don’t find “we can’t verify good alignment research” nearly as persuasive as other people around here:
Verification does seem way easier, even for alignment research. This is probably the most interesting and perplexing disagreement.
Even if verification isn’t easier than generation, you can still just do what a human would do, but faster. That seems like a big deal, and it’s quite a lot of what early AI systems will be doing. Focusing only on generation vs. verification seems to radically understate the case.
AI systems can also help with verification, e.g. noticing problems in possible ideas, generating experimental setups in which to evaluate ideas, and so forth. These tasks don’t seem especially hard to verify either.
You could imagine training an ML system end-to-end on “make the next ML system smarter.” But that’s not how things appear to be going (and it’s just really hard to do with gradient descent). Instead, it looks like ML systems will mostly be doing things more like “solve subtasks identified by other humans or AIs” (just like most humans who work on these things). In this regime, having a reward function for the end result isn’t that important.
To the extent we can’t recognize good alignment research when we see it, I think that also makes humans’ alignment research less efficient, and so the comparative advantage question is less obvious than the absolute difficulty question.
I think that “we can’t verify good alignment research” is probably a smaller consideration than “alignment is more labor intensive while capabilities research is more capital intensive.” Neither is decisive, and I expect other factors will mostly dominate (like changes in allocation of labor).
This isn’t to say I think it’s easy to get AI systems to solve alignment for you, such that it doesn’t matter if you work on it in advance. But I’m not yet persuaded at all by “AI systems will be crazy superhuman before they make big contributions in alignment,” and don’t think that the LW community should particularly expect other folks to be persuaded either.
we know how to specify rewards for… “A human approved this output”; we don’t know how to specify rewards for “Actually good alignment research”.
Can’t these be the same thing? If we have humans who can identify actually good alignment research, we can sit them down in the RLHF booth and have the AI try to figure out how to make them happy.
Now obviously a sufficiently clever AI will infer the existence of the RLHF booth and start hacking the human in order to escape its box, which would be bad for alignment research. But it’s looking increasingly plausible that e.g. GPT-6 will be smart enough to provide actually good mathematical research without being smart enough to take over the world (that doesn’t happen until GPT-8). So why not alignment research?
To break the comparison, I think you need to posit either that alignment research is way harder than math research (as Eli understands Eliezer to believe), such that anything smart enough to do it is also smart enough to hack a human, or, I suppose, that we don’t have humans who can identify actually good alignment research.
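To spell out the booth mechanically, here’s a toy sketch (entirely my own, with made-up stand-ins for the real components): fit a reward model to the approvals of humans who can supposedly recognize good alignment research, then rank candidate outputs by that model.

```python
# Toy sketch of the "RLHF booth": learn a reward model from human
# approve/disapprove labels, then use it to rank candidate outputs.
# All data structures and numbers here are made up for illustration.

import math
from dataclasses import dataclass

@dataclass
class LabeledOutput:
    features: list[float]  # toy stand-in for an embedding of the AI's output
    approved: bool         # did the human in the booth approve it?

def train_reward_model(data: list[LabeledOutput], lr: float = 0.1,
                       steps: int = 1000) -> list[float]:
    """Fit a linear reward model by logistic regression on the approval labels."""
    dim = len(data[0].features)
    w = [0.0] * dim
    for _ in range(steps):
        for ex in data:
            score = sum(wi * xi for wi, xi in zip(w, ex.features))
            p = 1.0 / (1.0 + math.exp(-score))       # predicted approval probability
            err = (1.0 if ex.approved else 0.0) - p  # gradient of the log-likelihood
            w = [wi + lr * err * xi for wi, xi in zip(w, ex.features)]
    return w

def best_candidate(w: list[float], candidates: list[list[float]]) -> int:
    """Return the index of the candidate the learned reward model scores highest."""
    scores = [sum(wi * xi for wi, xi in zip(w, c)) for c in candidates]
    return scores.index(max(scores))
```

The scheme only bootstraps if “approved” tracks research that is actually good; if it merely tracks “looks persuasive to the rater,” then that is what gets optimized, which is the booth-hacking worry restated.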