Here’s an idea that is in its infancy and seems related (at least my version of it is in its infancy; others may have fleshed it out, and links are appreciated). It’s speculative and only roughly written up:
Say I believe that language models will accelerate research in the lead-up to AGI. (likely assumption)
Say I think that AI systems will be able to automate most of the research process before we get AGI (though at this point we might stop and consider if we’re moving the goalpost). This seems to be an assumption in OpenAI’s alignment plan, though I think it’s unclear how AGI fits into their timetable. (less likely assumption, still plausible)
Given the above, our chances of getting a successfully aligned AGI may depend on what proportion of automated AI scientists are working on AI Alignment vs. work that moves us closer to AGI. There’s some extreme version of this where the year before AGI sees more scientific research than all of prior human history, or something similarly wild.
Defining “Endgame” in this scenario seems hard, but maybe Endgame is “AIs can do 50% of the (2022) research work”. In this scenario, those who care about AI Safety might reasonably want to start the Endgame sooner if they think doing so increases the likelihood of enough automated Alignment research happening.
For example: in the current climate, maybe 3 actors have access to the most powerful models (OpenAI, DeepMind, Anthropic). Given that Anthropic is largely focused on AI Safety, we might naively expect ~⅓ of the automated research to be AI Safety work if the Endgame started tomorrow (depending on the beliefs of decision makers at OpenAI and DeepMind it could be much higher, and in the actual current world I think it would be). But say we wait 2 years before the Endgame starts, and by then 2 other big tech labs and 3 startups have similarly powerful models (and don’t care much about Safety); now the fraction of automated research that is Alignment is only ~⅛.
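As a very rough sketch of the arithmetic above (a toy model that assumes every actor with frontier models gets an equal share of the automated research and that only the safety-focused actor directs its share at Alignment; the actor counts are the hypotheticals from the paragraph above):

```python
def alignment_fraction(num_actors, num_safety_focused=1):
    """Toy model: each actor running frontier models gets an equal share of
    the automated research, and only safety-focused actors spend their share
    on Alignment."""
    return num_safety_focused / num_actors

# Endgame starts tomorrow: OpenAI, DeepMind, Anthropic have the strongest models.
print(alignment_fraction(3))          # ~0.33, i.e. ~1/3

# Endgame starts in 2 years: 2 more big tech labs and 3 startups have caught up,
# none of them particularly safety-focused.
print(alignment_fraction(3 + 2 + 3))  # 0.125, i.e. 1/8
```

Obviously compute, model quality, and willingness to point automated researchers at Alignment won’t be uniform across actors; this is just the back-of-the-envelope version of the ⅓ vs. ⅛ comparison.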
In the scenario I’ve outlined above, it is not obvious that the Endgame coming sooner is worse. There are quite a few assumptions that I think are doing a lot of work in this model:
These pre-AGI AIs can do useful alignment research, or significantly speed it up.
I think this is one of the bigger cruxes here: coming up with useful, novel, theoretical insights to solve, for instance, deception seems just really hard and will likely require incredibly smart models. Folks will disagree about which alignment research is most important, and some of this will be affected by motivated reasoning around what kind of research you can automate easily (Jan Leike’s plan here worries me because it feels a lot like “we need to be able to automate alignment research just as effectively as ML research so that we can ask for more compute for alignment”, which seems likely to lead to automating easy-to-automate-but-not-central alignment research; it worries me because I expect the plan to fail; see also Nate Soares).
Research done by AIs in the years leading up to AGI will represent a large chunk of the alignment research that happens, such that it is worth losing some human-research-time in order to make this shift come sooner (maybe for this argument to go through it only needs to be the case that the ratio of (good) Alignment to Capabilities research is higher once research is automated; a rough version of this condition is sketched after this list).
These pre-AGI AIs are sufficiently aligned: we have a fair amount of confidence in our language models’ ability to help with research without causing catastrophic failures, i.e., they are either not trying to overthrow humanity or are unable to do so, including through research outputs like plans for future AI systems.
There are some actions folks in the AI Safety space could take which would make the move to this Endgame happen sooner. Examples might include: Anthropic scaling language models (and trying to keep them private), developing AI tools to speed up research (Elicit),
The list goes on, but I’m tired.
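One way to write down the weaker version of the ratio claim in the second assumption above (notation is mine, not from any particular source: A and C are the rates of good Alignment and Capabilities research, with “auto” for the automated era and “human” for the pre-automation era):

```latex
% Rough sketch of the condition under which pulling the automated era
% forward helps, even at the cost of some human research time:
\frac{A_{\mathrm{auto}}}{C_{\mathrm{auto}}} > \frac{A_{\mathrm{human}}}{C_{\mathrm{human}}}
```

If that inequality holds, starting the Endgame sooner shifts the overall research mix toward Alignment, which is the direction this whole argument needs.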