I got a bit lost in understanding your exit plan. You write:
My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them
Some questions about this and the text that comes after it:
How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you’re describing here. What techniques are you imagining using?
Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control “human-obsoleting” AIs?
Why do you “expect that ruling out egregious misalignment is the hardest part in practice”? That seems pretty counterintuitive to me. It’s easy to imagine descendants of today’s models that don’t do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn’t be “philosophically competent”.
What are you buying time to do? I don’t understand how you’re proposing spending the “3 years of time prior to needing to build substantially superhuman AIs”. Is it on alignment for those superhuman AIs?
You mention having 3 years, but then you say “More generally, it just seems really heuristically scary to very quickly go from AIs which aren’t much smarter than the best humans to AIs which are wildly smarter in only a few years.” I found this confusing.
What do you mean by “a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point.” It seems easier to mitigate which risks prior to what point? And why? I didn’t follow this.
How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you’re describing here. What techniques are you imagining using?
I would say a mixture of moonshots and “doing huge amounts of science”. Honestly, we don’t have amazing proposals here, so the main plan is to just do huge amounts of R&D with our AIs. I have some specific proposals, but they aren’t amazing.
I agree this is unsatisfying, though we do have some idea of how this could work. (Edit: and I plan on writing some of this up later.)
I agree this is a weak point of this proposal, though notably, it isn’t as though most realistic proposals avoid a hole at least this large. : (
Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
We could hit “just accelerating (and not egregiously misaligned)” at an earlier point, but I think it’s nice to explicitly talk about the end state. And I think this is a good end state to aim for, as it allows for approximately full retirement of human technical work. It also allows for (e.g.) running a whole AI research program for the equivalent of thousands of subjective years, because these AIs don’t need human help to function. I think this probably isn’t needed (we probably need much less time), but it’s a nice option to have.
Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control “human-obsoleting” AIs?
I’m imagining that we’d already be controlling AIs as capable as the final AIs we target; the change is just that we should actually be willing to defer to them. So, I think control should be doable for AIs of that capability level.
Further, I’d like to note that I’m not necessarily imagining that we have to go for superintelligence as opposed to other next objectives. The AIs can figure this out.
Here’s one possible progression (a toy encoding of it is sketched after the list):
1. AIs that substantially accelerate R&D; we use control.
2. AIs that seem as smart as top human experts; we use control.
3. AIs that seem as smart as top human experts, but we’ve ruled out egregious misalignment and could back off of control.
4. AIs that seem as smart as top human experts, and we are justifiably willing to defer to them on everything.
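As a toy way of making the progression concrete, here is a minimal sketch; the stage descriptions and regimes just restate the list above and don’t correspond to any real system.

```python
# Purely illustrative restatement of the progression above as data; all names
# here are hypothetical placeholders, not part of any real codebase.
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    capability: str  # rough capability level of the AIs at this stage
    regime: str      # how we relate to the AIs at this stage

PROGRESSION = [
    Stage("substantially accelerates R&D", "control"),
    Stage("as smart as top human experts", "control"),
    Stage("as smart as top human experts, egregious misalignment ruled out", "control, but we could back off"),
    Stage("as smart as top human experts, justifiably trusted", "defer on everything"),
]

for i, stage in enumerate(PROGRESSION, start=1):
    print(f"{i}. {stage.capability} -> {stage.regime}")
```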
Why do you “expect that ruling out egregious misalignment is the hardest part in practice”? That seems pretty counterintuitive to me. It’s easy to imagine descendants of today’s models that don’t do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn’t be “philosophically competent”.
I’m not imagining these AIs being our successors which do what they want—the goal is more that they do a good job of obeying instructions in a reasonable and high-integrity way.
I don’t think the exact values of the AIs matter much if we ensure they do a good job of obeying instructions in a reasonable way.
And note that if we avoid egregious misalignment, then what we see in evaluations and training is roughly what we’ll get: the AIs won’t be conspiring against our testing.
When I say “philosophically competent”, I mean good at thinking through tricky and more philosophical questions. This requires AIs to understand our values, but not necessarily to share them. (E.g. we might prefer our AIs to be more myopic / deontologically non-powerseeking in various ways.)
As for why I think this is the hardest part: I think it might suffice to do the following:
1. Rule out egregious misalignment.
2. Test whether your AIs do good things in normal cases you understand (which should be most cases, as these AIs aren’t qualitatively smarter than humans).
3. Check whether AIs are good at tricky, hard-to-check, philosophy-style problems by creating test beds that we happen to be able to check well for various reasons (e.g. string theory).
4. If checks indicate issues, iterate on our process for creating AIs. Also do some science on when these sorts of checks might end up overfit, and take steps to avoid that. (A toy sketch of this checking loop is below.)
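To make the checking loop above slightly more concrete, here is a minimal toy sketch. It assumes we can get numeric scores out of the normal-case evals and the checkable test beds; all function names, thresholds, and the iteration cap are hypothetical placeholders rather than anything that exists today.

```python
# Toy sketch of the "check, then iterate on the process" loop. Everything here
# (training, scoring, thresholds) is a made-up placeholder for illustration.
import random
from dataclasses import dataclass

@dataclass
class Report:
    normal_case_score: float   # tasks humans can directly judge
    hard_testbed_score: float  # hard-but-checkable problems (e.g. the string theory example)

def train_candidate(version: int) -> str:
    """Placeholder for one iteration of the process for creating AIs."""
    return f"candidate-v{version}"

def score_normal_cases(model: str) -> float:
    """Placeholder: how well the model does on ordinary cases we understand."""
    return random.uniform(0.7, 1.0)

def score_checkable_testbeds(model: str) -> float:
    """Placeholder: performance on tricky problems we happen to be able to check."""
    return random.uniform(0.5, 1.0)

THRESHOLD = 0.9
MAX_ITERS = 5  # cap iteration so we don't just overfit the checks

for version in range(1, MAX_ITERS + 1):
    model = train_candidate(version)
    report = Report(score_normal_cases(model), score_checkable_testbeds(model))
    print(model, report)
    if report.normal_case_score >= THRESHOLD and report.hard_testbed_score >= THRESHOLD:
        # Passing these checks only helps if egregious misalignment is ruled out separately.
        print(f"{model} passes both checks")
        break
    # Checks indicate issues: revise the process for creating AIs and try again.
```

The iteration cap reflects the overfitting worry: at some point you stop tuning against the same test beds and instead investigate why the process keeps failing.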
What are you buying time to do? I don’t understand how you’re proposing spending the “3 years of time prior to needing to build substantially superhuman AIs”. Is it on alignment for those superhuman AIs?
The proposal is:
1. Buy a bunch of time.
2. Use this time to get human-obsoleting AIs we can defer to.
3. Also ensure that these AIs get a bunch of time (>3 years). Ask these AIs what we should do, including how we should spend this time. These AIs could end up deciding to try to stall for more time.
So the time could be spent on figuring out alignment for superhuman AIs, or on other things.
You mention having 3 years, but then you say “More generally, it just seems really heuristically scary to very quickly go from AIs which aren’t much smarter than the best humans to AIs which are wildly smarter in only a few years.” I found this confusing.
The 3 years comes after we have human-obsoleting AIs, which is our target end state. I’m imagining targeting more like 5-10 years for the overall acute risk period before we have to build wildly smarter AIs; I just want to make sure these AIs have enough time.
What do you mean by “a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point.” It seems easier to mitigate which risks prior to what point? And why? I didn’t follow this.
Misalignment risks seem much higher if AIs are much smarter than you. So, it is easier to reduce risks while effectively using AIs prior to the point where they are wildly superhuman.
Prior to wildly superhuman AI, you can likely use control, which makes a bunch of things much easier. But this doesn’t get you to the point where you can defer to such AIs. As for why I think actual alignment is easier for such AIs: it just seems much easier to ensure alignment when we can (at least roughly) understand the actions our AIs are taking. More generally, I think much smarter AIs will be much more likely to be schemers given what ends up being the default training approach.
It seems like I didn’t do a good job of explaining the exit plan!
I’ll need to do a better job of explaining this in the future. (I’ll respond to some of these specific points in a bit.)