Thanks for writing this! I appreciate the effort to make your perspective more transparent (and implicitly Anthropic’s perspective as well). In this comment, I’ll explain my two most important concerns with this proposal:
Fully state-proof security seems crucial at ASL-4, not only at ASL-5
You should have a clear exit plan and I disagree with what seems to be the proposal (fully deferring to the government on handling ASL-5 seems insufficient)
I have a variety of other takes on this proposal as well as various notes, but I decided to start by just writing up these two thoughts. I’ll very likely add more comments in a bit.
Fully state-proof security seems crucial at ASL-4, not only at ASL-5
When you have AIs which can massively accelerate total AI capabilities R&D production (>5x)[1], I think it is crucial that such systems are secure against very high-resource state actor attacks. (From the RAND report, this would be >=SL5[2].) This checklist seems to assume that this level of security isn’t needed until ASL-5, but you also say that ASL-5 is “broadly superhuman” so it seems likely that dramatic acceleration occurs before then. You also say various other things that seem to imply that early TAI could dramatically accelerate AI R&D.
I think having extreme security at the point of substantial acceleration is the most important intervention on current margins in short timelines. So, I think it is important to ensure that this is correctly timed.
As far as the ASL-4 security bar, you say:
early TAI will likely require a stronger ASL-4 standard, under which we need to be capable of defending against all but the most sophisticated nation-state-level attacks
This seems to imply a level of security of around SL4 from the RAND report, which would be robust to routine state actor attacks but not to top-priority attacks. This seems insufficient given the costs of such a system being stolen and how clearly appealing stealing it would be, as you note later:
We will need to be prepared for TAI-level model weights to be one of the most sought-after and geopolitically important resources in history.
This seems to imply that you’ll need more than SL5 security![3]
You should have a clear exit plan and I disagree with what seems to be the proposal
You say that at ASL-5 (after TAI):
Governments and other important organizations will likely be heavily invested in AI outcomes, largely foreclosing the need for us to make major decisions on our own. By this point, in most possible worlds, the most important decisions that the organization is going to make have already been made.
For us to judge whether this is a good overall proposal, it would help to have a clear exit plan including an argument that this exit state is acceptable. This allows us to discuss whether this exit state actually is acceptable and to analyze whether the earlier steps will plausibly lead to this state.
Further, it seems likely that the plans and proposals being written now are also the plans that the government might end up using! (After all, who will the relevant people in the US government ask for help in figuring out what to do?)
As far as I can tell, the proposed exit state you’re imagining is roughly “perfectly (scalably?) solve alignment (or just for substantially superhuman systems?) and then hand things off to the government”:
So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control.
(The proposal is also that we’ll have improved societal resilience and tried to address AI welfare concerns along the way. I left this out because it seemed less central.)
I worry about the proposal of “perfectly solve alignment”, particularly because you didn’t list any concrete plausible research bets which are being scoped out and tested in advance. (And my sense is that few plausible bets exist and all seem pretty likely to fail.)
Further, I think the default plan should be to ensure a delay (perhaps at least 5 years) prior to building wildly superhuman AI after full TAI. Wildly superhuman AI seems incredibly scary, even relative to AIs which are qualitatively comparably smart to top human experts. (This is not just for misalignment reasons, though this is part of it.) I think the “perfectly solve alignment then the government is in charge” plan plausibly leads to the default action being to build wildly superhuman AI as quickly as possible. So, I’d prefer a more specific plan which is opinionated about the transition to wildly superhuman AI.
My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them even on tricky, high-stakes questions where we can’t (or just don’t) check their outputs. For AIs to be human-obsoleting, they would need to outcompete top human experts, and thus probably can’t be much dumber than top human experts. Ideally, they also wouldn’t be much smarter than top human experts, and preferably they’d be notably dumber while compensating with speed and knowledge (though the capability profile of such systems might be very spiky in practice). Being able to defer to AIs to this extent (trusting them at least as much as well-regarded humans in terms of judgment and alignment) is stronger than just ruling out egregious misalignment (e.g., it requires that AIs are sufficiently philosophically competent and actually try on near-impossible-to-check questions), though I expect that ruling out egregious misalignment is the hardest part in practice. Further, once we have such human-obsoleting and deferrable AIs, we would aim to have more than 3 years before we need to build substantially superhuman AIs to remain competitive.
As part of achieving my proposed exit plan, I would also focus much more on buying time when AIs are this powerful or nearly this powerful. Using huge amounts of AI labor to buy time could be done prior to achieving the full desiderata of this exit plan, using approaches like AI Control. Part of this would involve increasing willingness to pay for coordination and non-proliferation, such as by making demonstrations of risks as you discuss in the section “Developing Clear Smoking Gun Demos for Emerging Risk Factors”. (That said, this seems focused on speculative risks, and I’d also potentially expand this to demonstrations of the most concerning capabilities from a national security perspective.) But beyond increasing willingness to pay, it also seems likely to be important to work (using vast amounts of AI labor) on reducing the cost of coordination and non-proliferation, such as by building better monitoring approaches (e.g., better satellite tracking of data centers), improving verification approaches, building up carrots for coordination, and potentially building up sticks to reduce the cost of enforcing non-proliferation.
I think this is a good proposal because it seems as though a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point. The AIs we defer to could end up recommending a wide variety of proposals (buying more time, emulated minds, developing totally different AI paradigms), not just doing huge amounts of research into aligning superintelligence. Such AIs would always have an option of trying to run themselves much faster to accomplish very difficult research which requires substantial serial time. We wouldn’t need to be able to check their work or their intermediate states which improves the situation considerably.
More generally, it just seems really heuristically scary to very quickly go from AIs which aren’t much smarter than the best humans to AIs which are wildly smarter in only a few years. Edit: so it would be good to buy time, first to ensure we have human-obsoleting AIs we can defer to, and then so that these AIs can have enough time to figure out what to do.
Buck and I are writing an overall short plan for (technical) AI safety, and in it we’ll discuss all aspects of this in more detail.
[1] I’m referring to AIs making the humans working on improving AI from the software side >5x more productive. It is plausible (perhaps 40% in short timelines) that algorithmic advances will in practice slow down dramatically such that software improvements aren’t an important input, but I think you should be robust to the likely outcome in which AI R&D is very important.
[2] Note that SL5 is not ASL-5!
[3] I think you likely need substantially more than SL5 security (security sufficient to resist unprecedentedly well-resourced attacks) within a short period after you have AIs which can massively (>10x) accelerate R&D in key domains (AI, weapons, cyber offense), but you don’t need it immediately. It’s probably a good 80/20 to aim for SL5 and then greater security within a year. This could plausibly be insufficient and is probably objectively unacceptably risky (e.g., maybe the RSP should demand something stronger, with a target of ensuring less than a 5% chance that the key models are stolen), but something like this seems to get most of the benefit, and even this 80/20 is still very ambitious.
I got a bit lost in understanding your exit plan. You write
My preferred exit plan is to build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer to them
Some questions about this and the text that comes after it:
How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you’re describing here. What techniques are you imagining using?
Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control “human-obsoleting” AIs?
Why do you “expect that ruling out egregious misalignment is the hardest part in practice”? That seems pretty counterintuitive to me. It’s easy to imagine descendants of today’s models that don’t do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn’t be “philosophically competent”.
What are you buying time to do? I don’t understand how you’re proposing spending the “3 years of time prior to needing to build substantially superhuman AIs”. Is it on alignment for those superhuman AIs?
You mention having 3 years, but then you say “More generally, it just seems really heuristically scary to very quickly go from AIs which aren’t much smarter than the best humans to AIs which are wildly smarter in only a few years.” I found this confusing.
What do you mean by “a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point.” It seems easier to mitigate which risks prior to what point? And why? I didn’t follow this.
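It seems like I didn’t do a good job of explaining the exit plan! I’ll need to do a better job of explaining this in the future. (I respond to the specific points below.)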
How do you achieve such alignment? You wrote that you worry about the proposal of perfectly + scalably solving alignment, but I worry about how to achieve even the imperfect alignment of human-ish-level AIs that you’re describing here. What techniques are you imagining using?
I would say a mixture of moonshots and “doing huge amounts of science”. Honestly, we don’t have amazing proposals here, so the main plan is to just do huge amounts of R&D with our AIs. I have some specific proposals, but they aren’t amazing.
I agree this is unsatisfying, though we do have some idea of how this could work. (Edit: and I plan on writing some of this up later.)
I agree this is a weak point of this proposal, though notably, it isn’t as though most realistic proposals avoid a hole at least this large. :(
Why do these AIs need to be human-obsoleting? Why not just human-accelerating?
We could hit just accelerating (and not egregiously misaligned) at an earlier point, but I think it’s nice to explicitly talk about the end state. And, I think this is a good end state to aim for, as it allows for approximately full retirement of human technical work. It also allows for (e.g.) working toward running a whole AI research program for the equivalent of 1000s of subjective years, because these AIs don’t need human help to function. I think this probably isn’t needed (we probably need much less time), but it is a somewhat nice option to have.
Why does your exit plan involve using powerful and aligned AIs to prepare for superintelligence, rather than merely using controlled AIs of that capability level? Do you think that it would be hard/dangerous to try to control “human-obsoleting” AIs?
I’m imagining that we’ll already have been controlling AIs as capable as the final AIs we target; the change is just that we should now actually be willing to defer to them. So, I think control should be doable for AIs of that capability level.
Further, I’d like to note that I’m not necessarily imagining that we have to go for superintelligence as opposed to other next objectives. The AIs can figure this out.
Here’s one possible progression:
AIs that substantially accelerate R&D, we use control
AIs that seem as smart as top-human experts, we use control
AIs that seem as smart as top-human experts, but we’ve ruled out egregious misalignment and could back off of control
AIs that seem as smart as top-human experts, but we are justifiably willing to defer to them on everything
Why do you “expect that ruling out egregious misalignment is the hardest part in practice”? That seems pretty counterintuitive to me. It’s easy to imagine descendants of today’s models that don’t do anything egregious but have pretty different values from me and/or the general public; these AIs wouldn’t be “philosophically competent”.
I’m not imagining these AIs being our successors which do what they want—the goal is more that they do a good job of obeying instructions in a reasonable and high-integrity way.
I don’t think the values of the AIs matter much if we ensure they do a good job of obeying instructions in a reasonable way.
And note that if we avoid egregious misalignment, then what we see in evaluations and training is roughly what we’ll get—the AIs won’t be conspiring against our testing.
When I say “philosophically competent”, I mean good at thinking through tricky and more philosophical questions. This requires AIs to understand our values, but not necessarily to share them. (E.g. we might prefer our AIs to be more myopic / deontologically non-powerseeking in various ways.)
As far as why I think this is the hardest part, I think it might suffice to:
Rule out egregious misalignment
Test if your AIs do good things in normal cases you understand (which should be most cases, as these AIs aren’t qualitatively smarter than humans)
Check if AIs are good at tricky hard-to-check philosophy-style problems by creating test beds that we happen to be able to check well for various reasons (e.g. string theory)
If checks indicate issues, iterate on our process for creating AIs. Also do some science on when these sorts of checks might get overfit, and avoid issues from this. (A rough sketch of what such a testbed check could look like is below.)
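As a very rough illustration of the testbed idea (this is just a sketch; the function names, dataset structure, and thresholds are hypothetical placeholders, not a concrete proposal):

```python
import random

def evaluate_on_testbed(ask_ai, testbed):
    """Fraction of hard-to-check-style questions the AI handles well.

    `testbed` is a list of (question, is_good_answer) pairs, where is_good_answer
    is a checker we only have because of special structure (e.g., settled technical
    questions), not because these questions are easy to check in general.
    """
    n_good = sum(1 for question, is_good_answer in testbed
                 if is_good_answer(ask_ai(question)))
    return n_good / len(testbed)

def iterate_with_holdout(make_candidate_ai, testbed, dev_frac=0.5,
                         target=0.9, max_iters=10):
    """Iterate on the AI-creation process against a dev split; audit on a held-out split."""
    shuffled = random.sample(testbed, len(testbed))
    n_dev = int(len(shuffled) * dev_frac)
    dev, holdout = shuffled[:n_dev], shuffled[n_dev:]

    for i in range(max_iters):
        ask_ai = make_candidate_ai(i)  # e.g., a new training / elicitation variant
        dev_score = evaluate_on_testbed(ask_ai, dev)
        if dev_score < target:
            continue  # keep iterating on the creation process
        holdout_score = evaluate_on_testbed(ask_ai, holdout)
        # A large dev/holdout gap suggests we've overfit our checks rather than
        # actually gotten an AI that handles hard-to-check questions well.
        if holdout_score >= target - 0.05:
            return ask_ai, dev_score, holdout_score
    return None
```

The main design point is the held-out split: if iterating on the creation process yields candidates that do well on the dev testbeds but poorly on held-out ones, that’s evidence we’ve overfit our checks rather than actually gotten AIs that handle hard-to-check questions well.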
What are you buying time to do? I don’t understand how you’re proposing spending the “3 years of time prior to needing to build substantially superhuman AIs”. Is it on alignment for those superhuman AIs?
The proposal is:
Buy a bunch of time.
Use this time to get human-obsoleting AIs we can defer to.
Also ensure that these AIs get a bunch of time (>3 years). Ask these AIs what we should do, including how we should spend this time. These AIs could end up deciding to try to stall for more time.
So the time could be spent on figuring out alignment of superhuman AIs, or on other stuff.
You mention having 3 years, but then you say “More generally, it just seems really heuristically scary to very quickly go from AIs which aren’t much smarter than the best humans to AIs which are wildly smarter in only a few years.” I found this confusing.
The 3 years is after we have human-obsoleting AIs, which is our target end state. I’m imagining targeting more like 5-10 years for the overall acute risk period until we have to build wildly smarter AIs. I just want to make sure these AIs have enough time.
What do you mean by “a high fraction of risk comes from building wildly superhuman AI and it seems much easier to mitigate risks prior to this point.” It seems easier to mitigate which risks prior to what point? And why? I didn’t follow this.
Misalignment risks seem much higher if AIs are much smarter than you. So, it is easier to reduce risks while effectively using AIs prior to the point where they are wildly superhuman.
Prior to wildly superhuman AI, you can likely use control, which makes a bunch of stuff much easier. But this doesn’t get you to the point where you can defer to such AIs. As far as why I think actual alignment is easier for such AIs: it just seems much easier to ensure alignment if we can understand the actions our AIs are taking (at least roughly). More generally, I just think much smarter AIs will be much more likely to be schemers given what ends up being the default training approach.
@ryan_greenblatt one thing I’m curious about is when/how the government plays a role in your plan.
I think Sam is likely correct in pointing out that the influence exerted by you (as an individual), Sam (as an individual), or even Anthropic (as an institution) likely goes down considerably if/once governments get super involved.
I agree with your point that having an exit plan is still valuable (and indeed I do expect governments to ask technical experts for their opinions on what to do, though I also expect a bunch of DC people who know comparatively little about frontier AI systems but have long-standing relationships in the national security world to have a lot of influence).
My guess is that you think heavy government involvement should occur before/during the creation of ASL-4 systems, since you’re pretty concerned about risks from ASL-4 systems being developed in non-SL5 contexts.
In general, I’d be interested in seeing more about how you (and Buck) are thinking about policy stuff + government involvement. My impression is that you two have spent a lot of time thinking about how AI control fits into a broader strategic context, with that broader strategic context depending a lot on how governments act/react.
And I suspect readers will be better able to evaluate the AI control plan if some of the assumptions/expectations around government involvement are spelled out more clearly. (Put differently, I think it’s pretty hard to evaluate “how excited should I be about the AI control agenda” without understanding who is responsible for doing the AI control stuff, what’s going on with race dynamics, etc.)
My guess is that you think heavy government involvement should occur before/during the creation of ASL-4 systems, since you’re pretty concerned about risks from ASL-4 systems being developed in non-SL5 contexts.
Yes, I think heavy government involvement should occur once AIs can substantially accelerate general-purpose R&D and AI R&D in particular. I think this occurs at some point during ASL-4.
In practice, there might be a lag between when the government should get involved and when it really does get involved, so I think companies should be prepared to implement SL5 without heavy government assistance. I think SL5 will involve massive operating costs, particularly if implemented on short notice, but it should be possible for a competent actor to implement with a big effort.
(I’m also somewhat skeptical that the government will actually be that helpful in implementing SL5 relative to just hiring people with the relevant expertise, who will often have formerly worked for various governments. The difficulty of SL5 implementation also depends heavily on what costs you’re willing to accept: full airgapping is conceptually simple and should be workable, but it prevents serving a public API.)
In general, I’d be interested in seeing more about how you (and Buck) are thinking about policy stuff + government involvement.
I don’t think we should get into this here, but we are in fact thinking about these topics and will likely discuss this more in future posts.
And I suspect readers will be better able to evaluate the AI control plan if some of the assumptions/expectations around government involvement are spelled out more clearly.
Agreed, though I think that “do something like control” is more robust than “the AI control plan” (which we haven’t even really clearly spelled out publicly, though we do have something in mind).
As far as security, perhaps part of what is going on is that you expect that achieving this high bar of security is too expensive:
ASL-4 is much more demanding and represents a rough upper limit on what we expect to be able to implement without heavily interfering with our research and deployment efforts.
My sense is indeed that SL5-level security would be a large tax to operate under, particularly when implemented in a hurry. However, I think this is also a natural point at which national security concerns become large and commercialization is likely to be greatly reduced.