I left an earlier comment on more important and clearer points, but I thought I would also leave a comment with some more tentative remarks, explanations of where I’m confused, and notes.
The story for ASL-4 safety seems unclear and/or unlikely to provide a “reasonable” level of safety
In this checklist, perhaps the most important target for ASL-4 technical safety measures (not including security) is having the measures to ensure that ASL-4 is safe with respect to misalignment. From what I could tell, you were imagining:
Hopefully mechanistic interpretability will allow for a high-confidence case for safety.
If that fails or is insufficient, then we’ll need other basic science such as “scaling trends of risk factors in model organisms, scaling trends of the effectiveness of oversight and monitoring, the basic science of generalization, novel honeypot-style evaluation methods, high-confidence ‘nerfing’ (i.e., capability deletion), and high-level less-mechanistic interpretability methods like influence functions”
It was unclear to me whether you expected control-style cases to work for an ASL-4/TAI safety case or if you expected it to only work earlier.
I think mechanistic interpretability is unlikely to provide a high degree of assurance (<10% probability of a 10x risk reduction, and <20% probability of a 3x risk reduction, where the risk reduction is relative to other, simpler methods). So, I think the first bullet will likely be unimportant. Then, the question is whether the second bullet will suffice or whether you are leaning on control.
I was generally a bit unclear on what you were imagining given that you didn’t describe a specific list of possible approaches and I generally didn’t see a clear story for how the safety case would work. At a more basic level, I expect that the main hope for the first TAIs will end up being AI control in short timelines, and this seems underemphasized if you think this, so likely we disagree or I ended up confused about what you’re imagining.
Personally, I’d advocate for using roughly three pillars for ASL-4:
AI control which can hopefully provide a greater than 10x risk reduction given substantial effort on implementation and some tax. (And this risk reduction can plausibly be mostly pretty evaluable.)
Model organisms style work where you look for evidence of egregious misalignment (both in constructed cases and by adversarially evaluating the actual production models you’ve deployed).
Ad hoc methods that seem like they might help (aka the bag of tricks), but which aren’t particularly principled and which might be very difficult to evaluate (e.g. coup probes, generally trying to make training less exploitable because maybe this reduces misalignment risk, perhaps influence functions, trying random stuff on model organisms and hoping for transfer).
It seems likely to me that this proposal I’m outlining doesn’t provide a reasonable level of assurance (e.g. <0.5% takeover/escape/severe sabotage risk throughout ASL-4), but I think it could work at the start of ASL-4 and is reasonably likely to be the best we’ve got in short timelines.
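As a purely illustrative back-of-envelope on why this plausibly falls short of a target like <0.5%: suppose you start from roughly the ~25% token-effort risk I discuss later in this thread and assume the three pillars give roughly independent multiplicative risk reductions. Only the 10x figure for control comes from the discussion above; the other two factors and the independence assumption are made-up placeholders, not claims from the post or estimates I’d defend.

```python
# Toy calculation: how the three pillars might combine multiplicatively.
# Only the 10x control figure comes from the discussion above; the baseline
# risk, the other two factors, and the independence assumption are placeholders.

baseline_risk = 0.25  # roughly the token-effort risk discussed later in the thread

risk_reduction_factors = {
    "AI control": 10.0,                         # "greater than 10x risk reduction"
    "model organisms / adversarial evals": 1.5,  # placeholder
    "bag of tricks (probes, etc.)": 1.3,         # placeholder
}

residual_risk = baseline_risk
for pillar, factor in risk_reduction_factors.items():
    residual_risk /= factor

print(f"residual risk: {residual_risk:.2%}")          # ~1.28% with these numbers
print("meets <0.5% target:", residual_risk < 0.005)   # False
```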
Overall, I think that it will likely be quite hard to meet a reasonable level of assurance for ASL-4 with the technology we have, at least in slightly more pessimistic worlds (worlds where there are some warning signs or the architecture is somewhat less favorable). So, plans should bear this in mind. I think extensive planning for when you can’t meet reasonable safety targets seems quite important, including plans for when you find smoking gun evidence of egregious misalignment or plans for when the evidence is more confusing, but our countermeasures and tests aren’t sufficiently good to provide a reasonable degree of assurance.
I was confused by what capability level you were referring to by “early TAI” or I somewhat disagree with how lightly you treat TAI
You used the term “early TAI”. I would have thought this meant the first TAI, given that you defined TAI to be an AI which is a drop-in replacement for human workers, including in AI R&D. However, in the sections “Largely Solving Alignment Fine-Tuning for Early TAI” and “Rendering Early TAI Reliably Harmless”, you seemed to possibly imply that this was before TAI. In particular, you seemed to imply that scheming wouldn’t be a risk at the start of TAI (seems unclear to me given that the AIs are comparable to top human experts!) and you seemed to imply that such AIs wouldn’t speed up R&D that much (while I expect that drop-in replacements for top human AI researchers would pretty likely result in huge acceleration!).
Perhaps I was mostly confused by how you split things up across different headings?
This is also related to my earlier point on ASL-4 security.
“Intelligence explosion level” isn’t likely to be a meaningful threshold and TAI will likely greatly accelerate AI R&D
You seem to define TAI as “AI that could form as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D”. This seems to imply that TAI would likely greatly accelerate AI R&D to the extent that it is possible to greatly accelerate with human labor (which seems like an assumption of the piece).
But later you say:
AI R&D is not automated to the point of allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible, but AI-augmented R&D is very significantly speeding up progress on both AI capabilities and AI safety.
I would have guessed that, given this definition of TAI and the assumptions of the piece, you would expect >10x acceleration of AI R&D (perhaps 30x).
Further “allowing the kind of AI self-improvement that would lead to an intelligence explosion, if such a thing is possible” doesn’t seem like an interesting or important bar to me. If progress has already accelerated and the prediction is that it will keep accelerating, then the intelligence explosion is already happening (though it is unclear what the full extent of the explosion will be).
You seem to have some idea in mind of “intelligence-explosion level”, but it was unclear to me what this was supposed to mean.
Perhaps you mean something like “could the AIs fully autonomously do an intelligence explosion”? This seems like a mostly unimportant distinction given that projects can use human labor (even including rogue AI-run projects). I think a more natural question is how much the AIs can accelerate things over a human baseline and also how easy further algorithmic advances seem to be (or if we end up effectively having very minimal further returns to labor on AI R&D).
Overall, I expect all of this to be basically continuous such that we are already reasonably likely to be on the early part of a trajectory which is part of an intelligence explosion and this will be much more so the case once we have TAI.
(That said, note that at TAI I still expect things will take potentially a few years and potentially much longer if algorithmic improvements fizzle out rather than resulting in a hyperexponential trajectory without needing further hardware.)
See also the takeoff speeds report by Tom Davidson and this comment from Paul.
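To make “how much the AIs can accelerate things over a human baseline” concrete, here is a minimal toy model in the spirit of the takeoff speeds report. Both parameters (the labor multiplier and the non-labor bottleneck fraction) are made-up illustrations rather than estimates from the post or from me; the point is just that acceleration is a continuous quantity that saturates as labor stops being the bottleneck, rather than something that flips at a discrete “intelligence explosion level”.

```python
# Toy Amdahl's-law-style model of AI R&D acceleration over a human baseline.
# Parameters are arbitrary illustrations, not estimates.

def overall_speedup(labor_multiplier: float, non_labor_fraction: float) -> float:
    """Speedup of AI R&D if AIs multiply effective research labor by
    `labor_multiplier`, while `non_labor_fraction` of progress is bottlenecked
    by things labor can't speed up (compute, running experiments, etc.)."""
    return 1.0 / (non_labor_fraction + (1.0 - non_labor_fraction) / labor_multiplier)

for m in [2, 10, 100, 1000]:
    print(m, round(overall_speedup(m, non_labor_fraction=0.03), 1))
# Output rises smoothly (1.9x, 7.9x, 25.2x, 32.3x) and saturates near 1/0.03 = ~33x;
# there is no special threshold where an "explosion" switches on.
```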
A longer series of messy small notes and reactions
Here’s a long series of takes I had while reading this; this is ordered sequentially rather than by importance.
I removed takes/notes which I already covered above or in my earlier comment.
Here are some of the assumptions that the piece relies on. I don’t think any one of these is a certainty, but all of them are plausible enough to be worth taking seriously when making plans: Broadly human-level AI is possible. I’ll often refer to this as transformative AI (or TAI), roughly defined as AI that could form as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.[1]
FWIW, broadly human-level AI being possible seems near certain to me (98% likely).
If TAI is possible, it will probably be developed this decade, in a business and policy and cultural context that’s not wildly different from today.
I interpret this statement to mean P(TAI before 2030 | TAI possible) > 50%. This seems a bit too high to me, though it is also in a list of things called plausible so I’m a bit confused. (Style-wise, I was a bit confused by what seems to be probabilities of probabilities.) I think P(TAI possible) is like 98%, so we can simplify to P(TAI before 2030) which I think is maybe 35%?
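Spelling out the arithmetic behind the simplification (just a sketch using the numbers above):

```python
# P(TAI before 2030) = P(TAI before 2030 | TAI possible) * P(TAI possible).
# The 0.98 and 0.35 are my estimates from above; the >0.5 conditional is my
# reading of the piece's claim.

p_possible = 0.98
p_by_2030 = 0.35                           # my unconditional estimate
print(round(p_by_2030 / p_possible, 3))    # ~0.357: the conditional this implies,
                                           # well below the piece's >0.5
print(round(0.5 * p_possible, 3))          # 0.49: the unconditional that the piece's
                                           # conditional would imply, vs. my ~0.35
```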
Powerful AI systems could be extraordinarily destructive if deployed carelessly, both because of new emerging risks and because of existing issues that become much more acute. This could be through misuse of weapons-related capabilities, by disrupting important balances of power in domains like cybersecurity or surveillance, or by any of a number of other means.
I’m surprised this doesn’t mention misalignment.
Also, I find “misuse of weapons-related capabilities” somewhat awkward. Is this referring to countries using AI for military purposes, or to terrorists/lone actors using AIs for terrorism or similar? I think it’s unnatural to call countries using your AI for weapons R&D “misuse”, in the same way we don’t call stealing F-35 plans “misuse”.
Many systems at TAI and beyond, at least under the right circumstances, will be capable of operating more-or-less autonomously for long stretches in pursuit of big-picture, real-world goals. This magnifies these safety challenges.
Seems surprisingly weak if these systems are drop-in replacements for humans!
Our best models are broadly superhuman, warranting ASL-5 precautions, and they’re starting to be used in high-stakes settings. They’re able to take enormously impactful actions, potentially using real-world strategies or mechanisms that we deeply struggle to understand, at a pace we can’t keep up with.
Based on some aspects of how Sam talks about ASL-4, I’m tempted to interpret “broadly superhuman” to mean “at least as good as the best humans in all important domains”. However, the language he uses elsewhere and the specific words seem to actually mean “notably better than the best humans in all important domains”, or perhaps even “much better than the best humans in all important domains”. Some of the text later on is “first substantially superhuman models pose ASL-5-level risk”, which seems to imply “notably better” or “much better” to me.
For now, I’ll assume Sam means “notably better” or “much better” as this seems like the most likely interpretation.
I think you shouldn’t build AIs which are this smart until you’ve had a long time with models comparable to top human scientists (perhaps >5 years of delay) and you should be making plans to avoid building such models and handle things with human-ish level models. I think you should try to avoid building AIs which are smarter than top human scientists and I don’t see a strong reason to expect you’ll need stronger AIs. (More on this...)
If Sam actually means “at least as good”, then I disagree with the comments “if we have adequate safeguards available, that is probably only because we saw a surge of AI-accelerated safety R&D in Chapter 2” and “This is the endgame for our AI safety work: If we haven’t succeeded decisively on the big core safety challenges by this point, there’s so much happening so fast and with such high stakes that we are unlikely to be able to recover from major errors now.” If Sam means “notably better” or “much better”, then I think this seems about right. However, in this case, my comments about avoiding building such systems apply.
In particular, we should build external safeguards around our AI systems that are sufficient to prevent them from doing any serious harm, even if they are trying to cause serious harm.
+1
If they’re significantly too strict and trigger a clearly unwarranted pause, we pay a huge cost and threaten our credibility for no substantial upside.
Given how easy the RSP is to amend with approximately zero cost in most cases like this (clearly unwarranted pause), I think this is unlikely to happen in practice. I guess I could imagine some more specific commitment you make like the ASL-3 commitments being a problem, but I roughly expect procedural commitments for ASL-4 and beyond. (Other than perhaps security.)
I generally think we’re unlikely to meet a “reasonable” risk threshold for the key regimes and thus think that a lot of the action is in being super clear about loosening the RSP and exactly what that entails. Plausibly, reasonable actors should self-destruct in these cases, though I’m currently skeptical this will happen for any actor and I also lean toward this being a bad idea.
Overall, I’d like the structure of RSPs to look more like a procedural commitment than a specific commitment to countermeasures. You should explain your evaluations in advance and notify the public if you deviate from them.
One of Anthropic’s main distinguishing safety research bets is that we expect a deep effort into mechanistic interpretability to produce a near-uniquely valuable source of evidence about safety. Major successes in this direction, even if they fall short of our north-star enumerative safety goal (roughly, proving that a model has some property) would likely form some of the highest-confidence core pieces of a safety case.
I’m skeptical and this seems pretty unlikely.
(I think something involving model internals could end up being pretty useful, but I don’t expect that Anthropic’s ongoing work in mech interp will help much with this.)
Safety cases for most deployments (i.e., any deployment where the model could be used for high-stakes tasks) will need to include evidence that our safety measures are highly robust.
I don’t think this is strictly necessary (see here), but it would make it easier.
quantitatively forecast the effectiveness of our mitigations
This section seems to treat forecasting as key. Why? Can’t we just check whether the mitigations are sufficient at the time, without needing to extrapolate, which imposes additional difficulties? I agree forecasting is useful.
Protecting algorithmic secrets [...] It is unclear how important this will be in the Chapter 1 regime, but since it is both quite difficult and likely to become quite important in Chapter 2, it is worth investing in significantly, if only as practice.
It seems like the proposal is to de facto assume current algorithmic secrets will be stolen by competent actors eventually. This is my guess for what Anthropic should do at the moment, but it is worth remarking on how scary this is.
We will likely not trust models with full high-bandwidth access to modify our infrastructure and codebase (barring major breakthroughs in the degree to which we can verify alignment-related properties of models), so engineer time will still be a binding constraint on a lot of what we do.
This seems to be implying that you won’t be able to get massive AI speed-ups because you won’t trust AIs in many domains. I think this will likely be seen as unacceptable and thus you should prepare for how to deploy AIs in these domains. (See also AI Control.)
Providing an independent perspective on any key safety decisions from people who were not involved in the development or execution of our plans. They are ultimately responsible for signing off on high-stakes decisions, like deployments of new frontier models.
I don’t think the board provides a real independent perspective at the moment, and I’m unsure if it ever will. (See also Zach’s comment.) The LTBT will likely just defer to the board in the current status quo (they should be appointing the board by ASL-4 from my understanding), so there is really just the board. I don’t think we should trust the board of Anthropic to be an independent check on Anthropic unless the specific board members have well-built up and independent views on AI safety (and ideally a majority of such board members). (By default I think Anthropic leadership will effectively control the board and convince the board that they should defer to experts who agree with the leadership.)
I think all of this is hard to achieve even if it was a high priority, but the bottom line is that the board is unlikely to be an independent check. (Various employees could potentially be an independent check.)
On that note, I think the most urgent safety-related issue that Anthropic can’t directly address is the need for one or, ideally, several widely respected third-party organizations that can play this adjudication role competently.
Strong +1
we as an organization are very limited in what we can do to make this happen.
I disagree; I think there are things Anthropic could do that would help considerably. This could include:
Actively encouraging prospective employees to start or join third-party organizations rather than join Anthropic in cases where the employee might be interested in this and this could be a reasonable fit.
Better model access (either for anyone, just researchers, or just organizations with aspirations to become adjudicators)
Higher levels of certain types of transparency, e.g. being more transparent about the exact details of safety cases, or open-sourcing evals (probably you just want to provide random IID subsets of the eval, or to share high-level details and then share the exact implementation on request; see the sketch below).
I’m not sure exactly what is good here, but I don’t think Anthropic is as limited as you suggest.
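As a minimal sketch of the “random IID subsets” idea (the file names and item format here are hypothetical, not any particular eval):

```python
# Release a random IID subset of an eval while holding the rest back.
# File names and item format are hypothetical.
import json
import random

with open("eval_items.jsonl") as f:  # hypothetical full eval set, one item per line
    items = [json.loads(line) for line in f]

rng = random.Random(0)  # fixed seed so the released subset is reproducible
public_subset = rng.sample(items, k=len(items) // 10)  # ~10% IID sample

with open("eval_items_public.jsonl", "w") as f:
    for item in public_subset:
        f.write(json.dumps(item) + "\n")
# Third parties see a representative sample of what the eval measures while the
# held-out items stay private.
```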
This could lead us to cancel or dramatically revise major deployments. Doing so will inevitably be costly and could risk our viability in the worst cases
I think Anthropic should either heavily plan for scenarios where it self-destructs as an organization or should be prepared to proceed with a plan B that doesn’t meet an objectively reasonable safety bar (e.g. <1% lifetime takeover risk). (Or both.)
Developing Methods to Align a Substantially Superhuman AI
I think you should instead plan on not building such systems as there isn’t a clear reason why you need such systems and they seem super dangerous. That’s not to say that you shouldn’t also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.
Addressing AI Welfare as a Major Priority
I lean against focusing on AI welfare as a major priority. I’d target perhaps 2.5% of resources on AI welfare in the absence of outside pressure. (Zach also made a comment on this and my earlier post on AI welfare talks about resource allocation considerations some.)
I agree with large parts of this comment, but am confused by this:
I think you should instead plan on not building such systems as there isn’t a clear reason why you need such systems and they seem super dangerous. That’s not to say that you shouldn’t also do research into aligning such systems, I just think the focus should instead be on measures to avoid needing to build them.
While I don’t endorse it due to disagreeing with some (stated and unstated) premises, I think there’s a locally valid line of reasoning that goes something like this:
if Anthropic finds itself in a world where it’s successfully built not-vastly-superhuman TAI, it seems pretty likely that other actors have also done so, or will do so relatively soon
it is now legible (to those paying attention) that we are in the Acute Risk Period
most other actors who have or will soon have TAI will be less safety-conscious than Anthropic
if nobody ends the Acute Risk Period, it seems pretty likely that one of those actors will do something stupid (like turn over their AI R&D efforts to their unaligned TAI), and then we all die
not-vastly-superhuman TAI will not be sufficient to prevent those actors from doing something stupid that ends the world
unfortunately, it seems like we have no choice but to make sure we’re the first to build superhuman TAI, to make sure the Acute Risk Period has a good ending
This seems like the pretty straightforward argument for racing, and if you have a pretty specific combination of beliefs about alignment difficulty, coordination difficulty, capability profiles, etc, I think it basically checks out.
I don’t know what set of beliefs implies that it’s much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place. (In particular, how does this end up with the world in a stable equilibrium that doesn’t immediately get knocked over by the second actor to reach TAI?)
I don’t know what set of beliefs implies that it’s much more important to avoid building superhuman TAI once you have just-barely TAI, than to avoid building just-barely TAI in the first place.
It seems plausible that AIs which aren’t qualitatively much smarter than humans can be used reasonably effectively while keeping risk decently low (though still unacceptably risky in objective/absolute terms). Keeping risk low seems to require substantial effort, though it seems maybe achievable. Even with token effort, I think risk is “only” around 25% with such AIs because default methods likely avoid egregious misalignment (perhaps 30% chance of egregious misalignment with token effort and then some chance you get lucky, for roughly a 25% chance of risk overall).
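Spelling out that last bit of arithmetic (the conditional is just back-filled from the 30% and ~25% figures above, not something stated directly):

```python
# 30% chance of egregious misalignment with token effort, times the implied
# ~83% chance that misalignment actually leads to takeover/catastrophe
# (i.e. ~17% chance you "get lucky"), gives roughly 25% risk overall.

p_egregious_misalignment = 0.30
p_catastrophe_given_misaligned = 0.83  # implied back-fill, not a stated estimate
print(round(p_egregious_misalignment * p_catastrophe_given_misaligned, 2))  # 0.25
```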
Then given this, I have two objections to the story you seem to present:
AIs which aren’t qualitatively smarter than humans seem very useful and with some US government support could suffice to prevent proliferation. (Such AIs both greatly reduce the cost of non-proliferation and substantially increase willingness to pay, e.g. via demos.)
Plans that don’t involve US government support while building crazy weapons/defenses with wildly superhuman AIs involve committing massive crimes, and I think we should have a policy against this.
Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else’s crazily powerful AI. Building insane weapons/defenses requires US government consent (unless you’re committing massive crimes, which seems like a bad idea). Thus, you might as well go all the way to preventing much smarter AIs from being built (by anyone) for a while, which seems possible with some US government support and the use of these human-ish level AIs.
(Responding in a consolidated way just to this comment.)
Ok, got it. I don’t think the US government will be able and willing to coordinate and enforce a worldwide moratorium on superhuman TAI development, if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan. It might become more willing than it is now (though I’m not hugely optimistic), but I currently don’t think as an institution it’s capable of executing on that kind of plan and don’t see why that will change in the next five years.
Another way to put this is that the story for needing much smarter AIs is presumably that you need to build crazy weapons/defenses to defend against someone else’s crazily powerful AI.
I think I disagree with the framing (“crazy weapons/defenses”) but it does seem like you need some kind of qualitatively new technology. This could very well be social technology, rather than something more material.
Building insane weapons/defenses requires US government consent (unless you’re committing massive crimes, which seems like a bad idea).
I don’t think this is actually true, except in the trivial sense where we have a legal system that allows the government to decide approximately arbitrary behaviors are post-facto illegal if it feels strongly enough about it. Most new things are not explicitly illegal. But even putting that aside[1], I think this is ignoring the legal routes that a qualitatively superhuman TAI might find to end the Acute Risk Period, if it were so motivated.
(A reminder that I am not claiming this is Anthropic’s plan, nor would I endorse someone trying to build ASI to execute on this kind of plan.)
TBC, I don’t think there are plausible alternatives to at least some US government involvement which don’t require committing a bunch of massive crimes.
I think there’s a very large difference between plans that involve nominal US government signoff on private actors doing things, in order to avoid committing massive crimes (or to avoid the appearance of doing so), plans that involve the US government mostly just slowing things down or stopping people from doing things, and plans that involve the US government actually being the entity that makes high-context decisions about e.g. what values to optimize for, given a slot into which to put values.
I agree that stories which require building things that look very obviously like “insane weapons/defenses” seem bad, both for obvious deontological reasons and because I wouldn’t expect them to work well enough to be worth it even under “naive” consequentialist analysis.
if we get to just-barely TAI, at least not without plans that leverage that just-barely TAI in unsafe ways which violate the safety invariants of this plan
I’m basically imagining being able to use controlled AIs which aren’t qualitatively smarter than humans for whatever R&D purposes we want. (Though not applications like (e.g.) using smart AIs to pilot drone armies live.) Some of these applications will be riskier than others, but I think this can be done while managing risk to a moderate degree.
Bootstrapping to some extent should also be possible where you use the first controlled AIs to improve the safety of later deployments (both improving control and possibly alignment).
Is your perspective something like:
With (properly motivated) qualitatively wildly superhuman AI, you can end the Acute Risk Period using means which aren’t massive crimes despite not collaborating with the US government. This likely involves novel social technology. More minimally, if you did have a sufficiently aligned AI of this power level, you could just get it to work on ending the Acute Risk Period in a basically legal and non-norms-violating way. (Where e.g. super persuasion would clearly violate norms.)
I think that even having the ability to easily take over the world as a private actor is pretty norms violating. I’m unsure about the claim that if you put this aside, there is a way to end the acute risk period (edit: without US government collaboration and) without needing truly insanely smart AIs. I suppose that if you go smart enough this is possible though pre-existing norms also just get more confusing in the regime where you can steer the world to whatever outcome you want.
So overall, I’m not sure I disagree with this perspective exactly. I think the overriding consideration for me is that this seems like a crazy and risky proposal at multiple levels.
To be clear, you are explicitly not endorsing this as a plan nor claiming this is Anthropic’s plan.
Something like that, though I’m much less sure about “non-norms-violating”, because many possible solutions seem like they’d involve something qualitatively new (and therefore de-facto norm-violating, like nearly all new technology). Maybe a very superhuman TAI could arrange matters such that things just seem to randomly end up going well rather than badly, without introducing any new[1] social or material technology, but that does seem quite a bit harder.
I’m pretty uncertain about whether, if something like that ended up looking norm-violating, it’d be norm-violating like Uber was[2], or like super-persuasion. That question seems very contingent on empirical questions that I think we don’t have much insight into right now.
I’m unsure about the claim that if you put this aside, there is a way to end the acute risk period without needing truly insanely smart AIs.
I didn’t mean to make the claim that there’s a way to end the acute risk period without needing truly insanely smart AIs (if you put aside centrally-illegal methods); rather, that an AI would probably need to be relatively low on the “smarter than humans” scale to need to resort to methods that were obviously illegal to end the acute risk period.
[1] In ways that are obvious to humans.
[2] Minus the part where Uber was pretty obviously illegal in many places where it operated.
My proposal would roughly be that the US government (in collaboration with allies etc.) enforces that no one builds AIs which are qualitatively smarter than humans, and this should be the default plan.
(This might be doable without government support via coordination between multiple labs, but I basically doubt it.)
There could be multiple AI projects backed by the US+allies or just one; either could be workable in principle, though multiple seems tricky.
TBC, I don’t think there are plausible alternatives to at least some US government involvement which don’t require committing a bunch of massive crimes.
I have a policy against committing or recommending committing massive crimes.