I’m not sure if the definition of takeover-capable-AI (abbreviated as “TCAI” for the rest of this comment) in footnote 2 quite makes sense. I’m worried that too much of the action is in “if no other actors had access to powerful AI systems”, and not that much action is in the exact capabilities of the “TCAI”. In particular: Maybe we already have TCAI (by that definition) because if a frontier AI company or a US adversary was blessed with the assumption “no other actor will have access to powerful AI systems”, they’d have a huge advantage over the rest of the world (as soon as they develop more powerful AI), plausibly implying that it’d be right to forecast a >25% chance of them successfully taking over if they were motivated to try.
And this seems somewhat hard to disentangle from stuff that is supposed to count according to footnote 2, especially: “Takeover via the mechanism of an AI escaping, independently building more powerful AI that it controls, and then this more powerful AI taking over would” and “via assisting the developers in a power grab, or via partnering with a US adversary”. (Or maybe the scenario in the 1st paragraph is supposed to be excluded because current AI isn’t agentic enough to “assist”/”partner” with allies as opposed to just being used as a tool?)
What could a competing definition be? Thinking about what we care most about… I think two events especially stand out to me:
1. When would it plausibly be catastrophically bad for an adversary to steal an AI model?
2. When would it plausibly be catastrophically bad for an AI to be power-seeking and non-controlled?
Maybe a better definition would be to directly talk about these two events? So for example...
1. “Steal is catastrophic” would be true if:
(1a) “Frontier AI development projects immediately acquire good enough security to keep future model weights secure” has significantly less probability of AI-assisted takeover than
(1b) “Frontier AI development projects immediately have their weights stolen, and then acquire security that’s just as good as in (1a).”[1]
2. “Power-seeking and non-controlled is catastrophic” would be true if:
(2a) “Frontier AI development projects immediately acquire good enough judgment about power-seeking-risk that they henceforth choose to not deploy any model that would’ve been net-negative for them to deploy” has significantly less probability of AI-assisted takeover than
(2b) “Frontier AI development projects acquire the level of judgment described in (2a) 6 months later.”[2]
Where “significantly less probability of AI-assisted takeover” could be e.g. at least 2x less risk.
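In symbols (my own rough rendering, treating the 2x factor as just one illustrative choice of threshold), the two criteria would be:

$$\text{steal is catastrophic} \iff \Pr(\text{AI-assisted takeover} \mid \text{1b}) \ge 2\cdot\Pr(\text{AI-assisted takeover} \mid \text{1a})$$

$$\text{power-seeking and non-controlled is catastrophic} \iff \Pr(\text{AI-assisted takeover} \mid \text{2b}) \ge 2\cdot\Pr(\text{AI-assisted takeover} \mid \text{2a})$$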
[1] The motivation for assuming “future model weights secure” in both (1a) and (1b) is so that the downside of getting the model weights stolen imminently isn’t nullified by the fact that they’re very likely to get stolen a bit later regardless, since many interventions that would prevent model weight theft this month would also help prevent it in future months. (And also, we can’t contrast 1a’=”model weights are permanently secure” with 1b’=”model weights get stolen and are then default-level-secure”, because that would already have a really big effect on takeover risk, purely via the effect on future model weights, even though current model weights probably aren’t that important.)
[2] The motivation for assuming “good future judgment about power-seeking-risk” is similar to the motivation for assuming “future model weights secure” above. The motivation for choosing “good judgment about when to deploy vs. not” rather than “good at aligning/controlling future models” is that a big threat model is “misaligned AIs outcompete us because we don’t have any competitive aligned AIs, so we’re stuck between deploying misaligned AIs and being outcompeted”, and I don’t want to assume away that threat model.
I agree that the notion of takeover-capable AI I use is problematic and makes the situation hard to reason about, but I intentionally rejected the notions you propose as they seemed even worse to think about from my perspective.
Is there some reason why current AI isn’t TCAI by your definition?
(I’d guess that the best way to rescue your notion is to stipulate that the TCAIs must have >25% probability of taking over themselves. Possibly with assistance from humans, possibly by manipulating other humans who think they’re being assisted by the AIs — but ultimately the original TCAIs should be holding the power in order for it to count. That would clearly exclude current systems. But I don’t think that’s how you meant it.)
Oh sorry. I somehow missed this aspect of your comment.
Here’s a definition of takeover-capable AI that I like: the AI is capable enough that plausible interventions on known human-controlled institutions within a few months no longer suffice to prevent plausible takeover. (Which implies that making the situation clear to the world is substantially less useful, and that human-controlled institutions can no longer as easily get a seat at the table.)
Under this definition, there are basically two relevant conditions:
1. The AI is capable enough to itself take over autonomously. (In the way you defined it, but also not in a way where intervening on human institutions can still prevent the takeover; so, e.g., the AI just having a rogue deployment within OpenAI doesn’t suffice if substantial externally imposed improvements to OpenAI’s security and controls would defeat the takeover attempt.)
2. Or human groups can do a nearly immediate takeover with the AI, such that they could then just resist such interventions.
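Roughly, and using my own notation rather than anything precise: if $\mathcal{I}$ is the set of plausible interventions on known human-controlled institutions that can be carried out within a few months, and $\tau$ is whatever probability threshold counts as “plausible takeover”, then

$$\text{takeover-capable} \iff \min_{I \in \mathcal{I}} \Pr(\text{takeover} \mid I) > \tau,$$

i.e., even the best available intervention no longer brings the chance of takeover below the threshold. The two conditions above are just the two main ways this can end up holding (the AI on its own, or a human group together with the AI).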
I’ll clarify this in the comment.
Hm — what are the “plausible interventions” that would stop China from having >25% probability of takeover if no other country could build powerful AI? Seems like you either need to count a delay as successful prevention, or you need to have a pretty low bar for “plausible”, because it seems extremely difficult/costly to prevent China from developing powerful AI in the long run. (Where they can develop their own supply chains, put manufacturing and data centers underground, etc.)
Yeah, I’m intending to count delay as successful prevention.
I’m just trying to point at “the point at which even aggressive intervention by a bunch of parties might still be too late”.