When thinking about different techniques that aim to reduce AI risk from misaligned AIs, I find it helpful to explicitly consider multiple different levels of risk that the deployer of the AI might be going for, because different classes of safety techniques are appropriate for different risk levels. Four that I think it’s particularly useful to think about:
*The “safety case” regime.* Sometimes people talk about wanting to have approaches to safety such that if all AI developers followed these approaches, the overall level of risk posed by AI would be minimal. (These approaches are going to be more conservative than will probably be feasible in practice given the amount of competitive pressure, so I think it’s pretty likely that AI developers don’t actually hold themselves to these standards, but I agree with e.g. Anthropic that this level of caution is at least a useful hypothetical to consider.) This is the level of caution people are usually talking about when they discuss making safety cases. I usually operationalize this as the AI developer wanting to have <1% chance that their AIs escape in the first year of deployment, and <5% conditional on the model trying pretty hard to subvert their safety measures. (I think some AI developers are hoping to get lower levels of risk than this.)
*The rushed reasonable developer regime.* The much riskier regimes I expect, where even relatively reasonable AI developers are in a huge rush and so are much less able to implement interventions carefully or to err on the side of caution.
*The rushed unreasonable developer regime.* You’re advising an unreasonable AI developer that is incompetent at implementing safety measures and unwilling to put much effort into implementing these measures.
*The existential war regime*. You’re in an existential war with an enemy and you’re indifferent to AI takeover vs the enemy defeating you. This might happen if you’re in a war with a nation you don’t like much, or if you’re at war with AIs. Obviously this is a regime where you should be much more aggressive.
Another option is the extreme safety regime. Sometimes people talk about approaches to AI safety that aim to ensure that an AI takeover is basically inconceivable. I think that this is highly impractical. And at the risk of saying something controversial and off-topic that I’m not confident in, I’m also not sure that it is actually a healthy way to relate to a situation as confusing as the transformation of the world from the development of powerful AI. I am wary of attitudes to the world that would have led to opposing the creation of the internet, the printing press, or the industrial revolution, because I think that those things seem pretty good even though ex ante they looked pretty unpredictable. I’m in favor of trying to get particular sources of risk down to extremely low levels (e.g. I don’t mind pushing to reduce asteroid risk by an OOM, and I don’t mind trying to improve techniques that narrowly reduce a certain class of takeover risk), but I don’t love taking this attitude to the whole AI situation. My intuition here is related to Richard Ngo’s “it’s never your job to ‘ensure’ that a large-scale risk doesn’t occur” but isn’t exactly the same.
> *The rushed reasonable developer regime.* The much riskier regimes I expect, where even relatively reasonable AI developers are in a huge rush and so are much less able to implement interventions carefully or to err on the side of caution.
I object to the use of the word “reasonable” here, for similar reasons I object to Anthropic’s use of the word “responsible.” Like, obviously it could be the case that e.g. it’s simply intractable to substantially reduce the risk of disaster, and so the best available move is marginal triage; this isn’t my guess, but I don’t object to the argument. But it feels important to me to distinguish strategies that aim to be “marginally less disastrous” from those which aim to be “reasonable” in an absolute sense, and I think strategies that involve creating a superintelligence without erring much on the side of caution generally seem more like the former sort.
I think it makes sense to use the word “reasonable” to describe someone who is taking actions that minimize total risk, even if those actions aren’t what they’d take in a different situation, and even if various actors had made mistakes to get them into this situation.
(Also note that I’m not talking about making wildly superintelligent AI, I’m just talking about making AGI; my guess is that even when you’re pretty rushed you should try to avoid making galaxy-brained superintelligence.)
I agree it seems good to minimize total risk, even when the best available actions are awful; I think my reservation is mainly that in most such cases, it seems really important to say you’re in that position, so others don’t mistakenly conclude you have things handled. And I model AGI companies as being quite disincentivized from admitting this already—and humans generally as being unreasonably disinclined to update that weird things are happening—so I feel wary of frames/language that emphasize local relative tradeoffs, thereby making it even easier to conceal the absolute level of danger.

Yep that’s very fair. I agree that it’s very likely that AI companies will continue to be misleading about the absolute risk posed by their actions.
My preferred aim is for only the process that creates the first astronomically significant AI to need to follow the approach.[1] To the extent this was not included, I think this list is incomplete, which could make it misleading.
This could (depending on the requirements of the alignment approach) be more feasible when there’s a knowledge gap between labs, if that means the top lab can tolerate more of an alignment tax (more time to figure out how to make the aligned AI also be superintelligent despite the ‘tax’). But I’m not advocating for labs to race to be in that spot (and it’s not the case that all possible alignment approaches would be for systems of the kind that their private capabilities knowledge is about, e.g. LLMs).
> *The existential war regime*. You’re in an existential war with an enemy and you’re indifferent to AI takeover vs the enemy defeating you. This might happen if you’re in a war with a nation you don’t like much, or if you’re at war with AIs.
Does this seem likely to you, or just an interesting edge case or similar? It’s hard for me to imagine realistic-seeming scenarios where e.g. the United States ends up in a war where losing would be comparably bad to AI takeover. This is mostly because ~no functional states (certainly no great powers) strike me as so evil that I’d prefer ~~extinction~~ AI takeover to those states becoming a singleton, and for basically all wars where I can imagine being worried about this—e.g. with North Korea, ISIS, Juergen Schmidhuber—I would expect great powers to be overwhelmingly likely to win. (At least assuming they hadn’t already developed decisively-powerful tech, but that’s presumably the case if a war is happening.)
A war against rogue AIs feels like the central case of an existential war regime to me. I think a reasonable fraction of worlds where misalignment causes huge problems could have such a war.
It sounds like you think it’s reasonably likely we’ll end up in a world with rogue AI close enough in power to humanity/states to be competitive in war, yet not powerful enough to quickly/decisively win? If so I’m curious why; this seems like a pretty unlikely/unstable equilibrium to me, given how much easier it is to improve AI systems than humans.
I think having this equilibrium for a while (e.g. a few years) is plausible because humans will also be able to use AI systems. (Humans might also not want to build much more powerful AIs due to safety concerns, and might simultaneously be able to substantially slow down the rogue AIs’ self-improvement with compute limitations (and track self-improvement using other means).)
Note that by “war” I don’t necessarily mean that battles are ongoing. It is possible this mostly manifests as racing on scaling and taking aggressive actions to hobble the AI’s ability to use more compute (including via the use of the army and weapons, etc.).
Your comment seems to assume that AI takeover will lead to extinction. I don’t think this is a good thing to assume as it seems unlikely to me. (To be clear, I think AI takeover is very bad and might result in huge numbers of human deaths.)
I do basically assume this, but it isn’t cruxy so I’ll edit.