I basically agree with you about transformatively powerful open-weight models making the situation much more dangerous such that control is unlikely to suffice.
But, I think there are important control related options for open-weight models which do improve the situation to some extent.
In the open-weight regime, we realistically won’t be able to rule out AIs doing a small to moderate number of arbitrarily bad actions. (Either based on the aims of the humans running the AI or the aims of a scheming AI itself.) So, if the offense-defense balance is such that this is catastrophic, we’re absolutely in trouble. (E.g., if transformatively powerful AIs can pretty easily create catastrophically dangerous bioweapons without using much compute or needing very skilled human labor, we’re in big trouble unless we can somehow avoid this by restricting access to labs or something.)
But, we can still ensure:
If most compute is controlled by a smaller number of reasonable actors and these actors implement control (or other sufficient strategies), then due to compute limitations, AIs won’t be able to do an “arbitrary” number of bad actions.
Compute limitations might suffice for preventing rogue AIs or rogue humans from rapidly causing an intelligence explosion even if it would otherwise be possible. Thus, we might be able to delay substantially and use this time to improve the situation.
It might also suffice for avoiding the worst parts of AI-enabled cyberattacks, which might be heavily scale-dependent and not overly offense-dominant.
We can control the AIs used by this smaller number of actors, which could allow those AIs to be used in otherwise dangerous applications.
We can try to distribute easy-to-use control approaches widely, and possibly we can pass legal restrictions related to these approaches.
We can try to aggregate information to figure out if the AI is scheming and, if we learn that it is, we can try to take some countermeasures. To be clear, even if the AI isn’t scheming, it’s plausible that offense-defense issues (given the lack of time for adaptation) suffice to make this situation unrecoverable.
See also the strategy-stealing assumption and What does it take to defend the world against out-of-control AGIs?
Overall, my guess is that if the AI is open source, scheming, and transformatively powerful (using the sort of power threshold defined in this post) and our overall alignment related technology looks like it does today, the probability of human loss of control or >1 billion dead is about 80%.
If we have a magic solution to misalignment which makes it easy for anyone to avoid egregious misalignment from their perspective and which scales indefinitely (but such an AI is still open source), then my probability on the bad outcome condition above is more like 20%.
My probabilities are somewhere in between depending on how much of a technical solution we have and on the default level of misalignment. (And various responses from governments also affect this, etc.)
I agree with pretty much all of this and appreciate your clear framing of the issues at hand.
It seems like where our concerns differ is around these two issues:
I believe that the offense-defense balance for AI-enabled biorisk is such that a bad actor with an open-weight fine-tuned model could kill billions of people for less than $100k.
I don’t think that the fine-tuning or inference would require more than a single server with 8x GPUs (potentially even just 8x 4090s). So unless compute regulations monitor individual 4090 GPUs, they aren’t blocking inference or fine-tuning.
Training takes a bunch of servers (often millions or billions of dollars’ worth of hardware), and thus seems more plausible to monitor.
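As a rough sanity check on that hardware claim, here is a back-of-envelope sketch of whether a model’s weights fit in the ~192 GB of VRAM an 8x 4090 server provides. The parameter counts and bytes-per-parameter figures are illustrative assumptions on my part, and it ignores activation, KV-cache, and optimizer memory, so it is only a feasibility gut-check rather than a deployment plan.

```python
# Back-of-envelope: can an 8x RTX 4090 server (8 * 24 GB = 192 GB of VRAM)
# hold the weights of an open-weight model for inference or LoRA-style
# fine-tuning? Parameter counts and bytes-per-parameter figures below are
# illustrative assumptions; activation, KV-cache, and optimizer memory are
# ignored, so treat this as a rough feasibility check only.

GB = 1e9
TOTAL_VRAM_GB = 8 * 24  # 192 GB across eight 4090s

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to store the weights, in GB."""
    return n_params * bytes_per_param / GB

for n_params, label in [(70e9, "70B"), (180e9, "180B"), (400e9, "400B")]:
    for bytes_per_param, precision in [(2.0, "fp16"), (0.5, "4-bit")]:
        need = weight_memory_gb(n_params, bytes_per_param)
        verdict = "fits" if need <= TOTAL_VRAM_GB else "does NOT fit"
        print(f"{label} @ {precision}: ~{need:.0f} GB of weights -> "
              f"{verdict} in {TOTAL_VRAM_GB} GB")
```

On these assumptions, models up to a few hundred billion parameters fit on such a server at low precision, which is the regime the claim above is about.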
I haven’t heard any concrete proposals for compute monitoring at the level of 8x 4090 GPUs, have you?
I think compute monitoring for literally 8x 4090s is very likely to be way too hard against a reasonably committed adversary. (The hardware is already way too easy to access and broadly distributed. Also, I think if this causes problems, you might also have issues with people’s MacBooks, which would be super costly.)
My overall guess is that for realistically sized models (0.1-100 trillion parameters), we can’t prevent an ML-competent adversary from doing a small amount of fine-tuning and inference.
But, maybe there is some hope for models which are more like 100 trillion parameters? (Minimally, I think 8x 4090 isn’t going to work well for models of this size.)
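To gesture at why, here is a similarly rough sketch of how many 8x-4090 servers it would take just to hold the weights across the 0.1-100 trillion parameter range, assuming aggressive 4-bit quantization (an assumption of mine; real deployments also need memory for activations and the KV cache).

```python
# Rough scaling check for the 0.1-100 trillion parameter range: how many
# 8x-4090 servers (192 GB of VRAM each) are needed just to hold the weights
# at an aggressive 4-bit quantization? All numbers are illustrative
# assumptions; activations and KV cache would add further memory on top.

import math

GB = 1e9
SERVER_VRAM_GB = 8 * 24   # 192 GB per 8x 4090 server
BYTES_PER_PARAM = 0.5     # assume 4-bit weights

for n_params in [0.1e12, 1e12, 10e12, 100e12]:
    weights_gb = n_params * BYTES_PER_PARAM / GB
    servers = math.ceil(weights_gb / SERVER_VRAM_GB)
    print(f"{n_params / 1e12:>5.1f}T params: ~{weights_gb:,.0f} GB of weights "
          f"-> at least {servers} server(s) just for the weights")
```

Under these assumptions, the low end of the range fits on a single such server while the 100-trillion-parameter end needs hundreds of them, which is the sense in which 8x 4090s stop being enough at that scale.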
Yeah, the question of where the threshold of dangerous capabilities lies seems very important to whether we can hope to make compute governance part of restricting bad actors from using fine-tuning & inference to do bad things.
The reason this is important to try to predict is that we really want to avoid releasing powerfully dangerous open-source models which bad actors could use for bad things. Once those models have been released, there’s no taking them back. So if there’s a risk that a given model would be dangerous if released openly, such that bad actors could have a private copy to fine-tune, then the intervention point is to disallow the release of the model.