A very plausible view (though one that doesn’t seem to address misuse risks enough, I’d say) in favor of open-source models being net positive (including for alignment), from https://www.beren.io/2023-11-05-Open-source-AI-has-been-vital-for-alignment/:
‘While the labs certainly perform much valuable alignment research, and definitely contribute a disproportionate amount per-capita, they cannot realistically hope to compete with the thousands of hobbyists and PhD students tinkering and trying to improve and control models. This disparity will only grow larger as more and more people enter the field while the labs are growing at a much slower rate. Stopping open-source ‘proliferation’ effectively amounts to a unilateral disarmament of alignment while ploughing ahead with capabilities at full-steam.
Thus, until the point at which open source models are directly pushing the capabilities frontier themselves, I consider it extremely unlikely that releasing and working on these models is net-negative for humanity’
‘Much capabilities work involves simply gathering datasets or testing architectures where it is easy to utilize other closed models referenced in papers or through tacit knowledge of employees. Additionally, simple API access to models is often sufficient to build most AI-powered products rather than direct access to model internals. Conversely, such access is usually required for alignment research. All interpretability requires access to model internals almost by definition. Most of the AI control and alignment techniques we have invented require access to weights for finetuning or activations for runtime edits. Almost nothing can be done to align a model through access to the I/O API of a model at all. Thus it seems likely to me that by restricting open-source we differentially cripple alignment rather than capabilities. Alignment research is more fragile and dependent on deep access to models than capabilities research.’
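To make the ‘weights for finetuning or activations for runtime edits’ point concrete, here is a minimal sketch of an activation edit at inference time, assuming an open-weight Hugging Face model (gpt2 here) and a purely illustrative random steering vector; the choice of layer and scale is arbitrary. Nothing like this is possible through a closed I/O API:

# Minimal sketch: a runtime activation edit requires weight-level access.
# The steering vector is random and purely illustrative, not a real intervention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

layer = model.transformer.h[6]                  # an arbitrary middle block
steer = torch.randn(model.config.n_embd) * 0.1  # hypothetical steering direction

def edit_activations(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    return (output[0] + steer,) + output[1:]

handle = layer.register_forward_hook(edit_activations)
ids = tok("Open-weight models allow", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()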
Current open source models are not themselves any kind of problem. Their availability accelerates timelines and helps with alignment along the way. If there is no moratorium, this might be net positive. If there is a moratorium, it’s certainly net positive, as it’s the kind of research that the moratorium is buying time for, and it doesn’t shorten timelines because they are guarded by the moratorium.
It’s still irreversible proliferation even when the impact is positive. The main issue is open source as an ideology that unconditionally calls for publishing all the things, and refuses to acknowledge the very unusual situations where not publishing things is better than publishing things.
I believe from my work on dangerous capabilities evals that current open source models do provide some small amount of uplift to bad actors. This uplift is growing much faster than linearly with each new, more capable open source model that is released. If we want to halt this irreversible proliferation before it goes so far that human civilization gets wiped out, we need to act fast.
Alignment research is important, but misalignment is not the only threat we face.
One thing that comes to mind is test-time compute, and Figure 3 of the Language Monkeys paper is quite concerning, where even Pythia-70M (with an “M”) is able to find signal on problems that at first glance it obviously couldn’t make heads or tails of (see also). If there is an algorithmic unlock, a Llama-3-405B (or Llama-4) might suddenly get much more capable if fed a few orders of magnitude more inference compute than normal. So current impressions of model capabilities can be misleading about what those models eventually enable with future algorithms and still-affordable amounts of compute.
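For intuition on why repeated sampling changes the picture, here is a small sketch of the standard pass@k (coverage) estimator used in that line of work; the 1% per-attempt success rate below is an invented illustration, not a measurement of any particular model:

# Sketch: coverage (pass@k) as a function of samples per problem.
# The 1% per-attempt success rate is an assumed, illustrative number.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of getting at least one success in k samples,
    given c successes observed in n attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

n_attempts, n_correct = 10_000, 100   # model succeeds on ~1% of attempts
for k in (1, 10, 100, 1_000):
    print(f"pass@{k} ~ {pass_at_k(n_attempts, n_correct, k):.3f}")
# Coverage climbs from ~0.01 at k=1 toward ~1.0 at k=1,000.

The point is just that a weak model plus a cheap verifier and a large sampling budget can look very different from the same model judged on a single sample.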
Excellent point, Vladimir. My team has been thinking a lot about this issue. What if somebody leaked the latest AlphaFold, along with instructions on how to make good use of it? If you could feed the instructions into an existing open-source model and get functional Python code out to interact with the private AlphaFold API you set up… that’s a whole lot more dangerous than an LLM alone!
As the whole space of ‘biological design tools’ (h/t Anjali for this term to describe the general concept) gets more capable and complex, the uplift from an LLM that can help you navigate and utilize these tools gets more dangerous. A lot of these computational tools are quite difficult to use effectively for a layperson, yet an AI can handle them fairly easily if given the documentation.
Hmm, I’d be curious if you can share more, especially on the gradient of the uplift with new models.
Sure. My specific work is on biorisk evals. See WMDP.ai
Closed API models leak a bit of biorisk info, but mainly aren’t that helpful for creating bioweapons (so far as I am able to assess).
Uncensored open-weight models are a whole different ball game. They can be made to be actively quite harm-seeking / terrorist-aligned, and also quite technically competent, such that they make researching and engaging in bioweapons creation substantially easier. The more capable the models get, the more reliable their information is, the higher quality their weapon design ideation is, the more wetlab problems they are able to help you troubleshoot, the more correct their suggestions for getting around government controls and detection methods… and so on. My point is that the expected risk/harm increases non-linearly as open-weight model capabilities increase.
Part of this is that they cross some accuracy/reliability threshold on increasingly tricky tasks, where you can at least generate a bunch of generations and pick the most agreed-upon idea. Whereas, if correct advice is given only around 5% of the time, you can be pretty sure that someone who didn’t already know it was the correct advice would have no way to pick it out. As soon as the most frequent opinion is correct, you are in a different sort of hazard zone (e.g. imagine a single correct idea coming up 30% of the time for a particular question, with all the other ideas, most of them incorrect, coming up 5% of the time each).
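To illustrate the threshold I mean, here is a toy sketch, with invented answer distributions, of why “sample a bunch and take the most agreed-upon answer” flips from useless to reliable once the correct answer becomes the modal one:

# Toy sketch of majority-vote selection over sampled answers.
# The answer distributions are invented for illustration, not taken from any eval.
import random
from collections import Counter

def majority_pick(dist, n_samples=50):
    """Sample answers from a distribution and return the most common one."""
    answers = random.choices(list(dist), weights=list(dist.values()), k=n_samples)
    return Counter(answers).most_common(1)[0][0]

# Correct answer appears only ~5% of the time, like every other option:
rare = {"correct": 0.05, **{f"wrong_{i}": 0.95 / 19 for i in range(19)}}
# Correct answer is modal at 30%, wrong ideas at ~5% each:
modal = {"correct": 0.30, **{f"wrong_{i}": 0.70 / 14 for i in range(14)}}

for name, dist in [("rare", rare), ("modal", modal)]:
    hits = sum(majority_pick(dist) == "correct" for _ in range(200))
    print(name, hits / 200)
# The "rare" case picks the correct answer ~5% of the time; the "modal" case
# picks it nearly always, even though it is still wrong 70% of the time per sample.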
Also, it matters a lot whether the model routinely fails at ‘critical fail steps’ versus ‘cheap retry steps’. There are a lot of tricky wetlab steps where you can easily fail and the model can provide only limited help, but the reason for failure is clear after a bit of internet research and the failure doesn’t waste much in the way of resources. Such ‘noncritical failures’ are very different from ‘critical failures’, such as failing to obfuscate your internet purchases such that a government agency detects your activity and decides to investigate you more closely.
Thanks! How optimistic/excited would you be about research in the spirit of Tamper-Resistant Safeguards for Open-Weight LLMs, especially given that banning open-weight models seems politically unlikely, at least for now?
Extremely excited about the idea of such research succeeding in the near future! But skeptical that it will succeed in time to be at all relevant. So my overall expected value for that direction is low.
Also, I think there’s probably a very real risk that the bird has already flown the coop on this. If you can cheaply modify existing open-weight models to be ‘intent-aligned’ with terrorists, and to be competent at using scaffolding that you have built around ‘biological design tools’… then the LLM isn’t really a bottleneck anymore. The irreversible proliferation has occurred already. I’m not certain this is the case, but I’d give it about 75%.
So then you need to make sure that better biological design tools don’t get released, and that more infohazardous virology papers don’t get published, and that wetlab automation tech doesn’t get better, and… the big one… that nobody releases an open-weight LLM so capable that it can successfully create tailor-made biological design tools. That’s a harder thing to censor out of an LLM than getting it to directly not help with biological weapons! Creating biological design tools touches on a lot more things, like the model’s machine-learning knowledge and coding skill. What exactly do you censor to make a model helpful at building purely-good tools but not at building dual-use tools?
Basically, I think it’s a low-return area entirely. I think humanity’s best bet is in generalized biodefenses, plus an international ‘Council of Guardians’ which uses strong tool-AI to monitor the entire world and enforce a ban on:
a) self-replicating weapons (e.g. bioweapons, nanotech)
b) unauthorized recursive-self-improving AI
Of these threats only bioweapons are currently at large. The others are future threats.