Hmm, I’d be curious if you can share more, especially on the gradient of the uplift with new models.
Sure. My specific work is on biorisk evals. See WMDP.ai
Closed API models leak a bit of biorisk info, but mainly aren’t that helpful for creating bioweapons (so far as I am able to assess).
Uncensored open-weight models are a whole different ball game. They can be made to be actively quite harm-seeking / terrorist-aligned, and also quite technically competent, such that they make researching and engaging in bioweapons creation substantially easier. The more capable the models get, the more reliable their information is, the higher quality their weapon design ideation is, the more wetlab problems they are able to help you troubleshoot, the more correct their suggestions for getting around government controls and detection methods… and so on. My claim is that expected risk/harm increases non-linearly as open-weight model capabilities increase.
Part of this is that they cross some accuracy/reliability threshold on increasingly tricky tasks, past which you can at least sample a bunch of generations and pick the most agreed-upon idea. Whereas, if you see that correct advice is given, but only around 5% of the time, then you can be pretty sure that someone who didn’t already know it was the correct advice would not know how to pick it out. As soon as the most frequent opinion is correct, you are in a different sort of hazard zone (e.g. imagine if a single correct idea came up 30% of the time in response to a particular question, and all the other ideas, most of them incorrect, came up 5% of the time each).
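To make the threshold point concrete, here is a rough simulation sketch. It is purely illustrative: the two answer distributions are made-up numbers matching the example above (a correct answer at 30% against fourteen alternatives at 5% each, versus a correct answer at 5% alongside a wrong answer at 30%), and the function name `plurality_hit_rate` is my own, not from any eval suite. It estimates how often "pick the most common answer" lands on the correct one.

```python
import random
from collections import Counter

def plurality_hit_rate(probs, correct_index, n_samples, n_trials=2000, seed=0):
    """Estimate how often the most frequent sampled answer is the correct one."""
    rng = random.Random(seed)
    answers = list(range(len(probs)))
    hits = 0
    for _ in range(n_trials):
        # Draw n_samples independent answers from the assumed distribution.
        draws = rng.choices(answers, weights=probs, k=n_samples)
        counts = Counter(draws)
        top = max(counts.values())
        winners = [a for a, c in counts.items() if c == top]
        # Ties for first place are broken uniformly at random.
        if rng.choice(winners) == correct_index:
            hits += 1
    return hits / n_trials

# Hypothetical distributions; index 0 is the correct answer in both.
frequent_correct = [0.30] + [0.05] * 14        # correct answer is the mode
rare_correct = [0.05, 0.30] + [0.05] * 13      # a wrong answer is the mode

for n in (1, 10, 100):
    print(n,
          plurality_hit_rate(frequent_correct, 0, n),
          plurality_hit_rate(rare_correct, 0, n))
```

Under these assumed numbers, more samples push the plurality pick toward the correct answer in the first case and away from it in the second, which is the sense in which crossing the "most frequent answer is correct" line changes the picture.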
Also it matters a lot whether the model routinely fails at ‘critical fail steps’ versus ‘cheap retry steps’. There are a lot of tricky wetlab steps where you can easily fail, and the model can provide only limited help, but the reason for failure is clear after a bit of internet research and doesn’t waste much in the way of resources. Such ‘noncritical failures’ are very different from ‘critical failures’ such as failing to obfuscate your internet purchases such that a government agency detects your activity and decides to investigate you more closely.
Thanks! How optimistic/excited would you be about research in the spirit of Tamper-Resistant Safeguards for Open-Weight LLMs, especially given that banning open-weight models seems politically unlikely, at least for now?
Extremely excited about the idea of such research succeeding in the near future! But skeptical that it will succeed in time to be at all relevant. So my overall expected value for that direction is low.
Also, I think there’s probably a very real risk that the bird has already flown the coop on this. If you can cheaply modify existing open-weight models to be ‘intent-aligned’ with terrorists, and to be competent at using scaffolding that you have built around ‘biological design tools’… then the LLM isn’t really a bottleneck anymore. The irreversible proliferation has occurred already. I’m not certain this is the case, but I’d give it about 75%.
So then you need to make sure that better biological design tools don’t get released, and that more infohazardous virology papers don’t get published, and that wetlab automation tech doesn’t get better, and… the big one… that nobody releases an open-weight LLM so capable that it can successfully create tailor-made biological design tools. That’s a harder thing to censor out of an LLM than getting it to directly not help with biological weapons! Creation of biological design tools touches on a lot more things, like its machine learning knowledge and coding skill. What exactly do you censor to make a model helpful at building purely-good tools but not at building dual-use tools?
Basically, I think it’s a low-return area entirely. I think humanity’s best bet is in generalized biodefenses, plus an international ‘Council of Guardians’ which uses strong tool-AI to monitor the entire world and enforce a ban on:
a) self-replicating weapons (e.g. bioweapons, nanotech)
b) unauthorized recursively self-improving AI
Of these threats, only bioweapons are currently at large. The others are future threats.