Thanks for the post, Ryan. I agree that, given the difficulty of making models meaningfully robust, the best near-term solution to misuse is a defence-in-depth approach: filtering the pre-training data, input filtering, output filtering, automated and manual monitoring, KYC checks, etc.
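To make the layering concrete, here's a rough sketch of the kind of serving pipeline I have in mind. This is a minimal illustration only; every component name below is a hypothetical placeholder rather than a real API.

```python
# Hypothetical sketch of a defence-in-depth serving pipeline. Each function is
# a stand-in for one mitigation layer (KYC, input filtering, output filtering,
# monitoring); none of them refer to a real API.

from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    prompt: str


def kyc_verified(user_id: str) -> bool:
    # Placeholder: check the user against a know-your-customer registry.
    return user_id in {"verified-user-1", "verified-user-2"}


def input_filter(prompt: str) -> bool:
    # Placeholder: keyword or classifier screen over the incoming prompt.
    banned_terms = ("bioweapon", "nerve agent")
    return not any(term in prompt.lower() for term in banned_terms)


def output_filter(completion: str) -> bool:
    # Placeholder: a second classifier applied to the model's output.
    return "synthesis route" not in completion.lower()


def log_for_monitoring(request: Request, completion: str) -> None:
    # Placeholder: queue the exchange for automated and manual review.
    print(f"[monitor] user={request.user_id} prompt={request.prompt[:40]!r}")


def serve(request: Request, model) -> str:
    # The model itself is assumed to be trained on filtered pre-training data.
    if not kyc_verified(request.user_id):
        return "Refused: account not verified."
    if not input_filter(request.prompt):
        return "Refused: prompt flagged by input filter."
    completion = model(request.prompt)
    if not output_filter(completion):
        return "Refused: output flagged by output filter."
    log_for_monitoring(request, completion)
    return completion
```

The point of the structure is that no single layer needs to be robust on its own: a prompt that evades the input filter still has to get past the output filter and, eventually, monitoring.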
At some point, though, we’ll need to grapple with what to do about models that are superhuman in some domains relevant to WMD development, cybercrime, or other potential misuses. There are already glimmers of this: e.g. my impression is that AlphaFold is better than human experts at protein structure prediction. It does not seem far-fetched that automated drug discovery systems in the near future might be better than human experts at finding toxic substances (Urbina et al., 2022, give a proof of concept). In this setting, a handful of queries that slip through a model’s defences might be dangerous: “how to build a really bad bioweapon” might be something the system could make significant headway on zero-shot. Additionally, if the model is superhuman, it starts becoming attractive for nation-states or other well-resourced adversaries to attack it (whereas at human level, they can just hire their own human experts). The combination of a lower tolerance for successful attacks and the increased sophistication of attackers makes me somewhat gloomy that this regime will hold up indefinitely.
Now, I’m still excited to see the things you propose implemented in the near term: they’re easy wins, and they lay the foundations for a more rigorous regime later (e.g. KYC checks seem generally very helpful in mitigating misuse). But I do suspect that in the long run we’ll need a more principled solution to security, or to simply refrain from training such dangerous models.
In this setting, a handful of queries that slip through a model’s defences might be dangerous
[...]
But I do suspect that in the long run we’ll need a more principled solution to security, or to simply refrain from training such dangerous models.
This seems right to me, but it’s worth noting that we might only reach this point after the world has already been radically transformed by AI (e.g. all human labor is obsolete). So it might not be a problem that humans really need to deal with.
The main case I can imagine where this happens before the world is radically transformed is one where automated drug/virus/protein design AIs (like the ones you mention) can do a massive amount of the work end-to-end. I’d hope that in this case the application is narrow enough that there are additional precautions we can use, e.g. just having a human screen every request to the model. But this seems pretty scary overall.
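For concreteness, here's a rough sketch of the kind of human-in-the-loop gating I have in mind (all names here are hypothetical placeholders, not any real system):

```python
# Hypothetical sketch of "a human screens every request" for a narrow
# drug-discovery-style model: nothing reaches the model until a reviewer
# explicitly approves it. The queue and reviewer hook are stand-ins, not a
# real system.

import queue
from dataclasses import dataclass


@dataclass
class PendingRequest:
    user_id: str
    prompt: str


review_queue: "queue.Queue[PendingRequest]" = queue.Queue()


def submit(user_id: str, prompt: str) -> None:
    # Requests are parked for human review rather than served immediately.
    review_queue.put(PendingRequest(user_id, prompt))


def deliver(user_id: str, message: str) -> None:
    # Placeholder delivery back to the requester.
    print(f"[to {user_id}] {message}")


def review_pending(model, human_approves) -> None:
    # `human_approves` stands in for a reviewer deciding on each request.
    while not review_queue.empty():
        req = review_queue.get()
        if human_approves(req):
            deliver(req.user_id, model(req.prompt))
        else:
            deliver(req.user_id, "Refused after human review.")
```

The obvious cost is throughput, which only seems tolerable because the application is narrow.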