In this setting, a handful of queries that slip through a model’s defences might be dangerous
[...]
But I do suspect that in the long run we’ll need a more principled solution to security, or simply refrain from training such dangerous models.
This seems right to me, but it’s worth noting that this point might occur after the world is already radically transformed by AI (e.g. all human labor is obsolete). So, it might not be a problem that humans really need to deal with.
The main case I can imagine where this happens prior to the world being radically transformed is the case where automatic drug/virus/protein outputting AIs (like you mentioned) can do a massive amount of the work end-to-end. I’d hope that for this case, the application is sufficiently narrow that there are additional precautions we can use, e.g. just have a human screen every request to the model. But this seems pretty scary overall.