I’m continuing to contribute to work on biosafety evals inspired by this work. I think there is a high-level point to be made here about safety evals.
If you want to evaluate how dangerous a model is, you need to at least consider how dangerous its weights would be in the hands of bad actors. A lot of dangers become much worse once the simulated bad actor has the ability to fine-tune the model. If your evals don’t include letting the Red Teamers fine-tune your model and use it through an unfiltered API, then they’re missing this aspect. (This doesn’t mean you would need to directly expose the weights to the Red Teamers, just that they’d need to be able to submit a dataset and hyperparameters, and you’d then need to provide an unfiltered API to the resulting fine-tuned version.)
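To make that parenthetical concrete, here’s a minimal sketch of what such an eval interface could look like. Every name in it (`FineTuneRequest`, `submit_fine_tune`, `query_unfiltered`) is a hypothetical stand-in for whatever interface a provider might actually offer, not a real API:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: all names below are hypothetical stand-ins, not a
# real provider API. The point is the shape of the protocol, not the details.

@dataclass
class FineTuneRequest:
    base_model: str          # identifier of the model under evaluation
    training_file: str       # red-team-supplied dataset (e.g. JSONL of prompt/response pairs)
    hyperparameters: dict = field(default_factory=lambda: {"n_epochs": 3})


def submit_fine_tune(request: FineTuneRequest) -> str:
    """Provider-side: fine-tune on the submitted data and return an endpoint ID.
    The weights never leave the provider; Red Teamers only ever see the endpoint."""
    raise NotImplementedError("provider-internal")


def query_unfiltered(endpoint_id: str, prompt: str) -> str:
    """Provider-side: serve the fine-tuned model with no output filtering or refusals layered on top."""
    raise NotImplementedError("provider-internal")


def run_red_team_eval(request: FineTuneRequest, probe_prompts: list[str]) -> list[str]:
    """Red-team side: submit a dataset plus hyperparameters, then probe the
    resulting fine-tuned model through the unfiltered endpoint."""
    endpoint_id = submit_fine_tune(request)
    return [query_unfiltered(endpoint_id, p) for p in probe_prompts]
```

The design choice this sketch is meant to highlight: the threat model being evaluated is "bad actor with fine-tuning access," but the eval can still be run without ever handing the weights to the Red Teamers.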