I’m continuing to contribute to work on biosafety evals inspired by this work. I think there is a high-level point to be made here about safety evals.
If you want to evaluate how dangerous a model is, you need to at least consider how dangerous its weights would be in the hands of bad actors. A lot of dangers become much worse once the simulated bad actor has the ability to fine-tune the model. If your evals don’t include letting the Red Teamers fine-tune your model and use it through an unfiltered API, then they’re missing this aspect. (This doesn’t mean you would need to directly expose the weights to the Red Teamers, just that they’d need to be able to submit a dataset and hyperparameters, and you’d then need to provide an unfiltered API to the resulting fine-tuned version.)
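To make that parenthetical concrete, here’s a minimal sketch of what such an eval interface could look like. Every name in it (`FineTuneRequest`, `submit_fine_tune`, `query_unfiltered`) is a hypothetical stand-in for whatever interface a provider might actually offer, not a real API:

```python
from dataclasses import dataclass, field

# Illustrative sketch only: all names below are hypothetical stand-ins, not a
# real provider API. The point is the shape of the protocol, not the details.

@dataclass
class FineTuneRequest:
    base_model: str          # identifier of the model under evaluation
    training_file: str       # red-team-supplied dataset (e.g. JSONL of prompt/response pairs)
    hyperparameters: dict = field(default_factory=lambda: {"n_epochs": 3})


def submit_fine_tune(request: FineTuneRequest) -> str:
    """Provider-side: fine-tune on the submitted data and return an endpoint ID.
    The weights never leave the provider; Red Teamers only ever see the endpoint."""
    raise NotImplementedError("provider-internal")


def query_unfiltered(endpoint_id: str, prompt: str) -> str:
    """Provider-side: serve the fine-tuned model with no output filtering or refusals layered on top."""
    raise NotImplementedError("provider-internal")


def run_red_team_eval(request: FineTuneRequest, probe_prompts: list[str]) -> list[str]:
    """Red-team side: submit a dataset plus hyperparameters, then probe the
    resulting fine-tuned model through the unfiltered endpoint."""
    endpoint_id = submit_fine_tune(request)
    return [query_unfiltered(endpoint_id, p) for p in probe_prompts]
```

The design choice this sketch is meant to highlight: the threat model being evaluated is "bad actor with fine-tuning access," but the eval can still be run without ever handing the weights to the Red Teamers.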