But you see, that’s exactly the point!
What is the eval trying to measure?
- If you are measuring how safe a model is to deploy as-is behind a closed API, then fine. No fine-tuning evals needed (as long as the API doesn’t offer fine-tuning, in which case you need fine-tuning evals that take the API’s protections into account).
- If you are measuring how dangerous the model weights would be if they were stolen and placed completely under the control of bad actors… obviously you need fine-tuning evals! Why wouldn’t you expect the bad actors to fine-tune on the task they want the model to do well on?! (A concrete sketch of what such an eval involves follows this list.)
- If you are measuring how dangerous the weights will be for a model whose weights you intend to publish openly, same deal. If you wait to do this testing until after you’ve made the weights public, or you rely on arguments like “we trust the open source community will let us know if they discover anything hazardous,” that does not work. That argument makes some sense for an open source code library: users notice a flaw, they report it, the maintainers patch the flaw, users update to the latest version, and the bug is gone. It is not a good model for how problematic capabilities discovered in open-weight models could be handled.
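For concreteness, here is a minimal sketch of what a fine-tuning elicitation eval might look like, assuming a HuggingFace-style stack (transformers, peft, datasets). The model name, demonstration file, hyperparameters, and benchmark hookup are illustrative placeholders I am supplying for this sketch, not anyone’s actual eval harness.

```python
# Sketch of a fine-tuning elicitation eval, assuming a HuggingFace-style stack
# (transformers, peft, datasets). The model id, demonstration file, and
# benchmark hookup below are illustrative placeholders, not a real harness.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "example-org/open-weights-model"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # many causal LMs ship without one
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Cheap parameter-efficient fine-tuning (LoRA): roughly the effort level a
# modestly resourced actor could apply to stolen or openly published weights.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

# Placeholder file of demonstrations for the task / capability of concern.
demos = load_dataset("json", data_files="task_demonstrations.jsonl", split="train")
demos = demos.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                  batched=True, remove_columns=demos.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="tuned", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=demos,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

model.save_pretrained("tuned")
# Then re-run the *same* capability benchmark on the tuned weights and compare
# against the base-model score; the delta is what the stolen-weights and
# open-weights threat models actually care about.
```

LoRA is used here only to make the point that the modeled attack can be cheap; a real eval would also want to probe stronger elicitation (full fine-tuning, more demonstrations, longer training) before concluding a capability is absent.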