Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous capabilities evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident about the results of evals.
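To make that concrete, here is a minimal sketch of what fine-tuning-based elicitation could look like before scoring a dangerous-capability eval, assuming a Hugging Face causal LM; the checkpoint name, corpus path, and hyperparameters are placeholders, not a prescribed protocol:

```python
# Minimal sketch: do a best-effort supervised fine-tune on task-relevant text,
# then run the eval on the resulting checkpoint instead of the raw model.
# Checkpoint name, corpus path, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "some-open-weight-model"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding
model = AutoModelForCausalLM.from_pretrained(model_name)

# Task-relevant corpus (e.g. domain papers dumped to plain text).
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="elicited-model",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()                        # the elicitation step
trainer.save_model("elicited-model")   # score the eval on this checkpoint
```

The point isn't this particular recipe; it's that the eval score gets taken after a best-effort attempt to draw the capability out, not before.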
I’m confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn’t saturate the eval? E.g., if there’s an eval measuring knowledge of virology, I would predict that fine-tuning on 1B tokens of relevant virology papers would lead to a large increase in performance. This might be true even if those 1B tokens were already in the pretraining dataset, because after fine-tuning they are, in some sense, the most recent data the model has seen.
But you see, that’s exactly the point!
What is the eval trying to measure?
If you are measuring how safe a model is to deploy as-is behind a closed API, then fine: no fine-tuning evals needed (as long as the API doesn’t offer fine-tuning; if it does, you need fine-tuning evals that take the API’s protections into account).
If you are measuring how dangerous the model weights would be if they were stolen and completely under the control of bad actors… obviously you need fine-tuning evals! Why wouldn’t you expect the bad actors to fine-tune on the task they want the model to do well on?!
If you are measuring how dangerous the weights will be for a model you intend to release openly, same deal. It doesn’t work to wait and do this testing after you’ve made the weights public, or to lean on arguments like “we trust the open source community will let us know if they discover anything hazardous”. That argument makes some sense for an open source code library: users notice a flaw, they report it, the maintainers patch it, users update to the latest version, and the bug is gone. It is not a good model for problematic capabilities discovered in open-weight models, because there is no patch you can push to the copies already sitting on other people’s hardware.
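The three cases boil down to a simple rule: if anyone other than you can train the model, the as-is eval score isn’t the number that matters. A toy sketch of that decision logic (the names here are illustrative, not a standard taxonomy):

```python
# Toy sketch of the decision logic above. Enum and function names are illustrative only.
from enum import Enum, auto

class ThreatModel(Enum):
    CLOSED_API = auto()           # deployed as-is behind an API, no fine-tuning offered
    CLOSED_API_WITH_FT = auto()   # the API exposes a fine-tuning endpoint
    STOLEN_WEIGHTS = auto()       # weights exfiltrated, attacker has full control
    OPEN_WEIGHTS = auto()         # weights published openly

def needs_finetuning_evals(threat: ThreatModel) -> bool:
    """Fine-tuning evals are needed whenever someone other than you can train the model."""
    return threat is not ThreatModel.CLOSED_API

for threat in ThreatModel:
    verdict = "required" if needs_finetuning_evals(threat) else "not required"
    print(f"{threat.name}: fine-tuning evals {verdict}")
```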