Curated.

The reasons I like this post:

it’s epistemically legible
it hedges appropriately: “That being said, I do think there are some cases where gradient hacking might be quite easy, e.g. cases where we give the model access to a database where it can record its pre-commitments or direct access to its own weights and the ability to modify them.”
it has direct, practical implications for e.g. regulatory proposals
it points out the critical fact that we’re missing the ability to evaluate for alignment given current techniques

Arguably missing is a line or two that backtracks from “we could try to get robust understanding via a non-behavioral source such as mechanistic interpretability evaluated throughout the course of training” to (my claim) “it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment, and we don’t actually know when we’re going to hit that threshold”, but that might be out of scope.

“it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment”

I mean, like I say in the post, if you have some strong reason to believe that there’s no gradient-hacking going on, then I think this is safe in the i.i.d. setting, and likewise for exploration hacking in the RL setting. You just have to have that strong reason somehow (which is maybe what you mean by saying we can evaluate them for alignment?).