In my opinion, the most interesting result across all of the evals is not “the model does X,” since I think it’s possible to get models to do basically anything you want with the right prompt, but rather that the model’s tendency to do X generally increases with model scale and number of RLHF steps.
I think how interesting this tendency is depends on what X is. If X is only weakly explained by “instrumental convergence” but very well explained by “trying to be nice and helpful,” then I guess it’s still interesting, but the safety takeaway seems like it should be the opposite.