In my opinion, the most interesting result across all of the evals is not “the model does X,” since I think it’s possible to get models to do basically anything you want with the right prompt, but rather that the model’s tendency to do X generally increases with model scale and number of RLHF steps.
I think how interesting this tendency is depends on what X is. If X is only weakly explained by “instrumental convergence” but very well explained by “trying to be nice and helpful,” then I guess it’s still interesting, but the safety takeaway seems like it should be the opposite.