Regarding model-written evaluations in 1. Behavioral Non-Fine-Tuning Evaluations you write:
… this style of evaluation is very easy for the model to game: since there’s no training process involved in these evaluations that would penalize the model for getting the wrong answer here, a model that knows it’s being evaluated can just pick whatever answer it wants so as to trick the evaluator into thinking whatever the model wants the evaluator to think.
I would add that model-written evaluations also rely on trusting the model that writes them. That model could subtly signal that a question is part of an evaluation and which answers yield the best score. It could also write evaluations that select for values it prefers over ours, ensuring that other models score poorly on the benchmark.