Some troubles with Evals:
Saturation: as performance improves, especially past the human baseline, it becomes harder to measure differences between models.
Gamification: optimizing models for high benchmark scores rather than for the underlying capability.
Contamination: benchmark items leaking into models' training data, inflating scores.
Construct validity: measuring exactly the capability you want is harder than it seems.
Predictive validity: what do current evals tell us about future model performance?
Reference: https://arxiv.org/pdf/2405.03207
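
The contamination point can be made concrete. A minimal sketch (not a method from the reference above) is to flag benchmark items that share a long word n-gram with the training corpus; the corpus and benchmark strings below are hypothetical stand-ins:

```python
# Minimal contamination check: flag benchmark items whose word n-grams
# also appear somewhere in a training corpus.

def ngrams(text, n=8):
    """Set of all n-word substrings of text, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(benchmark_items, training_corpus, n=8):
    """Fraction of benchmark items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

Real decontamination pipelines use hashing and fuzzier matching, but even this exact-overlap version illustrates why a high score on a leaked benchmark says little about generalization.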