This is the classic paper on model evals for dangerous capabilities.
On a skim, it’s aged well; I still agree with its recommendations and its framing of evals. One big exception: it recommends “alignment evaluations” to measure models’ propensity for misalignment, but such evals can’t provide much evidence against catastrophic misalignment. Until much better techniques for measuring misalignment appear, it’s better to assume AIs are misaligned and rely on control once dangerous capabilities emerge.