Some troubles with Evals:
Saturation: as models improve, especially once they surpass the human baseline, it becomes harder to measure differences between them.
Gamification: optimizing for high eval scores rather than for the underlying capability.
Contamination: benchmark items leaking into models' training data (a minimal overlap check is sketched after this list).
Construct validity: measuring exactly the capability you want to measure can be harder than you think.
Predictive validity: what do current evals tell us about future model performance?
Reference: https://arxiv.org/pdf/2405.03207
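As a toy illustration of the contamination point, here is a minimal sketch of the kind of n-gram overlap check commonly used to flag benchmark leakage. The function names, inputs, and the 13-gram window are illustrative assumptions, not the method of the referenced paper:

```python
# Minimal sketch of an n-gram overlap contamination check.
# `benchmark_items` and `training_docs` are placeholder inputs; the 13-gram
# window follows common practice but is an arbitrary choice here.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_items: list, training_docs: list, n: int = 13) -> list:
    """Return indices of benchmark items that share any n-gram with the training data."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & train_ngrams]
```

In practice one would deduplicate or hash the n-grams and stream over the corpus rather than hold it in memory, but the idea is the same: any long verbatim overlap between a benchmark item and the training data is treated as evidence of contamination.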
I agree with Lewis. A few clarificatory thoughts:
1. I think the point of calling it a category mistake is precisely about expecting a “nice simple description”. There will be something within the network, but there’s no reason to believe that this something will be a single neural analog.
2. Even if many properties do have single neural analogs, there’s no reason to expect that all the safety-relevant properties will have them.
3. Even if all the safety-relevant properties have them, there’s no reason to believe (at least for now) that we have the interp tools to find them in time, i.e., before we have systems fully capable of pulling off a deception plan.
So even if you don’t buy 1 and 2, it follows from 3 that we have to figure this out beforehand. I’m also worried that claims such as “we can make important forward progress on particular intentional states even in the absence of such a general account” could lead down a slippery slope toward more or less embracing building the dangerous thing first, without sufficient precautions (not that I’m saying you’re in favor of that), especially since many of the safety-relevant states seem to be interconnected.