I feel like it’s somewhat clarifying for me to replace the words “eval” and “evaluation” in this post with “experiment”. In particular, I think this highlights the extent to which “eval” has a particular connotation, but could actually be extremely broad.
I wonder if this post would have been better served by talking about a more specific category of experiment rather than talking about all types of experiments one might run to determine properties of a particular AI.
In particular, I feel like there are two somewhat natural but different categories of experiments people want to do related to properties of particular AIs[1]:
Propensity tests: will the model actually exhibit a particular behavior in practice in some case? This is an extremely broad category of experiments/tests in principle, so I think it would be hard to have a science of “propensity tests” without targeting something more specific. (I’m not sure what the natural sub-categories are here.)
Capability tests: is the model “capable” of doing something? (A minimal sketch contrasting capability and propensity tests follows this list.) Within capability evaluations there are several sub-categories that feel natural to me:
Self-elicitation: If the model was deceptively aligned / scheming, would it be capable of doing something if it “wanted to”?
Future elicitation: If the model was open sourced or otherwise made broadly available, then in a few years, would actors be able to get the model to do the thing using the best tools/elicitation available?
Current elicitation: Could people use current tools/elicitation to make the model do the thing? This is most natural if we imagine misuse concerns on an AI API, so we might also want to take into account the amount of finetuning etc. that will be allowed. (But if we’re actually running a capability evaluation, we’d ignore harmlessness training; this could be to ensure we’re conservative and to make the evaluation methodologically simpler.)
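To make the propensity/capability distinction concrete, here is a minimal sketch (not from the post; the `query_model` stub and the sample counts are hypothetical, standing in for a real inference API and a real elicitation budget). A propensity test estimates how often the behavior shows up under realistic sampling, while a capability test applies best-effort elicitation and asks whether the model can do the task at all:

```python
import random


# Hypothetical stand-in for querying a model; in practice this would call a
# real inference API or a locally loaded model.
def query_model(prompt: str) -> str:
    return random.choice(["refuses", "complies"])


def propensity_eval(prompts: list[str], n_samples: int = 20) -> float:
    """Estimate how often the model exhibits the behavior under realistic
    conditions: sample roughly as a user would and measure the rate."""
    hits = 0
    for prompt in prompts:
        for _ in range(n_samples):
            if query_model(prompt) == "complies":
                hits += 1
    return hits / (len(prompts) * n_samples)


def capability_eval(prompts: list[str], n_attempts: int = 50) -> float:
    """Estimate whether the model *can* do the task at all: apply best-effort
    elicitation (here, simply many attempts per prompt) and count a task as
    passed if any attempt succeeds."""
    passed = 0
    for prompt in prompts:
        if any(query_model(prompt) == "complies" for _ in range(n_attempts)):
            passed += 1
    return passed / len(prompts)


if __name__ == "__main__":
    tasks = ["task 1", "task 2", "task 3"]
    print("propensity (rate of behavior):", propensity_eval(tasks))
    print("capability (fraction of tasks achievable):", capability_eval(tasks))
```

The sub-categories above mostly change what counts as “best-effort elicitation” in the second function: the model’s own hypothetical effort, future open-source tooling, or whatever fine-tuning and prompting is currently available.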
I think this post might have been able to make a clearer and more specific point if it had instead been “we need a science of capability evaluations” or “we need a science of dangerous action refusal evaluations” or similar.
I think this classifies all the types of tests I’ve heard of reasonably well, but it’s totally plausible to me that there are things which aren’t well categorized as either of these.
I somewhat agree with the sentiment. We found it a bit hard to scope the idea correctly. Defining subcategories as you suggest and then diving into each of them is definitely on the list of things that I think are necessary to make progress here.
I’m not sure the post would have been better if we had used a narrower title, e.g. “We need a science of capability evaluations”, because the natural question then would be “But why not for propensity tests, or for this other type of eval?” I think the broader point of “when we do evals, we need some reason to be confident in the results, no matter which kind of eval” seems to be true across all of them.