A frame for thinking about capability evaluations: outer vs. inner evaluations
When people hear the phrase “capability evaluations”, I think they often picture something roughly like METR’s evaluations, where we test things like:
Can the AI buy a dress from Amazon?
Can the AI solve a sudoku?
Can the AI reverse engineer this binary file?
Can the AI replicate this ML paper?
Can the AI replicate autonomously?
(See more examples in METR’s repo of public tasks.)
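To make this concrete, here is a minimal sketch of what an eval in this first category looks like structurally: hand the model a task, let it do whatever it does, and score only the end result. The `Task`, `agent`, and `check_success` names are hypothetical placeholders for illustration, not METR’s actual tooling.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A single end-to-end task, e.g. 'buy a dress from Amazon'."""
    prompt: str
    check_success: Callable[[str], bool]  # scores only the final outcome

def run_outer_eval(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Treat the agent as a black box: feed it each task prompt,
    look only at the final transcript/outcome, and report the success rate."""
    successes = 0
    for task in tasks:
        transcript = agent(task.prompt)  # however the agent gets there
        successes += task.check_success(transcript)
    return successes / len(tasks)

# Hypothetical usage: `my_agent` is any prompt -> transcript function.
# score = run_outer_eval(my_agent, [Task("Solve this sudoku: ...", lambda t: "solved" in t)])
```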
In contrast, consider the following capabilities:
Is the AI situationally aware?
Can the AI do out-of-context reasoning?
Can the AI do introspection?
Can the AI do steganography?
Can the AI utilize filler tokens?
Can the AI obfuscate its internals?
Can the AI gradient hack?
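And, for contrast, here is a minimal sketch of what one probe in this second category might look like, using filler tokens as the example: compare accuracy when the model must answer immediately versus when it gets to emit uninformative tokens first. `ask_model` and the question set are hypothetical placeholders, and real experiments would control for much more.

```python
# A toy filler-token probe. `ask_model` is a hypothetical prompt -> answer
# function; `questions` is a list of (question, answer) string pairs.

def filler_token_probe(ask_model, questions, n_filler: int = 50) -> dict:
    """Compare accuracy when the model answers immediately vs. when it is
    told to emit meaningless filler tokens before answering."""
    def accuracy(make_prompt) -> float:
        correct = 0
        for question, answer in questions:
            reply = ask_model(make_prompt(question))
            correct += answer.lower() in reply.lower()
        return correct / len(questions)

    immediate = accuracy(lambda q: f"{q}\nAnswer immediately with just the answer.")
    filler = accuracy(lambda q: f"{q}\nOutput {n_filler} dots, then the answer.")

    # A gap in favour of filler would (weakly) suggest the model can use the
    # extra forward passes even though the filler tokens carry no information.
    return {"immediate": immediate, "with_filler": filler}
```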
There’s a real difference[1] between the first and second categories. A rough attempt at putting it into words: the first treats the AI as an indivisible atom and asks how it can affect the world, whereas the second treats the AI as having “an inner life” and asks how it can affect itself.
Hence the name “outer vs. inner evaluations”. (I think the alternative name “dualistic vs. embedded evaluations”, following dualistic and embedded notions of agency, gets closer to the distinction while being less snappy. Compare also to behavioral vs. cognitive psychology.)
It seems to me that the outer evaluations are better established: we have METR and the labs themselves doing such capability evaluations. There’s plenty of work on inner evaluations as well; the difference is that it’s more diffuse. (Maybe for good reason: it is tricky to do proper inner evals.)
I’ve gotten value out of this frame; it helps me not forget inner evals in the context of evaluating model capabilities.
Another difference is that in outer evals we are often interested in getting the most out of the model by ~any means, whereas in inner evals we might deliberately restrict the model’s action space. This difference might be best thought of as a separate axis, though.