But I also think that maybe there’s a small refinement of these definitions which saves most of the value for the purposes of planning for safety.
If you simply narrow the focus of the terms to “world optimization power”, then I think it basically holds.
Subskills, like working memory, may or may not be a part of a given entity being evaluated. The evaluation metric doesn’t care if it is just measuring the downstream output of world optimization.
Admittedly, this isn’t something we have particularly good measures of. It’s a wide target to try to cover with a static eval. But I still think it works conceptually, so long as we acknowledge that our measures of it are limited approximations.
Yeah, that’s a good point.
But I also think that maybe there’s a small refinement of these definitions which saves most of the value for the purposes of planning for safety.
If you simply narrow the focus of the terms to “world optimization power”, then I think it basically holds.
Subskills, like working memory, may or may not be a part of a given entity being evaluated. The evaluation metric doesn’t care if it is just measuring the downstream output of world optimization.
Admittedly, this isn’t something we have particularly good measures of. It’s a wide target to try to cover with a static eval. But I still think it works conceptually, so long as we acknowledge that our measures of it are limited approximations.