The specifications would correctly capture what-we-actually-mean, so they wouldn’t be prone to goodhart
I think there’s an ambiguity in “concept” here that’s important to clarify with respect to this hope. Humans use concepts in two ways:
1. as abstractions in themselves, like the idea of an ideal spring which contains its behavior within the mental object, and
2. as pointers / promissory notes towards the real objects, like “tree”.
Seems likely that any agent that has to attend to trees will form the ~unique concept of “tree”, in the sense of a cluster of things plus a minimal set of dimensions needed to specify the relevant behavior (height, hardness of wood, thickness, whatever). Some of this is like use (1): you can simulate some of the behavior of trees (e.g. how they’ll behave when you try to cut them down and use them to build a cabin). Some of this is like use (2): if you want to know how to grow trees better, you can navigate to instances of real trees, study them to gain further relevant abstractions, and then use those new abstractions (nutrient intake, etc.) to grow trees better.
So what do we mean by “strawberry”, such that it’s not goodhartable? We might mean “a thing that is relevantly naturally abstracted in the same way as a strawberry is relevantly naturally abstracted”. This seems less goodhartable if we use meaning (2), but that’s sort of cheating by pointing to “what we’d think of these strawberries upon much more reflection in many more contexts of relevance”. If we use meaning (1), that seems eminently goodhartable.