Do you think any reasonable extension of these kinds of ideas could get what we want?
Conditional on avoiding Goodhart, I think you could probably get something that looks a lot like a diamond maximizer. It might not be perfect: the situation with the “most diamond” might not be the maximum of its utility function, but I would expect the maximum of its utility function to still contain a very large amount of diamond. For instance, depending on the representation and the way the programmers baked in the utility function, it might have a quirk of only recognizing something as a diamond if it’s stereotypically “diamond shaped”. This would bar it from just building pure carbon planets to achieve its goal.
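To make the "quirk" concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `WorldState` fields, the three candidate worlds, and the kilogram figures are assumptions, not anything from the proposal being discussed. It only shows the shape of the argument: the agent's optimum differs from the true "most diamond" world, yet still contains a lot of diamond.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorldState:
    """Hypothetical, heavily simplified description of an outcome."""
    name: str
    cut_diamond_kg: float   # diamond in the stereotypical "diamond shape"
    bulk_diamond_kg: float  # diamond in any other shape, e.g. a solid planet


def true_diamond(state: WorldState) -> float:
    """What we actually care about: total diamond, regardless of shape."""
    return state.cut_diamond_kg + state.bulk_diamond_kg


def learned_utility(state: WorldState) -> float:
    """Assumed quirk: the agent only recognizes diamond-shaped diamond."""
    return state.cut_diamond_kg


# Illustrative candidate outcomes; the numbers are arbitrary.
candidates = [
    WorldState("status quo", cut_diamond_kg=1.0, bulk_diamond_kg=0.0),
    WorldState("gem factories", cut_diamond_kg=800.0, bulk_diamond_kg=0.0),
    WorldState("carbon planet", cut_diamond_kg=0.0, bulk_diamond_kg=1000.0),
]

best_for_agent = max(candidates, key=learned_utility)
best_for_us = max(candidates, key=true_diamond)

print("agent picks:", best_for_agent.name, true_diamond(best_for_agent), "kg")
print("true optimum:", best_for_us.name, true_diamond(best_for_us), "kg")
```

In this toy setup the agent never builds the carbon planet, because its utility function assigns that world zero value, but the world it does pick is still mostly diamond.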
IMO, you’d need something else outside of the ideas presented to get a “perfect” diamond maximizer.