This is a totally fine thing to post :) I agree with most of the things, and agree about their importance.
I think the Goodhart’s law side to CEV is more subtle. To be pithy, it’s like it doesn’t have a problem with Goodhart’s law yet because it’s not specific enough to even get noticed by Goodhart. If CEV hypothetically considers doing something bad, we can just reassure ourselves that surely that’s not what our ideal advisor would have wanted. It’s only once we pick a specific method of implementation that we have to confront in mechanistic detail what we could previously hide under the abstraction of anthropomorphic agency.
It’s only once we pick a specific method of implementation that we have to confront in mechanistic detail what we could previously hide under the abstraction of anthropomorphic agency.
I agree. I was trying to think of possible implementation methods, throwing out various constraints like computing power or competitiveness as it became harder to find any, and the final sticking point was still Goodhart’s Law. For the most part, I kept it in to give an example to the difficulty of meta-alignment (corrigibility in favourable directions).
This is a totally fine thing to post :) I agree with most of the things, and agree about their importance.
I think the Goodhart’s law side to CEV is more subtle. To be pithy, it’s like it doesn’t have a problem with Goodhart’s law yet because it’s not specific enough to even get noticed by Goodhart. If CEV hypothetically considers doing something bad, we can just reassure ourselves that surely that’s not what our ideal advisor would have wanted. It’s only once we pick a specific method of implementation that we have to confront in mechanistic detail what we could previously hide under the abstraction of anthropomorphic agency.
I agree. I was trying to think of possible implementation methods, throwing out various constraints like computing power or competitiveness as it became harder to find any, and the final sticking point was still Goodhart’s Law. For the most part, I kept it in to give an example to the difficulty of meta-alignment (corrigibility in favourable directions).