Yes, but:
If you were trying to design something that acts like agent 2 while stuck in the mindset of “it must be maximizing some utility function, so let’s just think in utility-function terms,” you might find it difficult.
(Side point) I’m not sure how much the arguments in Eliezer’s linked post actually apply outside the consequentialist context, so I’m not convinced that coherence necessarily implies a possible utility function for non-consequentialist agents.
It might be that the closest thing to what we want that we can actually figure out how to build isn’t coherent. In that case we would face a choice between:
(a) making it and hoping that its likely self-modification towards coherence won’t ruin its alignment, or
(b) making something else that is coherent to start with but is less aligned.
While (a) is risky, (b) seems worse to me.