Optimizing an arbitrary somewhat-but-not-perfectly-right utility function gives rise to serious Goodhart-related concerns.
One thing I’ve been thinking about recently is: why does this happen? Could we have predicted the general phenomenon in advance, without imagining individual scenarios? What aspect of the structure of optimal goal pursuit in an environment reliably produces this result?
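To make the phenomenon concrete, here is a toy sketch (not from the original argument) of the simplest "regressional Goodhart" setup: assume the proxy utility is the true utility plus independent noise, and watch what happens as we apply more selection pressure to the proxy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy assumption: each candidate action has a true utility, and the proxy
# we actually optimize is that true utility plus independent noise.
n_candidates = 100_000
true_utility = rng.normal(size=n_candidates)
proxy_utility = true_utility + rng.normal(size=n_candidates)

# More optimization pressure = selecting from a smaller top-k of the proxy.
for k in [10_000, 1_000, 100, 10, 1]:
    top_by_proxy = np.argsort(proxy_utility)[-k:]
    print(
        f"top {k:>6} by proxy: "
        f"mean proxy = {proxy_utility[top_by_proxy].mean():.2f}, "
        f"mean true  = {true_utility[top_by_proxy].mean():.2f}"
    )

# The gap between proxy and true utility widens as selection gets harder:
# the most aggressively optimized candidates are exactly those whose proxy
# score most overstates their true utility.
```

This toy model only captures one flavor of Goodhart failure (selection amplifying proxy error), so it motivates rather than answers the structural question above.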