Just to self-promote, the idea that alignment means precisely matching a specific utility function is one of the main things I wrote Reducing Goodhart to address. The point that this is incompatible with humans being physical systems should be reached in the first two posts.
Though I’ve gotten some good feedback since writing it and am going to do some rewriting (including explaining that “reducing” has multiple meanings), so if you find it hard to muddle through, check back in a month.