I’m mostly arguing against the naive framing where humans are assumed to have a utility function, and then we can tell how well the AI is doing by comparing the results to the actual utility (the “True Values”). The big question is: how do you formally talk about misalignment without assuming some such unique standard to judge the results by?
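To make that naive framing concrete (a rough sketch, with notation of my own rather than anything the framing's proponents commit to): suppose humans have a true utility function $V$ over outcomes and the AI optimizes a proxy $U$. The AI picks some $\hat{a} \in \arg\max_a U(a)$, and its misalignment is then the regret measured against the True Values, $\max_a V(a) - V(\hat{a})$. My point is that this whole setup presupposes a unique $V$ to plug in.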
Hmm, but I feel like you're claiming that this framing is wrong while actually arguing that it's too difficult to apply to be useful. Which is confusing.
I still agree that your big question is interesting, though.
Thanks, this is useful feedback on how I need to be clearer about what I'm claiming :) In October I'm going to be refining these posts a bit; would you be available to chat sometime?
Glad I could help! I'm going to comment more on your follow-up post in the next few days, and then I'm interested in having a call. We can also talk then about how I want to present Goodhart as an impossibility result in a textbook project. ;)