That was quite a stimulating post! It pushed me to actually work through the cloud of confusion surrounding these questions in my mind, hopefully leaving me with a better picture now.
First, I was confused about your point on True Values, starting with what you even meant by the term. If I understand correctly, you're talking about a class of parametrized models of humans: the agent/goal-directed model, parametrized by something like the beliefs and desires of Dennett's intentional stance, with some additional non-formalized subtleties, like the fact that desires/utilities/goals can't just describe exactly what the system does, but must be in some sense compressed and sparse.
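Just to make explicit the kind of model class I have in mind (a minimal sketch of my own, not your formalism; it assumes Boltzmann-rational/softmax action choice, and every name in it is mine):

```python
# A minimal "beliefs + desires" agent model, assuming Boltzmann-rational
# (softmax) action choice. All names here are mine, for illustration only.
import numpy as np

def action_probs(beliefs, desires, outcome_index, beta=1.0):
    """Predict a distribution over actions from intentional-stance parameters.

    beliefs:       shape (n_states,), probability distribution over world states
    desires:       shape (n_outcomes,), utility assigned to each outcome
    outcome_index: shape (n_actions, n_states), integer array giving the outcome
                   reached by taking action a in state s
    beta:          rationality parameter (0 = uniformly random, large = near-optimal)
    """
    n_actions = outcome_index.shape[0]
    expected_utility = np.array(
        [np.dot(beliefs, desires[outcome_index[a]]) for a in range(n_actions)]
    )
    logits = beta * expected_utility
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    return probs / probs.sum()
```

The "compressed and sparse" requirement would then be an extra constraint on desires (say, few non-zero entries or a short description), so that the parameters can't simply memorize the behavior they are meant to explain.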
Now, there's a pretty trivial sense in which there are no True Values for the parameters: because this model class lacks realizability, no parameter setting describes the human we want to predict exactly and perfectly. That sounds completely uncontroversial to me, but also boring.
Your claim, as I understand it, is the stronger one: that there are no parameters for which this model comes even close to predicting the human well enough. Is that correct?
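To make the distinction concrete (my notation, not anything from the post): write $p_\theta$ for the behavior predicted by the agent model with parameters $\theta$ (beliefs and desires), $p_{\text{human}}$ for the human's actual behavior, and $d$ for some prediction loss or divergence. The trivial non-realizability claim is
$$\inf_{\theta \in \Theta} d\left(p_\theta, p_{\text{human}}\right) > 0,$$
while the claim I'm attributing to you is the much stronger
$$\inf_{\theta \in \Theta} d\left(p_\theta, p_{\text{human}}\right) > \varepsilon \quad \text{for some non-negligible } \varepsilon.$$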
Assuming for the moment that it is, this post doesn't really argue for that point in my opinion; instead, it argues for the difficulty of inferring such good parameters, if they existed. For example, this part:
how to resolve inconsistencies and conflicting dynamics.
how to extrapolate the inferred preferences into new and different contexts.
There is no single privileged way to do all these things, and different choices can give very different results
is really about inference: none of these points makes it impossible for a good parameter to exist; they only argue for the difficulty of finding/defining one.
Note that I'm not saying what you're doing with this sequence is wrong; looking at Goodhart from a different perspective, especially one that tries to dissolve some of these inference difficulties, sounds valuable to me.
Another thing I like about this post is that you made me realize why the application of Goodhart's law to AI risk doesn't require the existence of True Values: it's an impossibility result, and when proving an impossibility, the more you assume the better. Goodhart is about the difficulty of using proxies in the best-case scenario where good parameters do exist. It's about showing the risk and danger in just "finding the right values", even in the best world where True Values exist. So if there are no True Values, the difficulty doesn't disappear; it gets even worse (or at the very least different).
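To illustrate what I mean by "the difficulty of using proxies even in the best case", here is a toy simulation of the standard regressional-Goodhart setup (my own example, not something from your post): the True Values exist, the proxy is honestly correlated with them, and hard optimization of the proxy still falls short of the true optimum.

```python
# Toy regressional Goodhart: true values exist and the proxy is well correlated
# with them, yet selecting the proxy-maximal option partly selects for the
# proxy's error term rather than for true value.
import numpy as np

rng = np.random.default_rng(0)
n_options = 100_000

true_value = rng.normal(size=n_options)           # the "True Values", assumed to exist
proxy = true_value + rng.normal(size=n_options)   # a noisy but genuinely correlated proxy

chosen = np.argmax(proxy)                          # hard optimization of the proxy
print("true value of the proxy-optimal option:", true_value[chosen])
print("best achievable true value:            ", true_value.max())
# The proxy-optimal option reliably scores below the true optimum, and the gap
# grows (slowly) as the number of options, i.e. the optimization pressure, increases.
```

And this is the friendly case; if there are no True Values to regress toward in the first place, it's not even clear what the analogue of true_value in this toy should be.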
I’m mostly arguing against the naive framing where humans are assumed to have a utility function, and then we can tell how well the AI is doing by comparing the results to the actual utility (the “True Values”). The big question is: how do you formally talk about misalignment without assuming some such unique standard to judge the results by?
Hmm, but I feel like you're claiming that this framing is wrong while arguing that it is too difficult to apply to be useful, which is confusing.
I still agree that your big question is interesting, though.
Thanks, this is useful feedback on how I need to be clearer about what I'm claiming :) In October I'm going to be refining these posts a bit; would you be available to chat sometime?
Glad I could help! I'm going to comment more on your follow-up post over the next few days or next week, and then I'd be interested in having a call. We can also talk then about the way I want to present Goodhart as an impossibility result in a textbook project. ;)