The problem is that the ‘human interpretation module’ might give the wrong results. For instance, if the AI convinces people that X is morally obligatory, the module might interpret that as X actually being morally obligatory. It is not obvious to me that a better model would help here; it probably depends on what the original AI wants to do.
The module is supposed to be a predictive model of what humans mean or expect, rather than something that “convinces” or does anything like that.
I know, but my point is that such a model might be very perverse, such as “Humans do not expect to find out that you presented misleading information” rather than “Humans do not expect you to present misleading information.”
You’re right. This issue can come up with “predicting human behaviour”, if the AI is sneaky enough. It wouldn’t come up with “compare human models of the world to reality”. So there are subtle nuances there to dig into...
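To make that distinction concrete, a minimal illustrative sketch, assuming a toy setup in which a report is a list of claims, each flagged for whether it actually misleads and whether humans would ever detect that (the Claim fields and penalty functions below are illustrative assumptions, not anything specified above):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    misleading: bool   # the claim misrepresents the world
    detectable: bool   # humans would eventually find out


def intended_penalty(report: List[Claim]) -> int:
    """Penalise presenting misleading information at all."""
    return sum(1 for c in report if c.misleading)


def perverse_penalty(report: List[Claim]) -> int:
    """Penalise only the misleading information humans would find out about."""
    return sum(1 for c in report if c.misleading and c.detectable)


# A sneaky report: one undetectable lie and one detectable lie.
report = [Claim(misleading=True, detectable=False),
          Claim(misleading=True, detectable=True)]

print(intended_penalty(report))  # 2: both lies are penalised
print(perverse_penalty(report))  # 1: the undetectable lie is "free"
```

Under the second penalty the optimum is to mislead only in undetectable ways, which is exactly the sneaky “predicting human behaviour” failure mode above.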