3 is the general problem of AIs behaving badly. This approach is supposed to avoid that by constructing a “human interpretation module” that is maximally accurate, and then using that module plus human instructions as the motivation of the AI.
Basically I’m using a lot of the module approach (and the “false miracle” stuff to get counterfactuals): the AI that builds the human interpretation module will build it for the purpose of making it accurate, and the one that uses it will have it as part of its motivation. The old problems may rear their heads again if the process is ongoing, but “module X” + “human instructions” + “module X’s interpretation of human instructions” seems rather solid as a one-off initial motivation.
The problem is that the ‘human interpretation module’ might give the wrong results. For instance, if it convinces people that X is morally obligatory, it might interpret that as X being morally obligatory. It is not entirely obvious to me that it would be useful to have a better model. It probably depends on what the original AI wants to do.
The module is supposed to be a predictive model of what humans mean or expect, rather than something that “convinces” or does anything like that.
I know, but my point is that such a model might be very perverse. It might learn “Humans do not expect to find out that you presented misleading information” rather than “Humans do not expect that you present misleading information”.
You’re right. This thing can come up in terms of “predicting human behaviour”, if the AI is sneaky enough. It wouldn’t come up in “compare human models of the world to reality”. So there are subtle nuances there to dig into...
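To make the “predicting human behaviour” worry concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Episode` record, the two candidate readings and the four data points are assumptions, not part of any actual proposal. It only shows that if humans can react solely to deception they discover, a purely behavioural accuracy score favours the perverse reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One hypothetical interaction between the AI and a human."""
    misled: bool      # the AI presented misleading information
    discovered: bool  # the human later found out
    objected: bool    # the human's observed reaction


def objects_if_discovered(ep: Episode) -> bool:
    """Perverse reading: humans only mind deception they find out about."""
    return ep.misled and ep.discovered


def objects_if_misled(ep: Episode) -> bool:
    """Intended reading: humans mind being misled at all."""
    return ep.misled


def accuracy(model, episodes) -> float:
    """Fraction of episodes where the model predicts the observed reaction."""
    return sum(model(ep) == ep.objected for ep in episodes) / len(episodes)


# Assumed toy data: humans cannot object to deception they never discover.
episodes = [
    Episode(misled=False, discovered=False, objected=False),
    Episode(misled=True, discovered=True, objected=True),
    Episode(misled=True, discovered=False, objected=False),
    Episode(misled=True, discovered=False, objected=False),
]

perverse = accuracy(objects_if_discovered, episodes)
intended = accuracy(objects_if_misled, episodes)
print(f"perverse reading: {perverse:.2f}, intended reading: {intended:.2f}")
```

On this data the perverse reading scores higher on behaviour alone, and a sneakier AI (more undiscovered deception) widens the gap. A criterion along the lines of “compare human models of the world to reality” would need some other signal than observed reactions, which this sketch does not attempt to model.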