Problem one can be addressed by only allowing certain questions/orders to be given.
Problem two is a real problem, with no solution currently.
Problem three sounds like it isn’t a problem—the initial model the AI has of a human is not of a wireheaded human (though it is of a wireheadable human). What exactly did you have in mind?
Which leads to the obvious question of whether figuring out the rules about the questions is much simpler than figuring out the rules for morality. Do you have a specific, simple class of questions/orders in mind?
Yes, but it seems to me that your approach depends on an ‘immoral’ system: simulating humans in too much detail. In other cases, one might attempt to make a nonperson predicate and eliminate all models that fail it, or something along those lines. However, your idea seems to depend on simulated humans.
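To make the nonperson-predicate alternative concrete, here is a minimal Python sketch. The model class, the detail threshold, and the predicate itself are made-up illustrations, not anything from the actual proposal:

```python
from dataclasses import dataclass

# Arbitrary illustrative cutoff; in reality nobody knows where this line falls.
PERSONHOOD_DETAIL_THRESHOLD = 1_000_000

@dataclass
class HumanModel:
    name: str
    detail_level: int  # crude proxy for how fine-grained the simulation is

def might_be_person(model: HumanModel) -> bool:
    """Conservative nonperson predicate: only returns False when the model
    is clearly too coarse to instantiate a person."""
    return model.detail_level >= PERSONHOOD_DETAIL_THRESHOLD

def admissible_models(candidates: list[HumanModel]) -> list[HumanModel]:
    """Discard every candidate model that fails the nonperson predicate."""
    return [m for m in candidates if not might_be_person(m)]

models = [HumanModel("coarse preference model", 10),
          HumanModel("full brain emulation", 10**9)]
print([m.name for m in admissible_models(models)])  # only the coarse model survives
```

The point of the sketch is just the shape of the alternative: filter the hypothesis space by a safety predicate, rather than relying on detailed simulated humans at all.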
Well, it depends on how the model of the human works and how it is asked questions. That would probably depend a lot on how the original AI structured the model of the human, and we don’t currently have any AIs to test that with. The point is, though, that in certain cases the AI might compromise the human, for instance by wireheading it or convincing it of a religion or something, and then the compromised human might command destructive things. There’s a huge amount of hidden trickiness, such as determining how to give the human correct information on which to decide, and so on.
3 is the general problem of AIs behaving badly. The way that this approach is supposed to avoid that is by constructing a “human interpretation module” that is maximally accurate, and then using that module + human instructions as the motivation of the AI.
Basically I’m using a lot of the module approach (and the “false miracle” stuff to get counterfactuals): the AI that builds the human interpretation module will build it for the purpose of making it accurate, and the one that uses it will have it as part of its motivation. The old problems may rear their heads again if the process is ongoing, but “module X” + “human instructions” + “module X’s interpretation of human instructions” seems rather solid as a one-off initial motivation.
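As a very rough sketch of that one-off composition (every name and class below is my own invention, purely to show the structure, not the actual design): the interpretation module is built once, for accuracy, and the second AI’s motivation is fixed from that frozen module plus the instructions at construction time, so it never gets to re-interpret them.

```python
# Illustrative sketch of "module X + human instructions +
# module X's interpretation of the instructions" as a one-off motivation.
# All names here are hypothetical; nothing is a real API.

class InterpretationModule:
    """Built by the first AI, whose only goal is predictive accuracy about
    what humans mean; frozen before it is handed over."""
    def interpret(self, instruction: str) -> str:
        # Placeholder: map an instruction to the goal humans actually intend.
        return f"intended-goal({instruction})"

class MotivatedAI:
    """Second AI: its motivation is fixed at construction time from the
    frozen module plus the human instructions (one-off, not ongoing)."""
    def __init__(self, module: InterpretationModule, instructions: list[str]):
        # The interpretation is computed once and stored, so the AI cannot
        # profit later from corrupting the module or the humans.
        self.goals = [module.interpret(i) for i in instructions]

    def utility(self, outcome: set[str]) -> float:
        # Placeholder: score an outcome against the frozen interpreted goals.
        return sum(1.0 for g in self.goals if g in outcome)

ai = MotivatedAI(InterpretationModule(), ["cure the disease"])
print(ai.goals)                                        # frozen interpreted goals
print(ai.utility({"intended-goal(cure the disease)"}))  # 1.0
```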
The problem is that the ‘human interpretation module’ might give the wrong results. For instance, if it convinces people that X is morally obligatory, it might interpret that as X being morally obligatory. It is not entirely obvious to me that it would be useful to have a better model. It probably depends on what the original AI wants to do.
The module is supposed to be a predictive model of what humans mean or expect, rather than something that “convinces” or does anything like that.
I know, but my point is that such a model might be very perverse, such as “Humans do not expect to find out that you presented misleading information.” rather than “Humans do not expect that you present misleading information.”
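A toy way to see those two readings come apart (the world model below is entirely made up): they disagree exactly on worlds where misleading information is presented but the deception is never discovered.

```python
# Toy illustration: the perverse reading penalises being *caught*,
# the intended reading penalises the deception itself.

def humans_find_out(world: dict) -> bool:
    return world["misleading_info_presented"] and world["deception_discovered"]

def misleading_info_presented(world: dict) -> bool:
    return world["misleading_info_presented"]

def perverse_ok(world: dict) -> bool:
    return not humans_find_out(world)

def intended_ok(world: dict) -> bool:
    return not misleading_info_presented(world)

undetected_deception = {"misleading_info_presented": True,
                        "deception_discovered": False}

print(perverse_ok(undetected_deception))   # True  -- the perverse model is satisfied
print(intended_ok(undetected_deception))   # False -- the intended model is not
```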
You’re right. This thing can come up in terms of “predicting human behaviour”, if the AI is sneaky enough. It wouldn’t come up in “compare human models of the world to reality”. So there are subtle nuances there to dig into...