So if I understand correctly, the problem with the naive proposal is something like this: We tell our AI to develop a cure for cancer while minimizing side effects. The AI cures cancer, but it keeps the cure a secret because if it told us the cure, that would create the side effect of us curing a bunch of people. We can’t just tell the AI to minimize side effects prior to task completion, because then it could set up a time bomb that goes off and generates lots of side effects after the task is complete.
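A minimal toy sketch of that last point (the plan representation and numbers are made up for illustration, not anything from the proposal itself): if the penalty only sums side effects that land before the completion step, a plan that schedules a huge effect for afterwards looks exactly as cheap as a genuinely low-impact one.

```python
# Hypothetical toy setup: a "plan" records when it completes and a list of
# (timestep, magnitude) side effects. The penalty only counts effects that
# occur strictly before the completion step.

def penalty_before_completion(plan):
    """Sum side-effect magnitudes that occur before the task is marked complete."""
    completion_step = plan["completes_at"]
    return sum(
        magnitude
        for step, magnitude in plan["side_effects"]
        if step < completion_step
    )

clean_plan = {
    "completes_at": 10,
    "side_effects": [(3, 0.1)],             # small effect during the task
}
time_bomb_plan = {
    "completes_at": 10,
    "side_effects": [(3, 0.1), (50, 100)],  # huge effect scheduled after completion
}

# Both plans receive the same tiny penalty, so the pre-completion cutoff
# does nothing to discourage the "time bomb".
assert penalty_before_completion(clean_plan) == penalty_before_completion(time_bomb_plan)
```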
Another way to put the problem: We’d like for the AI to be corrigible and also minimize side effects. Suppose the AI forecasts that its actions will motivate humans to take drastic action, with a large impact on the world, in order to interfere. A corrigible AI shouldn’t work to stop this outcome. But a side-effect-minimizing AI might decide to manipulate humans so they don’t take drastic action. (This example seems a bit contrived, because if corrigibility is working properly, you should be able to just use the off switch, and using the off switch doesn’t seem all that high-impact?) Anyway, a possible way to address this issue would be to learn an impact measure that rates manipulating humans as a very high-impact action?
The AI cures cancer, but it keeps the cure a secret because if it told us the cure, that would create the side effect of us curing a bunch of people.
Yes, if we told it to develop a cure, it might avoid letting us cure people in order to minimize impact (although I think there are even less benign failure modes that would be more likely to occur).
Regarding the second framing: perhaps a side effect minimizer using a naive counterfactual would do that, yes. The problem with treating “manipulation” as high-impact is robustly defining manipulation; there are heavy value connotations tied up with “free will” there.
The way I would put it is that an agent using the naive counterfactual plus whitelisting tries to stop other people from doing things that could lead to side effects, effectively enforcing the impact measure on all actors. This is obviously terrible. Assuming agency allows for a solution* like the one I outline here.
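A minimal toy sketch of that failure mode, under my own simplified reading (the transition classes, whitelist contents, and helper function below are hypothetical, and this is not the linked solution): if the penalty simply counts every observed non-whitelisted transition, a bystander’s actions get charged to the agent, so the agent can score better by stopping the bystander.

```python
# Hypothetical toy setup: world changes are (before, after) transition classes,
# and the penalty counts every observed transition not on the whitelist,
# regardless of which actor caused it.

def whitelist_penalty(observed_transitions, whitelist):
    """Penalize each observed non-whitelisted transition, no matter who caused it."""
    return sum(1 for t in observed_transitions if t not in whitelist)

whitelist = {("cancer_cells", "dead_cancer_cells")}

observed = [
    ("cancer_cells", "dead_cancer_cells"),  # caused by the agent: whitelisted
    ("blue_house", "red_house"),            # a human repainting their own house
]

# Penalty 1: the human's repainting counts against the agent, so the agent is
# incentivized to prevent the human from acting, i.e. the impact measure is
# enforced on all actors.
print(whitelist_penalty(observed, whitelist))  # -> 1
```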