I was slightly confused by the beginning of the post, but by the end I was on board with the questions asked and the problems posed.
On impact measures, there’s already some discussion in this comment thread, but I’ll put some more thoughts here. My first reaction on reading the last section was to think of attainable utility: non-manipulation as preservation of attainable utility. Sitting on this idea, I’m not sure it works as a non-manipulation condition, since it lets the AI manipulate us into having what we want. There should be no risk of it changing our utility, since that would be a big change in attainable utility; but still, we might not want to be manipulated even for our own good (like some people’s reactions to nudges).
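To make “preservation of attainable utility” a bit more concrete, here is a minimal sketch, loosely adapted from AUP-style penalties but applied to our attainable utility rather than the AI’s (the symbols and the particular baseline choice are mine):

$$\text{Penalty}_{AU}(s,a) \;=\; \frac{1}{|\mathcal{R}|} \sum_{R_i \in \mathcal{R}} \max\!\big(0,\; V^{H}_{R_i}(s) - V^{H}_{R_i}(s')\big)$$

where $\mathcal{R}$ is a set of auxiliary reward functions, $V^{H}_{R_i}(\cdot)$ is how much of $R_i$ we (the humans) could still attain from a given state, and $s'$ is the state after the AI takes action $a$ in $s$. Nothing in this term constrains how the AI keeps our attainable utility high, which is exactly where the manipulation worry comes in.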
Maybe there can be an alternative version of attainable utility, something like “attainable choice”, which ensures that other agents (us included) are still able to make choices. Or, to put it in terms of free will: that these agents’ choices are still primarily determined by internal causes (that is, by them), rather than primarily determined by external causes like the AI.
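One very rough, hypothetical way to cash out “attainable choice” (entirely a sketch, with invented notation): penalize the AI for how much its action, compared to it doing nothing, shifts what another agent $H$ ends up doing:

$$\text{Penalty}_{AC}(s,a) \;=\; D_{\mathrm{KL}}\!\big(\pi_H(\cdot \mid s') \,\big\|\, \pi_H(\cdot \mid s'_{\varnothing})\big)$$

where $s'$ is the state after the AI’s action and $s'_{\varnothing}$ the state after a no-op. If the AI’s action barely changes what $H$ does relative to the no-op baseline, then $H$’s choice is still mostly driven by internal causes; a large shift means the AI has become the main external cause. (A complementary version could instead track whether $H$’s set of reachable options stays open, to capture the “still able to make choices” part.)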
We can even imagine integrating attainable utility and attainable choice (by weighting them, for example), so that manipulation is avoided in a lot of cases, but the AI still manipulates Petrov into not reporting if not reporting saves the world (because that maintains attainable utility). So it would solve the issue mentioned in this comment thread.
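Concretely, the integration could just be a weighted sum of the two sketches above (with the caveat that the weights are doing a lot of work and would need real care):

$$\text{Penalty}(s,a) \;=\; \lambda_{AU}\,\text{Penalty}_{AU}(s,a) \;+\; \lambda_{AC}\,\text{Penalty}_{AC}(s,a)$$

In the Petrov case, not manipulating means the world gets destroyed, which is a huge $\text{Penalty}_{AU}$ (our attainable utility collapses), while manipulating only costs some $\text{Penalty}_{AC}$; with sensible weights the AI still intervenes. In ordinary situations, where our attainable utility is barely at stake, the $\lambda_{AC}$ term dominates and manipulation is avoided.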
(I have a big google doc analyzing corrigibility & manipulation from the attainable utility landscape frame; I’ll link it here when the post goes up on LW)
When do you plan on posting this? I’m interested in reading it.
Ideally within the next month!