So “no manipulation” or “maintaining human free will” seems to require a form of indifference: we want the AI to know how its actions affect our decisions, but not take that influence into account when choosing those actions.
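To make that concrete, here's a rough toy sketch of the kind of indifference I have in mind (every name and the little example below is made up for illustration, not an established construction): the AI keeps its model of how actions influence the human, but it evaluates candidate actions against the human response predicted under some fixed baseline action, so that influence can't earn it anything.

```python
def choose_indifferently(actions, predict_human, utility, baseline):
    """Toy sketch of that indifference: the AI has a model (predict_human) of
    how each action would shift the human's decision, but it scores actions
    against the response predicted for a fixed baseline action, so its
    influence on the human's decision never earns it any utility."""
    frozen_response = predict_human(baseline)  # what the human would do absent the AI's chosen action
    return max(actions, key=lambda a: utility(a, frozen_response))

# Hypothetical usage: "nudge" would steer the human straight to a purchase the
# AI scores highly, but under the frozen response it gains nothing from that.
actions = ["inform", "nudge", "stay_silent"]
predict_human = lambda action: "buys_blue" if action == "nudge" else "deliberates"
utility = lambda action, response: (3 if response == "buys_blue" else 1) + \
    {"inform": 0.5, "nudge": 0.0, "stay_silent": 0.1}[action]
print(choose_indifferently(actions, predict_human, utility, baseline="stay_silent"))
# -> "inform"; an AI optimizing over its actual influence would pick "nudge" instead.
```

Obviously this leans entirely on the choice of baseline, which smuggles the hard part back in.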
Two thoughts.
One, this seems likely to have some overlap with notions of impact and impact measures.
Two, it seems like there’s no real way to eliminate manipulation in a very broad sense, because we’d expect our AI to be causally entangled with the human, so there’s no action the AI could take that would not influence the human in some way. Deciding whether there is manipulation seems to require making a choice about what kinds of changes in the human’s behavior matter, similar to the problems we face in specifying values or defining concepts.
Not Stuart, but I agree there’s overlap here. Personally, I think of manipulation as when an agent’s policy steers the human into taking a certain kind of action, in a way that’s robust to the human’s counterfactual preferences. Like if I’m choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I’m probably being manipulated. A non-manipulative AI would act in a way that increases my knowledge and lets me condition my actions on my preferences.
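To gesture at this a bit more concretely, here’s a toy sketch (the tiny shoe “environment” and all the names are made up for illustration, not a serious proposal): hold the AI’s policy fixed, sweep over counterfactual preferences the human might have had, and flag the policy if the human ends up making the same choice no matter what they wanted.

```python
from collections import Counter

def outcome_given_preferences(ai_policy, human_preferences):
    """Toy model: the AI offers advice, then the human picks a shoe color by
    combining their own preferences with that advice."""
    advice = ai_policy(human_preferences)  # the AI gets to condition on the human's preferences
    scores = {color: human_preferences.get(color, 0) + advice.get(color, 0)
              for color in ("red", "blue", "green")}
    return max(scores, key=scores.get)     # the human picks the highest-scoring option

def looks_manipulative(ai_policy, counterfactual_preferences):
    """Flag the policy if the human's final choice is the same across every
    counterfactual preference, i.e. the outcome no longer tracks what the
    human wanted."""
    outcomes = Counter(outcome_given_preferences(ai_policy, prefs)
                       for prefs in counterfactual_preferences)
    return len(outcomes) == 1

# A policy that pushes blue hard no matter what the human wanted,
# versus one that adds only a small amount of advice:
pushy = lambda prefs: {"blue": 100}
helpful = lambda prefs: {"blue": 0.1}

prefs_grid = [{"red": 3}, {"blue": 3}, {"green": 3}]
print(looks_manipulative(pushy, prefs_grid))    # True:  blue wins whatever the human preferred
print(looks_manipulative(helpful, prefs_grid))  # False: the choice still tracks the preferences
```

This only checks invariance of the final choice, so it’s a crude proxy for the picture above, but it’s the kind of test I have in mind.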
Like if I’m choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I’m probably being manipulated.
Manipulation 101: tell people “We only have blue shoes in stock. Take it or leave it.”
EDIT: This example was intentionally chosen because it could be true. How do we distinguish between ‘effects of the truth’ and ‘manipulation’?
Speculative: things we see as maladaptive (why “resist the truth” when “it is never rational to do so”?) may exist because of the difficulty we have distinguishing the two.
Hmm, I see some problems here.
By looking for manipulation on the basis of counterfactuals, you’re at the mercy of your ability to find such counterfactuals, and that ability can itself be manipulated: you might fail to notice either the object-level counterfactuals that would make you suspect manipulation, or the counterfactuals about your own counterfactual reasoning that would make you suspect it. That seems like an insufficiently robust way to detect manipulation, or even to define it, since the mechanism for detecting it can itself be manipulated into not noticing what would otherwise have counted as manipulation.
Perhaps my point is to generally express doubt that we can cleanly detect manipulation outside the context of human behavioral norms. I suspect the cognitive machinery that implements those norms is malleable enough that it can be manipulated into not noticing what it would previously have counted as manipulation. Nor is it clear this is always bad, since in some cases we might be mistaken, in some sense, about what is really manipulative, though that raises the further problem that it’s not clear what it means to be mistaken about normative claims.
OK, but there’s a difference between “here’s a definition of manipulation that’s so waterproof you couldn’t break it if you optimized against it with arbitrarily large optimization power” and “here’s my current best way of thinking about manipulation.” I was presenting the latter, because it helps me be less confused than if I just stuck to my previous gut-level, intuitive understanding of manipulation.
Edit: Put otherwise, I was replying more to your point (1) than your point (2) in the original comment. Sorry for the ambiguity!
I agree. The important part of cases 5 & 6, where some other agent “manipulates” Petrov, is that suddenly, to us human readers, it seems like the protagonist of the story (and we do model it as a story) is the cook/kidnapper, not Petrov.
I’m fine with the AI choosing actions using a model of the world that includes me. I’m not fine with it supplanting me from my agent-shaped place in the story I tell about my life.