How much can value learning be disentangled?
In the context of whether the definition of human values can be disentangled from the process of approximating/implementing that definition, David asks me:
But I think it’s reasonable to assume (within the bounds of a discussion) that there is a non-terrible way (in principle) to specify things like “manipulation”. So do you disagree?
I think it’s a really good question, and its answer touches on a lot of related issues, so I’m putting it here as a top-level post. My current feeling, contrary to my previous intuitions, is that things like “manipulation” might not be possible to specify in a way that leads to useful disentanglement.
Why manipulate?
First of all, we should ask why an AI would be tempted to manipulate us in the first place. It may be that it needs us to do something for it to accomplish its goal; in that case it is trying to manipulate our actions. Or maybe its goal includes something that cashes out as our mental states; in that case, it is trying to manipulate our mental states directly.
The problem is that any reasonable friendly AI would have our mental states as part of its goal—it would at least want us to be happy rather than miserable. And (almost) any AI that wasn’t perfectly indifferent to our actions would be trying to manipulate us just to get its goals accomplished.
So manipulation is to be expected from most AI designs, friendly or not.
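To make that incentive concrete, here is a minimal toy sketch (every name and weight in it is hypothetical, invented purely for illustration): an objective with one term that depends on what humans do and one term that depends on how humans feel. Unless both weights are zero, the action that scores best is whichever one most shifts the human, so the objective itself rewards influencing us.

```python
# Toy sketch only: a utility with a term over outcomes reached through
# human actions and a term over human mental states (e.g. happiness).
def ai_objective(outcome_value: float, human_wellbeing: float,
                 w_outcome: float = 1.0, w_mental: float = 1.0) -> float:
    """Depends on what humans do (outcome_value) and how they feel (human_wellbeing)."""
    return w_outcome * outcome_value + w_mental * human_wellbeing

def best_action(actions, predict_human_response):
    """Pick the action whose predicted effect on the human scores highest.
    predict_human_response(action) -> (outcome_value, human_wellbeing).
    Unless the AI is indifferent to both terms, actions are selected
    by how well they steer the human."""
    return max(actions, key=lambda a: ai_objective(*predict_human_response(a)))
```

The point is purely structural: the only AI with no such incentive is one where both weights are zero, i.e. one that is perfectly indifferent to our actions and our mental states.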
Manipulation versus explanation
Well, since the urge to manipulate is expected to be present, could we just rule it out? The problem is that we need to define the difference between manipulation and explanation.
Suppose I am fully aligned/corrigible/nice or whatever other properties you might desire, and I want to inform you of something important and relevant. In doing so, especially if I am more intelligent than you, I will simplify, I will omit irrelevant details, I will omit arguably relevant details, I will emphasise things that help you get a better understanding of my position, and de-emphasise things that will just confuse you.
And these are exactly the same sorts of behaviours that a smart manipulator would use. Nor can we define the difference as whether the AI is truthful or not. We want human understanding of the problem, not truth. It’s perfectly possible to manipulate people while telling them nothing but the truth. And if the AI chooses the order in which it presents the true facts, it can manipulate people while presenting the whole truth as well as nothing but the truth.
It seems that the only difference between manipulation and explanation is whether we end up with a better understanding of the situation. And measuring understanding is very subtle. And even if we measure it right, note that we have now motivated the AI to… aim for a particular set of mental states. We are rewarding it for manipulating us. This runs contrary to the standard understanding of manipulation, which focuses on the means, not the end result.
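Here is a minimal sketch of that last point (the quiz-based measure of understanding, and all the names in it, are my own hypothetical stand-ins, not a proposal from this post): if the AI is rewarded by how well the human scores on some comprehension measure afterwards, that reward is a function of the human’s resulting mental state, so maximising it just is steering that mental state, and any gap between the measure and genuine understanding gets exploited.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Question:
    text: str
    correct: str

def understanding_reward(human_answers: Dict[str, str], quiz: List[Question]) -> float:
    """Fraction of comprehension questions the human now answers correctly;
    'human_answers' stands in for (a measurement of) the human's mental state."""
    return sum(human_answers.get(q.text) == q.correct for q in quiz) / len(quiz)

def choose_message(candidates: List[str],
                   simulate_human: Callable[[str], Dict[str, str]],
                   quiz: List[Question]) -> str:
    """The AI picks whichever message produces the highest-scoring
    post-interaction mental state: optimisation over our minds."""
    return max(candidates,
               key=lambda m: understanding_reward(simulate_human(m), quiz))
```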
Bad behaviour and good values
Does this mean that the situation is completely hopeless? No. There are certain manipulative practices that we might choose to ban. Especially if the AI is limited in capability at some level, this would force it to follow behaviours that are less likely to be manipulative.
Essentially, there is no sharp boundary between manipulation and explanation, but there is a difference between extreme manipulation and explanation, so ruling out the former can help (or maybe not).
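Concretely, ruling out the former might look something like the following sketch (the banned-practice labels and the tactic classifier are placeholders I have invented for illustration, and the hard part, classifying tactics reliably, is assumed away): veto actions that match a fixed list of egregious tactics, and let the AI optimise only over what remains.

```python
# Placeholder labels for egregious, clearly-banned tactics.
BANNED_PRACTICES = {"blackmail", "fabricated_evidence", "threat"}

def classify_tactics(action) -> set:
    """Stand-in for some (imperfect) classifier of which tactics an action uses."""
    return set(getattr(action, "tactics", ()))

def permitted(actions):
    """Filter out any action that uses a banned practice."""
    return [a for a in actions if not (classify_tactics(a) & BANNED_PRACTICES)]

def choose_action(actions, score):
    """Optimise only over the actions that survive the ban."""
    allowed = permitted(actions)
    return max(allowed, key=score) if allowed else None
```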
The other thing that can be done is to ensure that the AI has values close to ours. The closer the AI’s values are to ours, the less manipulation it will need to use, and the less egregious that manipulation will be. Between partial value convergence, ruling out specific practices, and maybe some physical constraints, we may be able to get an AI that is very unlikely to manipulate us much.
Incidentally, I feel the same way about low-impact approaches. I think the fully general problem, an AI that is low impact but value-agnostic, is impossible. But if the AI’s values are better aligned with ours, and the AI is more physically constrained, then low impact becomes easier to define.