Agents That Learn From Human Behavior Can’t Learn Human Values That Humans Haven’t Learned Yet
[Epistemic status: ¯\_(ツ)_/¯ ]
Armstrong and Mindermann write about a no free lunch theorem for inverse reinforcement learning (IRL): the same action can reflect many different combinations of values and (irrational) planning algorithms.
I think that even if humans were fully rational expected utility maximizers, there would still be an important underdetermination problem with IRL and with every other approach that infers human preferences from actual behavior. This is probably obvious if and only if it’s correct, and I don’t know if any non-straw people disagree, but I’ll expand on it anyway.
Consider two rational expected utility maximizing humans, Alice and Bob.
Alice is, herself, a value learner. She wants to maximize her true utility function, but she doesn’t know what it is, so in practice she uses a probability distribution over several possible utility functions to decide how to act.
If Alice received further information (from a moral philosopher, maybe), she’d start maximizing a specific one of those utility functions instead. But we’ll assume that her information stays the same while her utility function is being inferred, and she’s not doing anything to get more; perhaps she’s not in a position to.
Bob, on the other hand, isn’t a value learner. He knows what his utility function is: it’s a weighted sum of the same several utility functions. The relative weights in this mix happen to be identical to Alice’s relative probabilities.
Alice and Bob will act the same. They’ll maximize the same linear combination of utility functions, for different reasons. But if you could find out more than Alice knows about her true utility function, then you’d act differently if you wanted to truly help Alice than if you wanted to truly help Bob.
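To make the Alice and Bob comparison concrete, here is a minimal sketch in Python. The candidate utility functions, the weights, and the function names (`best_action`, `help_alice`, `help_bob`) are all made up for illustration and are not from the post. It only shows that the two agents pick the same action from the same mix, while a helper who knows which candidate is Alice's true utility function would act differently for each of them.

```python
# Hypothetical numbers, chosen only for illustration.
# Each row is one candidate utility function: the utility of each of four actions.
CANDIDATES = [
    [1.0, 0.0, 0.5, 0.2],
    [0.0, 1.0, 0.5, 0.3],
    [0.2, 0.1, 0.9, 1.0],
]
# Alice reads these as probabilities over candidates; Bob reads them as fixed weights.
WEIGHTS = [0.5, 0.3, 0.2]


def best_action(weights, candidates):
    """Index of the action that maximizes the weighted sum of the candidates."""
    n_actions = len(candidates[0])
    scores = [
        sum(w * u[a] for w, u in zip(weights, candidates))
        for a in range(n_actions)
    ]
    return max(range(n_actions), key=scores.__getitem__)


def help_alice(true_index, candidates):
    """A helper who knows Alice's true utility function maximizes that one alone."""
    one_hot = [1.0 if i == true_index else 0.0 for i in range(len(candidates))]
    return best_action(one_hot, candidates)


def help_bob(mix, candidates):
    """Bob's true utility function is the mix itself, so the helper maximizes the mix."""
    return best_action(mix, candidates)


# Observed behavior is identical: both maximize the same linear combination.
alice_choice = best_action(WEIGHTS, CANDIDATES)
bob_choice = best_action(WEIGHTS, CANDIDATES)

# Suppose (as an assumption) that candidate 0 is in fact Alice's true utility function.
for_alice = help_alice(0, CANDIDATES)
for_bob = help_bob(WEIGHTS, CANDIDATES)
```

With these numbers, Alice and Bob choose the same action, but the helper's best action for Alice differs from its best action for Bob, and nothing in the observed choice says which case the helper is in.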
So in some cases, it’s not enough to look at how humans behave. Humans are Alice on some points and Bob on some points. Figuring out details will require explicitly addressing human moral uncertainty.
How Alice would respond to further moral information is the key fact about her behavior, the one that distinguishes it from Bob’s behavior, so the question is whether an AI can learn that fact.
Of course the AI could if it ever observed Alice in a situation where she learned anything about morality.
Or any case that has any mutual information with how Alice would respond to moral facts. (For a sufficiently smart reasoner that includes everything—e.g. watching Alice eat breakfast gives you lots of general information about her brain, which in turn lets you make better predictions about how she would behave in other cases.)
And of course the AI would tend to create situations where Alice learned moral facts, since that’s a very natural response to uncertainty about how she’d respond to moral facts.
So overall it seems like you’d have to restrict the behavior of the IRL agent quite far before this becomes a problem.
Yeah. In general, if we want a machine to learn from data as well as we can, we need to give it a prior that’s as good as ours. And there’s no guarantee that such a prior can itself be learned from data, because we didn’t learn it from data (a lot of it came with our brain structure). We can try giving the machine more data, but we don’t know how much or what kind of data would be enough.
I think that this might not end up being a problem if the value learning agent can communicate with Alice (e.g. in the context of CIRL). If the agent doesn’t get any info from moral philosophers, then it should probably maximise something like the expectation of her utility function, for the same reason that Alice does. If it does get info, it can just give Alice that info, see what she does, and act accordingly. I think the real problem arises in the realistic case where Alice isn’t handling moral uncertainty perfectly, so the value learning agent shouldn’t actually maximise the weighted sum of the utility functions she’s uncertain over.
Huh, not sure why I didn’t say this when I first read this post, but there is a difference between Alice and Bob—Alice will seek out information about her utility function, while Bob will not.
Certainly any value learning method will have to account for the fact that humans do not in fact know their own values, but it’s not the case that such behavior is indistinguishable from behavior that maximizes a utility function.
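I meant to assume that away, by stipulating that Alice’s information stays the same while her utility function is being inferred and that she isn’t doing anything to get more.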
In cases where you’re not in a position to get more information about your utility function (e.g. because the humans you’re interacting with don’t know the answer), your behavior won’t depend on whether or not you think it would be useful to have more information about your utility function, so someone observing your behavior can’t infer the latter from the former.
Maybe practical cases aren’t like this, but it seems to me like they’d only have to be like this with respect to at least one aspect of the utility function for it to be a problem.
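One way to picture this point is with a toy value-of-information calculation. The sketch below is only an illustration under assumptions of my own: the numbers, the `info_cost` parameter, and the function names are hypothetical. A value learner like Alice assigns positive value to learning which candidate utility function is the true one, while Bob assigns it zero, yet when no source of information is available the two make the same observable choice.

```python
# Hypothetical numbers: each row is one candidate utility function,
# giving the utility of each of four actions.
CANDIDATES = [
    [1.0, 0.0, 0.5, 0.2],
    [0.0, 1.0, 0.5, 0.3],
    [0.2, 0.1, 0.9, 1.0],
]
WEIGHTS = [0.5, 0.3, 0.2]  # Alice's credences, and Bob's fixed mix


def mixed_scores(weights, candidates):
    """Score of each action under the weighted mix of candidates."""
    n_actions = len(candidates[0])
    return [
        sum(w * u[a] for w, u in zip(weights, candidates))
        for a in range(n_actions)
    ]


def value_of_information(weights, candidates, is_value_learner):
    """Expected gain from learning which candidate is the true one."""
    if not is_value_learner:
        # Bob's utility function already is the mix, so there is nothing to learn.
        return 0.0
    best_now = max(mixed_scores(weights, candidates))
    best_after = sum(w * max(u) for w, u in zip(weights, candidates))
    return best_after - best_now


def choose(weights, candidates, is_value_learner, info_available, info_cost=0.1):
    """Return "ask" if seeking moral information is possible and worth its
    (assumed) cost, otherwise the index of the best object-level action."""
    voi = value_of_information(weights, candidates, is_value_learner)
    if info_available and voi > info_cost:
        return "ask"
    scores = mixed_scores(weights, candidates)
    return max(range(len(scores)), key=scores.__getitem__)


alice_voi = value_of_information(WEIGHTS, CANDIDATES, is_value_learner=True)
bob_voi = value_of_information(WEIGHTS, CANDIDATES, is_value_learner=False)

# With an information source, the two come apart.
alice_with_info = choose(WEIGHTS, CANDIDATES, True, info_available=True)
bob_with_info = choose(WEIGHTS, CANDIDATES, False, info_available=True)

# Without one, their behavior is identical despite the different values of information.
alice_without_info = choose(WEIGHTS, CANDIDATES, True, info_available=False)
bob_without_info = choose(WEIGHTS, CANDIDATES, False, info_available=False)
```

In this toy model the difference between Alice and Bob lives entirely in `value_of_information`, which never influences the chosen action when `info_available` is false.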
Paul above seems to think it would be possible to reason from actual behavior to counterfactual behavior anyway, I guess because he’s thinking in terms of modeling the agent as a physical system and not just as an agent, but I’m confused about that so I haven’t responded and I don’t claim he’s wrong.
Oh yeah, I agree with Paul’s comment and it’s saying the same thing as what I’m saying. Didn’t see it because I was reading on the Alignment Forum instead of LessWrong. I’ve moved that comment to the Alignment Forum now.
There seems to be an assumption in this post that a value learner will be learning utility functions directly; and since utility functions are something which is associated with behavior, this framing leads to a focus on learning utility functions from behavior, and hence this post.
It seems to me that a value learner shouldn’t try to learn any given individual’s utility functions directly; rather it should first learn the psychological content corresponding to values, and then construct utility functions out of that. Among other positive features, this would allow a value learner to predict how the human would behave in a situation which the human hadn’t been exposed to yet (or even one which was totally alien to the human’s current conceptual landscape).
Here’s how I’d summarize my disagreement with the main claim: Alice is not acting rationally in your thought experiment if she acts like Bob (under some reasonable assumptions). In particular, she is doing pure exploitation and zero (value-)exploration by just maximizing her current weighted sum. For example, she should be reading philosophy papers.
See my reply to Rohin above—I wasn’t very clear about it in the OP, but I meant to consider questions where the AI knows no philosophy papers etc. are available.
There is one more fact that IRL ignores: humans often have conflicting values. For example, if I want a cake very much but also have a strong inclination to diet, I will do nothing. I then have two values that exactly cancel each other and have zero net effect on behaviour, so observing behaviour alone gives no clue about them. More complex examples are possible, where conflicting values create inconsistent behaviour, which is very typical of biological humans.
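The cake example can be written as a tiny non-identifiability sketch. The decision rule and the numbers below are assumptions made up for illustration, not a claim about how any IRL algorithm models people. Several quite different value profiles produce exactly the same observed behaviour, so the behaviour pins down at most the net pull and not the two underlying values.

```python
def eats_cake(desire_for_cake, commitment_to_diet):
    """Hypothetical decision rule: act only if the net pull toward cake is positive."""
    return desire_for_cake - commitment_to_diet > 0


# Hypothetical value profiles: (desire for cake, commitment to diet).
PROFILES = {
    "strong but conflicted": (5.0, 5.0),
    "mild but conflicted": (1.0, 1.0),
    "indifferent": (0.0, 0.0),
}

# Observed behaviour for each profile.
behaviour = {name: eats_cake(*values) for name, values in PROFILES.items()}

# All profiles behave identically, so the observation cannot tell them apart.
indistinguishable = len(set(behaviour.values())) == 1
```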
I’m not sure IRL actually ignores this, although in such a case the value learning agent may never converge on a consistent policy.