Using lying to detect human values
In my current research, I’ve often found myself re-discovering things that are trivial and obvious, but that suddenly become mysterious. For instance, it’s blindingly obvious that the anchoring bias is a bias, and almost everyone agrees on this. But this becomes puzzling when we realise that there is no principled way of deducing the rationality and reward of an irrational agent.
Here’s another puzzle. Have you ever seen someone try to claim that they have certain values that they manifestly don’t have? Seen their facial expression, their grimaces, their hesitation, and so on?
There’s an immediate and trivial explanation: they’re lying, and they’re doing it badly (which is why we can actually detect the lying). But remember that there is no way of detecting the preferences of an irrational agent. How can someone lie about their values, when those values are essentially non-existent? Even if someone knew their own values, why would the tell-tale signs of lying surface, given that nobody else could ever check those values, even in principle?
But here evolution is helping us. Humans have a self-model of their own values; indeed, this is what we use to define what those values are. And evolution, being lazy, re-uses the self-model to interpret others. Since these self-models are broadly similar from person to person, people tend to agree about the rationality and values of other humans.
So, because of these self-models, our own values “feel” like facts. And because evolution is lazy, lying and telling the truth about our own values triggers the same responses as lying or telling the truth about facts.
This suggests another way of accessing the self-model of human values: train an AI to detect human lying and misdirection on factual matters, then feed that AI a whole corpus of human moral/value/preference statements. Given the normative assumption that lying about facts resembles lying about values, this is another avenue by which AIs can learn human values.
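To make this concrete, here is a minimal toy sketch in Python (using scikit-learn) of what such a pipeline might look like: fit a classifier on factual statements labelled as sincere or deceptive, then score value statements with the same model. The example corpora, the text-only features (a real lie detector would presumably also use tone, facial expressions, hesitation, and so on), and the choice of model are all illustrative assumptions, not part of the proposal itself.

```python
# Toy sketch: train a "lie detector" on labelled factual statements,
# then apply it to value statements. All data below is a hypothetical
# placeholder for a real labelled corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical corpus: factual statements labelled 1 if the speaker was
# lying, 0 if sincere (e.g. from interrogation or game transcripts).
factual_statements = [
    "I have never met that man",
    "The package arrived on Tuesday",
]
is_lie = [1, 0]

# Fit a simple text classifier as a stand-in for the lie detector.
detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
detector.fit(factual_statements, is_lie)

# Apply the same detector to value/preference statements, relying on the
# normative assumption that insincerity looks the same in both domains.
value_statements = [
    "I care more about animals than about economic growth",
]
sincerity_scores = detector.predict_proba(value_statements)[:, 0]  # P(sincere)
print(sincerity_scores)
```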
So far, I’ve been assuming that human values are a single, definite object. In my next post, I’ll look at the messy reality of under-defined and contradictory human values.
A. Revealed preference. B. The duality of “I want to sleep on time, but I also want to play video games”: someone can “want to sleep” but also want another thing, and do that other thing instead.
I’m suspicious. Is there a reference for this claim? It seems that for this to be true, we at least need to be very precise about what we mean by “preferences”.
Pretty sure that he meant to say “an irrational agent” instead of “a rational agent”, see https://arxiv.org/abs/1712.05812
Indeed! I’ve now corrected that error.
If someone claims to have a value that they obviously don’t have, it doesn’t mean that they are lying, that is, consciously presenting wrong information. In most cases of such behaviour that I’ve observed, they truly believed that they were kind, animal-loving, or whatever other positive kind of person, and non-consciously ignored the instances of their own behaviour that demonstrated a different set of preferences, strikingly obvious to external observers.
“I value animals” is pretty worthless; “I value animals more than economic growth” would be more informative.