Overall I think this piece is great and does a nice job of intuitively explaining ways our attempts to model human values can fail. I notice a bit of friction when I read this part, though:
How do we choose between the theory that Bob values smoking and the theory that he does not (but smokes anyway because of the powerful addiction)? Humans choose between these theories based on our experience with addictive behaviours and our insights into people’s preferences and values. This kind of insight can’t easily be captured as formal assumptions about a model, or even as a criterion about counterfactual generalization. (The theory that Bob values smoking does make accurate predictions across a wide range of counterfactuals.) Because of this, learning human values from IRL has a more profound kind of model mis-specification than the examples in Jacob’s previous post. Even in the limit of data generated from an infinite series of random counterfactual scenarios, standard IRL algorithms would not infer someone’s true values.
I see this kind of thing often in people's thinking: they intuitively have a sense that a person can seem to value something and yet not really value it, because they don't endorse that value. I think this is a confused view, though. It arises from our phenomenology of values and preferences when we are also identified with (subject to) those values and preferences. Not liking what we see in ourselves, we then construct an ontology on which some values and preferences are not endorsed: we would prefer they were otherwise, yet find ourselves acting on them anyway.
This sets up an interesting dialectic. On one hand we have the very real, felt experience of wanting to do one thing (say, not smoke), doing the other (smoking anyway), and feeling as if the action is not really what we want to do, as if we are not being "authentic" to our "true" or real self. On the other hand, we have the very real sense in which behavior gives us information about values and preferences, and that behavior suggests that despite what we say ("I don't want to smoke"), we don't act on it. We might partly attribute this to a lack of reflective equilibrium producing an irrational preference ordering, although I think that abstracts away most of the interesting human psychology that produces the result. Anyway, I point this out because I think there is a useful synthesis that gets us past these two conflicting approaches, which otherwise get in the way of understanding human values: it's correct that we prefer, in this example, to smoke rather than not smoke, but it's also true that we believe we prefer not to smoke rather than smoke, and this is only a problem insofar as our model assumes that our preferences match our beliefs.
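To make that synthesis concrete, here is a minimal sketch in Python (the agent, its numbers, and the toy inference procedure are all my own illustration, not anything from the post): an agent whose actions are driven by one utility ordering while its self-reports follow a different, believed ordering. A naive revealed-preference inference recovers the behavioral ordering no matter how much behavior it observes, so a model that additionally assumes the reported ordering matches the behavioral one is mis-specified for this agent.

```python
import random

ACTIONS = ["smoke", "abstain"]

class BobLikeAgent:
    """Hypothetical agent: actions follow one ordering, reports another."""

    # Behavioral preferences: what actually drives action selection.
    behavioral_utility = {"smoke": 1.0, "abstain": 0.2}
    # Believed preferences: what the agent reports when asked.
    believed_utility = {"smoke": 0.1, "abstain": 1.0}

    def act(self) -> str:
        # Behavior is (noisily) driven by behavioral utility, not by beliefs.
        weights = [self.behavioral_utility[a] for a in ACTIONS]
        return random.choices(ACTIONS, weights=weights)[0]

    def report_preference(self) -> str:
        # Self-report reflects the believed ordering instead.
        return max(ACTIONS, key=self.believed_utility.get)

def revealed_preference(agent: BobLikeAgent, n: int = 10_000) -> str:
    """Naive revealed-preference inference: rank actions by observed frequency."""
    counts = {a: 0 for a in ACTIONS}
    for _ in range(n):
        counts[agent.act()] += 1
    return max(ACTIONS, key=counts.get)

agent = BobLikeAgent()
print(revealed_preference(agent))  # "smoke": the ordering behavior reveals
print(agent.report_preference())   # "abstain": the ordering the agent believes
```

The point is not the particular numbers but the shape of the model: the mis-specification lives in the assumed identity between the two orderings, not in a shortage of behavioral data.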
Now of course our beliefs can change our preferences, but that sounds confusing if we talk only in terms of beliefs and preferences, because a preference would seem to be a special kind of belief, one about an ordering over actions. I think this shows that "beliefs" and "preferences" are a leaky abstraction. To resolve this we have to look a bit deeper, probably in the direction of Friston.