habryka answers Can coherent extrapolated volition be estimated with Inverse Reinforcement Learning?

habryka 15 Apr 2019 18:25 UTC
15 points
Did you read Rohin Shah’s value learning sequence? It covers this whole area in a good amount of detail, and I think answers your question pretty straightforwardly:
Existing error models for inverse reinforcement learning tend to be very simple, ranging from Gaussian noise in observations of the expert’s behavior or sensor readings, to the assumption that the expert’s choices are randomized with a bias towards better actions.
In fact humans are not rational agents with some noise on top. Our decisions are the product of a complicated mess of interacting processes, optimized by evolution for the reproduction of our children’s children. It’s not clear there is any good answer to what a “perfect” human would do. If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.
I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.
We can’t use normal AI techniques to learn this kind of model, either — what is it that makes a model good or bad? The standard view — “more accurate models are better” — is fine as long as your goal is just to emulate human performance. But this view doesn’t provide guidance about how to separate the “good” part of human decisions from the “bad” part.
Here is a link to the full sequence: https://www.lesswrong.com/s/4dHMdK5TLN6xcqtyc
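To make the second kind of error model in the quote concrete, here is a minimal sketch of a "choices are randomized with a bias towards better actions" model (often called Boltzmann rationality), plus the reward inference it licenses. Everything in it is an illustrative assumption rather than anything from the sequence: the function names, the two-action setup, and the choice counts are made up. It only shows why such a simple model attributes every regularity in behaviour to the reward, including habits the person would disown.

```python
import math

def boltzmann_policy(rewards, beta=1.0):
    """Choice probabilities proportional to exp(beta * reward)."""
    top = max(rewards.values())  # subtract the max for numerical stability
    weights = {a: math.exp(beta * (r - top)) for a, r in rewards.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def infer_rewards(choice_counts, beta=1.0):
    """Maximum-likelihood rewards under the Boltzmann model.

    Assumes a single repeated choice situation and nonzero counts.
    In that case the fitted probabilities equal the observed frequencies,
    so reward = log(frequency) / beta, defined only up to an additive
    constant. Here the most-chosen action is pinned to reward 0.
    """
    total = sum(choice_counts.values())
    logs = {a: math.log(c / total) / beta for a, c in choice_counts.items()}
    top = max(logs.values())
    return {a: v - top for a, v in logs.items()}

# Hypothetical data: someone who says they would rather have the apple,
# but out of habit buys peanut M&Ms 70 times out of 100.
observed = {"peanut_mms": 70, "apple": 30}
inferred = infer_rewards(observed)

# The model has no way to represent "did it, but did not want to",
# so the habit is read off as a preference.
print(inferred)
print(boltzmann_policy(inferred))
```

The inferred rewards rank the M&Ms above the apple, and feeding them back through the policy reproduces the observed frequencies exactly. The model fits the behaviour perfectly and still gets the person's wishes wrong, which is the gap the quoted passage is pointing at.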
- Rohin Shah 16 Apr 2019 17:33 UTC
  3 points
  Parent
  Fwiw the quoted section was written by Paul Christiano, and I have used that blog post in my sequence (with permission).
  Also, for this particular question you can read just Chapter 1 of the sequence.
  - habryka 16 Apr 2019 17:38 UTC
    3 points
    Parent
    Ah, yes. Sorry. Should have made the authorship of that quote clearer.
- Jade Bishop 15 Apr 2019 20:03 UTC
  1 point
  Parent
  Thank you for your feedback! I haven’t read this yet, but it comes pretty close to a discussion I had with a friend over this post.
  
  Essentially, her argument started with a simple counterexample: she bought peanut M&Ms when she didn’t want to, and didn’t realise she was doing it until afterwards. In an earlier, similar situation, hungry and in the same place, she had wanted peanut M&Ms to satisfy her hunger, but this time she didn’t. She knew she didn’t want peanut M&Ms, and didn’t consciously decide to get them against that want; in this sense, I think a parallel can be drawn with akrasia, where rationality alone isn’t enough.
  
  Her point was this: There has to be a line drawn between “intentional conscious action” and “the result of a complex system of interacting parts that puppets the meat sack that holds our brain, sometimes in ways we don’t intend.” On a base level, failing to draw that line could result in, say, an AI that acts like a normal human but sometimes buys peanut M&Ms against its own volition. On an agent-based level where an AI is no more or less capable than a human, this isn’t much of an issue, and such things could make individual AI agents more convincing.
  
  But if you want to make a superintelligent AI to run your ideal utopia, you don’t want it to decide to feed everyone peanut M&Ms against their will on a whim.
  
  The biggest issue is that we can’t determine the difference between “intentional action” and “unintentional response”. If we could, then it would (according to her) be trivial to find out what the CEV of humanity is, no estimation needed.
  
  My largest assumption was that the lowest common denominator of human behaviour is “principled reasoning in pursuit of fixed, though unstated, goals”. More realistically, as another friend (and the post you linked) pointed out, the lowest common denominator of human behaviour is going to be “reproduce”, which has very unfortunate implications for the Friendliness of this hypothetical agent.
  
  A number of things could be done to ameliorate this, such as not including any means to reproduce or any data supporting reproduction in the trajectories, but they all seem inadequate or ad hoc. I don’t want to staple together a bunch of things I barely understand and declare it the Solution To AI (not that I was attempting to do that, anyway), especially when the issue isn’t necessarily with the technology and theory. As the peanut-M&M-purchasing friend put it, the technology is sufficient but this post overestimates humans. This wasn’t actually what I expected to have an issue with, and it shifts the problem from “improve technology and theories” to… what, “improve humans”? I’m at a loss as to where to go from here; inverse reinforcement learning has a demonstrable use-case and benefits, but the data is… not good. Garbage in gives garbage out. Is it really possible to improve human behaviour (or our analysis/collection of human behaviour) to achieve better results?
  - Rohin Shah 16 Apr 2019 17:35 UTC
    3 points
    Parent
    There’s a lot of speculation about related-ish topics in Chapter 3 of the sequence linked above.