The idea that more information can make an AI’s inferences worse is surprising. But the assumption that humans have an unchanging, neatly hierarchical UF is known to be a bad one, so it is not so surprising that it leads to bad results. In short, this is still a bit clown-car-ish.
Would you tell an AI that Heroin is Bad, but not tell it that Manipulation is Bad?
Don’t worry, I’m going to be adding depth to the model. But note that the AI’s predictive accuracy is never in doubt. This is sort of a reverse “can’t derive an ought from an is”; here, you can’t derive a wants from a did. The learning agent will only get the correct human motivation (if such a thing exists) if it has the correct model of what counts as desires for a human. Or some way of learning this model, which is what I’m looking at (again, there’s a distinction between learning a model that gives correct predictions of human actions, and learning a model that gives what we would call a correct model of human motivation).
According to its model, the AI is not being manipulative here, simply doing what the human’s desires indicate it should.
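To make the prediction-versus-motivation distinction concrete, here is a minimal sketch of my own (the action names, the two planners, and the toy reward functions are all illustrative assumptions, not part of the formal model): two hypotheses pair different desires with different models of how the human acts on them, yet both predict the observed behaviour perfectly.

```python
# Illustrative sketch: identical predictions, opposite attributed desires.
# The observed data is just that the human takes the heroin.

OBSERVED_ACTION = "take_heroin"

def rational_planner(reward):
    """Assumes the human simply picks the action with the highest reward."""
    return max(["take_heroin", "refrain"], key=reward)

def compulsion_planner(reward):
    """Assumes addiction overrides the human's reward whenever heroin is available."""
    return "take_heroin"

hypotheses = {
    "wants heroin, acts rationally":
        (rational_planner, lambda a: 1.0 if a == "take_heroin" else 0.0),
    "wants to refrain, acts under compulsion":
        (compulsion_planner, lambda a: 1.0 if a == "refrain" else 0.0),
}

for label, (planner, reward) in hypotheses.items():
    predicted = planner(reward)
    print(f"{label}: predicts {predicted!r}, "
          f"matches observation: {predicted == OBSERVED_ACTION}")

# Both hypotheses predict the observed action, so predictive accuracy cannot
# tell them apart -- yet they ascribe opposite desires to the human. The extra
# structure (which planner counts as the "correct model of desires") has to
# come from somewhere other than the action data.
```

The point of the sketch is only that action data alone underdetermines the desire/planner split; whatever breaks the tie is exactly the “correct model of what counts as desires” discussed above.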