Don’t worry, I’m going to be adding depth to the model. But note that the AI’s predictive accuracy is never in doubt. This is sort of a reverse “can’t derive an ought from an is”; here, you can’t derive a “wants” from a “did”. The learning agent will only get the correct human motivation (if such a thing exists) if it has the correct model of what counts as desires for a human, or some way of learning this model, which is what I’m looking at here (again, there’s a distinction between learning a model that gives correct predictions of human actions, and learning a model that gives what we would call a correct model of human motivation).
According to its model, the AI is not being manipulative here; it is simply doing what the human’s desires indicate it should.
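To make the prediction/motivation gap concrete, here is a minimal sketch (a toy construction of mine, not something from the argument above): two (planner, reward) pairs that attribute opposite desires to the human, yet predict exactly the same behaviour. The `boltzmann_planner` function, the `beta` rationality parameter, and the reward vectors are all illustrative assumptions.

```python
# Toy illustration of why perfect prediction doesn't pin down motivation.
# Everything here (the two-action setting, the softmax planner, the betas)
# is an assumption for the sketch, not part of the original argument.

import numpy as np

ACTIONS = ["press_button", "do_nothing"]

def boltzmann_planner(reward, beta):
    """Map a reward vector over actions to a choice distribution (softmax)."""
    z = np.exp(beta * reward)
    return z / z.sum()

# Pair A: the human genuinely wants to press the button, and is fairly rational.
reward_A = np.array([1.0, 0.0])
policy_A = boltzmann_planner(reward_A, beta=2.0)

# Pair B: the human wants *not* to press it, but the planner is "anti-rational"
# (negative beta), i.e. a model of someone systematically acting against
# their own desires.
reward_B = np.array([0.0, 1.0])
policy_B = boltzmann_planner(reward_B, beta=-2.0)

print(policy_A)  # [0.881 0.119]
print(policy_B)  # identical: [0.881 0.119]

# Both pairs predict the observed behaviour equally well, so behavioural data
# alone can't tell the agent which reward is the human's "real" motivation.
```

The point of the sketch is only that prediction under-determines motivation: without a prior model of what counts as a human desire, the data leaves the two interpretations tied.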