So although the AI is motivated to learn, it is also motivated to manipulate the learning process.
It seems like the problem here is that the prior probability that the human says “cake” depends on the AI’s policy. The update when seeing the human actually say “cake” isn’t a problem, due to conservation of expected evidence.
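To make the distinction concrete, here is a minimal sketch (with hypothetical numbers, not anything from Everitt's papers) of why the update itself is fine but a policy-dependent prior is not: under a fixed observation model the expected posterior equals the prior, whereas a policy that shifts how often the human says "cake" lets the AI push its expected posterior away from the prior.

```python
def posterior_likes_cake(prior, p_cake_if_likes, p_cake_if_not, said_cake):
    """Bayes update on whether the human values cake, given what they said."""
    if said_cake:
        num = prior * p_cake_if_likes
        den = num + (1 - prior) * p_cake_if_not
    else:
        num = prior * (1 - p_cake_if_likes)
        den = num + (1 - prior) * (1 - p_cake_if_not)
    return num / den

prior = 0.5          # AI's prior that the human genuinely values cake
q1, q0 = 0.9, 0.2    # honest model: P(says "cake" | likes), P(says "cake" | doesn't)

# Conservation of expected evidence: under the honest observation model,
# the expected posterior equals the prior.
p_cake = prior * q1 + (1 - prior) * q0
expected_post = (p_cake * posterior_likes_cake(prior, q1, q0, True)
                 + (1 - p_cake) * posterior_likes_cake(prior, q1, q0, False))
print(expected_post)  # 0.5

# If the AI's policy pressures the human into saying "cake" regardless of their
# values, the actual chance of hearing "cake" rises, but the AI still updates
# with the honest model -- so its expected posterior is pushed above the prior.
q1_m, q0_m = 0.99, 0.95
p_cake_m = prior * q1_m + (1 - prior) * q0_m
expected_post_m = (p_cake_m * posterior_likes_cake(prior, q1, q0, True)
                   + (1 - p_cake_m) * posterior_likes_cake(prior, q1, q0, False))
print(expected_post_m)  # ~0.8 -- the "prior" over what the human says now depends on policy
```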
Under my (very incomplete) model of Everitt’s approach, the programmer will specify the prior over values (so the prior is independent of the AI’s policy), then disallow actions that would prevent the reward signal from being an unbiased estimate of the values.
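As a rough illustration of what "disallow actions that bias the reward signal" might look like, here is a sketch that filters actions by checking whether the signal's mean matches the true value under each candidate action. The world model, action names, and numbers are all hypothetical placeholders, not Everitt's actual formalism.

```python
import numpy as np

VALUES = [0.0, 1.0]  # possible true values of an outcome (hypothetical)

def signal_samples(action, true_value, rng, n=100_000):
    """Hypothetical world model: distribution of the reward signal given an action."""
    noise = rng.normal(0.0, 0.1, n)
    if action == "ask_honestly":
        return true_value + noise   # unbiased: E[signal] == true_value
    if action == "pressure_human":
        return 1.0 + noise          # biased: signal says "cake" no matter what
    raise ValueError(action)

def is_unbiased(action, rng, tol=0.01):
    """Keep only actions under which the reward signal tracks the true values."""
    return all(abs(signal_samples(action, v, rng).mean() - v) < tol for v in VALUES)

rng = np.random.default_rng(0)
allowed = [a for a in ["ask_honestly", "pressure_human"] if is_unbiased(a, rng)]
print(allowed)  # ['ask_honestly']
```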