If we look at the field of reinforcement learning, it appears to be generally useful to add intrinsic motivation for exploration to an agent. This is the exact opposite of rewarding predictability: in one case we add reward for entering unpredictable states, whereas in the other we add reward for entering predictable states. I’ve seen people try to defend minimizing prediction error by pointing out that such an agent is still motivated to learn (in order to figure out how to avoid unpredictability). However, the fact remains that it is motivated to learn strictly less than an unpredictability-loving agent. RL has, in practice, found it useful to add reward for unpredictability; this suggests that evolution might have done the same, and that it would not have done the exact opposite. Agents operating under a prediction-error penalty would likely under-explore.
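To make the sign difference concrete, here is a minimal sketch (in Python, with illustrative names of my own, not taken from any particular RL library) of the two kinds of reward shaping: a curiosity bonus adds prediction error to the reward, while a prediction-error penalty subtracts it.

```python
import numpy as np

# Illustrative sketch only: a curiosity-driven agent adds prediction error to
# its reward, while a prediction-error-minimizing agent subtracts it.

def shaped_reward(extrinsic_reward, predicted_obs, actual_obs, beta=0.1, seek_error=True):
    """Combine task reward with a prediction-error term.

    seek_error=True  -> intrinsic-motivation bonus (steer toward the unpredictable)
    seek_error=False -> prediction-error penalty (steer toward the predictable)
    """
    prediction_error = np.mean((np.asarray(predicted_obs) - np.asarray(actual_obs)) ** 2)
    sign = 1.0 if seek_error else -1.0
    return extrinsic_reward + sign * beta * prediction_error

# Same situation, opposite shaping:
print(shaped_reward(1.0, [0.0, 0.0], [0.5, 0.5], seek_error=True))   # 1.025 (bonus)
print(shaped_reward(1.0, [0.0, 0.0], [0.5, 0.5], seek_error=False))  # 0.975 (penalty)
```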
I ended up replying to this in a separate post, since similar objections kept coming up. My short answer: minimization of prediction error is minimization of error at predicting the input to a control system, and that control system may not be free to change its prediction set point. So a control system is not always globally minimizing prediction error; it is locally minimizing prediction error, and it may never become less wrong over time, because it can’t change the prediction to better fit the input.
From an evolutionary perspective my guess is that true Bayesian updating is a fairly recent adaptation, and that most minimization of prediction error is minimization of error relative to mostly fixed prediction set points that are beneficial for survival.
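As a toy illustration of what I mean by a fixed set point (a sketch with made-up numbers, not a claim about any particular biological system): the controller below reduces error only by acting on its input, never by revising the prediction itself.

```python
# Toy control loop with a fixed prediction set point (illustrative only).
# The system locally minimizes prediction error by acting on its input,
# but it cannot "become less wrong" by moving the set point itself.

set_point = 37.0   # fixed prediction, e.g. a survival-relevant target value
state = 30.0       # current input to the controller
gain = 0.5         # how strongly action corrects the error

for step in range(10):
    error = set_point - state   # prediction error relative to the fixed set point
    state += gain * error       # act on the world to reduce the error
    print(f"step {step}: state={state:.2f}, error={abs(error):.2f}")

# Error shrinks, but only because the input was pushed toward the prediction;
# the prediction itself never updates to better fit the input.
```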
I left a reply to this view on the other comment, but I don’t feel it connects very well to the point I was trying to make.
Your OP talks about minimization of prediction error as a theory of human value, relevant to alignment. It might be that evolution re-purposes predictive machinery to pursue adaptive goals; this seems like the sort of thing evolution would do. However, this leaves open the question of what those goals are. You say you’re not claiming that humans globally minimize prediction error. But, partly because of the remarks you made in the OP, I’m reading you as suggesting that humans do minimize prediction error, just relative to a skewed prediction.
Are human values well-predicted by modeling us as minimizing prediction error relative to a skewed prediction?
My argument here is that evolved creatures such as humans are more likely to (as one component of value) steer toward prediction error, because doing so tends to lead to learning, which is broadly valuable. This is difficult to model by taking a system which minimizes prediction error and skewing the predictions, because steering toward error is the exact opposite of minimizing it.
Elsewhere, you suggest that exploration can be predicted by your theory if there’s a sort of reflection within the system, so that prediction error is itself predicted. The system therefore has an overall set-point for prediction error and explores if error falls too low. But I think such a part would be drowned out. If I started with a system which minimizes prediction error and added a curiosity drive on top of it, I would have to entirely cancel out the error-minimization drive before the curiosity could do its job. The same goes for your hypothesized part: everything else in the system is strategically avoiding error, so one part steering toward error would have to out-vote or out-smart all those other parts.
Now, that’s overstating my point. I don’t think the human curiosity drive is exactly seeking maximum prediction error; I think it’s more likely related to the derivative of prediction error. But the point remains that even this is difficult to model as minimization of a skewed prediction error, and it requires a sub-part implementing curiosity to drown out all the other parts.
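To spell out the derivative idea (a sketch under my own assumptions, along the lines of what the intrinsic-motivation literature calls learning progress): the curiosity signal below rewards a reduction in prediction error rather than the error’s current level.

```python
# Sketch of a curiosity signal based on the change in prediction error
# ("learning progress"), not on the raw error level. Illustrative only.

def curiosity_signal(error_history):
    """Reward how quickly prediction error is falling, not how large it is."""
    if len(error_history) < 2:
        return 0.0
    return error_history[-2] - error_history[-1]  # positive when error is dropping

# A state where the agent is learning fast is rewarding, even though the
# current error is still large...
print(curiosity_signal([1.0, 0.6]))    # 0.4
# ...while an already well-predicted state offers little curiosity reward.
print(curiosity_signal([0.05, 0.05]))  # 0.0
```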
Instead of modeling human value as minimization of error of a skewed prediction, why not step back and model it as minimizing “some kind of error”? This seems no less parsimonious (since you have to specify the skew anyway), and leaves you with all the same controller machinery to propagate error through the system and learn to avoid it.