Steven Byrnes comments on Reward Is Not Enough

Steven Byrnes 17 Jun 2021 19:52 UTC
LW: 9 AF: 3
> how does it avoid wireheading
Um, unreliably, at least by default. Like, some humans are hedonists, others aren’t.
I think there’s a “hardcoded” credit assignment algorithm. When there’s a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I’m not sure of the gory details here.)
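To make that concrete, here is a toy sketch of the kind of update I have in mind. Everything specific in it is an assumption for illustration: the function name `assign_credit`, the learning rate, the 0.25–0.75 s "recently became active" window, and the 0.2 weight for everything else you happened to be thinking about. It is not a claim about the actual algorithm in the brain.

```python
def assign_credit(values, onsets, now, rpe,
                  lr=0.1, window=(0.25, 0.75), background=0.2):
    """Toy credit assignment on a reward prediction error (RPE).

    values: dict mapping concept name -> current reward-prediction / value
    onsets: dict mapping each currently active concept -> time it became active
    now:    time of the reward prediction error
    rpe:    size (and sign) of the reward prediction error

    All numeric defaults are made-up illustrative parameters.
    """
    updated = dict(values)
    for concept, onset in onsets.items():
        lag = now - onset
        # Concepts that became active roughly half a second ago get full
        # credit; anything else that is active gets a smaller share.
        weight = 1.0 if window[0] <= lag <= window[1] else background
        updated[concept] = updated.get(concept, 0.0) + lr * weight * rpe
    return updated


if __name__ == "__main__":
    onsets = {"candy_in_mouth": 9.5, "standing_in_kitchen": 2.0}
    print(assign_credit({}, onsets, now=10.0, rpe=1.0))
```

In this sketch the concept that switched on half a second before the error ends up with about five times as much value as the one that had been active all along.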
Anyway, insofar as “the reward signal itself” is part of the world-model, it’s possible that reward-prediction / value will wind up attached to that concept. And then that’s a desire to wirehead. But it’s not inevitable. Some of the relevant dynamics are:
- Timing: if credit goes mainly to signals that slightly precede the reward prediction error, then the reward signal itself is not a great fit, since it arrives with the error rather than before it.
- Explaining away—once you have a way to accurately predict some set of reward signals, it makes the reward prediction errors go away, so the credit assignment algorithm stops running for those signals. So the first good reward-predicting model gets to stick around by default. Example: we learn early in life that the “eating candy” concept predicts certain reward signals, and then we get older and learn that the “certain neural signals in my brain” concept predicts those same reward signals too. But just learning that fact doesn’t automatically translate into “I really want those certain neural signals in my brain”. Only the credit assignment algorithm can make a thought appealing, and if the rewards are already being predicted then the credit assignment algorithm is inactive. (This is kinda like the behaviorism concept of blocking.)
- There may be some kind of bias to assign credit to predictive models that are simple functions of sensory inputs, when such a model exists, other things equal. (I’m thinking here of the relation between amygdala predictions, which I think are restricted to relatively simple functions of sensory input, versus mPFC predictions, which I think can involve more abstract situational knowledge. I’m still kinda confused about how this works though.)
- There’s a difference between hedonism-lite (“I want to feel good, although it’s not the only thing I care about”) and hedonism-level-10 (“I care about nothing whatsoever except feeling good”). My model would suggest that hedonism-lite is widespread, but hedonism-level-10 is vanishingly rare or nonexistent, because it requires that somehow all value gets removed from absolutely everything in the world-model except that one concept of the reward signal.
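The "explaining away" bullet can be sketched with a simple error-driven update. To be clear about what I'm swapping in: this is a Rescorla-Wagner-style rule (the textbook model of blocking), not the actual credit assignment algorithm, and the concept names, learning rate, and trial counts are all made up for illustration.

```python
def train(values, active, reward, trials, lr=0.2):
    """Rescorla-Wagner-style update: every active concept shares one RPE."""
    for _ in range(trials):
        prediction = sum(values[c] for c in active)
        rpe = reward - prediction  # reward prediction error
        for c in active:
            values[c] += lr * rpe
    return values


def run_blocking_demo():
    """Learn 'eating_candy' first, then add the reward-signal concept."""
    values = {"eating_candy": 0.0, "reward_signal_concept": 0.0}
    # Phase 1 (early life): only the candy concept predicts the reward.
    train(values, ["eating_candy"], reward=1.0, trials=50)
    # Phase 2 (later): the reward-signal concept is active too, but the
    # reward is already predicted, so there is almost no error to assign.
    train(values, ["eating_candy", "reward_signal_concept"],
          reward=1.0, trials=50)
    return values


def run_control_demo():
    """Control: both concepts are present from the very first trial."""
    values = {"eating_candy": 0.0, "reward_signal_concept": 0.0}
    train(values, ["eating_candy", "reward_signal_concept"],
          reward=1.0, trials=50)
    return values


if __name__ == "__main__":
    print("blocking:", run_blocking_demo())
    print("control: ", run_control_demo())
```

In the blocking run the late-arriving concept ends up with essentially zero value, while in the control run the two concepts split the value roughly evenly. That is the sense in which the first good reward-predicting model gets to stick around by default.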
For AGIs we would probably want to do other things too, like (somehow) use transparency to find “the reward signal itself” in the world-model and manually fix its reward-prediction / value at zero, or whatever else we can think of. Also, I think the more likely failure mode is “wireheading-lite”, where the desire to wirehead is trading off against other things it cares about, and then hopefully conservatism (section 2 here) can help prevent catastrophe.
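As a very rough illustration of the "manually fix its value at zero" idea, here is the same kind of toy error-driven update with a clamp. This assumes the hard part is already solved, i.e. that transparency tools have somehow handed us the set of concepts to clamp. The names `train_with_clamp` and `clamped` are hypothetical, and the update rule is a Rescorla-Wagner-style stand-in rather than anything a real AGI would necessarily use.

```python
def train_with_clamp(values, active, reward, trials, clamped, lr=0.2):
    """Toy error-driven update where some concepts are pinned at zero.

    clamped: set of concept names whose value is held at 0.0
             (assumed to have been identified by some transparency method).
    """
    for _ in range(trials):
        prediction = sum(values[c] for c in active)
        rpe = reward - prediction  # reward prediction error
        for c in active:
            if c in clamped:
                values[c] = 0.0  # manual override: never accumulates value
            else:
                values[c] += lr * rpe
    return values


if __name__ == "__main__":
    values = {"eating_candy": 0.0, "reward_signal_concept": 0.0}
    train_with_clamp(values, ["eating_candy", "reward_signal_concept"],
                     reward=1.0, trials=50,
                     clamped={"reward_signal_concept"})
    print(values)
```

Even with both concepts active from the first trial, all of the value lands on the unclamped concept in this sketch.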