I think that certain Reinforcement Learning setups work in the “selectionist” way you’re talking about, but that there are ALSO ways to get “incentivist” models.
The key distinction would be whether (1) the reward signals are part of the perceptual environment, or (2) the reward signals are simple enough, relative to the pattern-matching machinery, that the system can learn to predict them very tightly as part of learning to maximize the overall reward.
Note that the second mode is basically “goodharting” the “invisible” reward signals that were probably intended by the programmers to be perceptually inaccessible (since they didn’t put them in the percepts)!
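To make the distinction concrete, here is a minimal sketch (all names and interfaces here are hypothetical, not from any particular library): the only difference between the two setups is whether the scalar reward gets concatenated into the observation that the policy conditions on.

```python
import numpy as np

def rollout(env, policy, max_steps, reward_is_percept):
    """Collect one episode; `env` and `policy` are assumed, hypothetical interfaces."""
    obs = env.reset()
    if reward_is_percept:
        obs = np.append(obs, 0.0)              # no reward received yet
    trajectory = []
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        if reward_is_percept:
            # "Incentivist" setup: the reward is literally part of what the
            # agent sees on the next timestep.
            next_obs = np.append(next_obs, reward)
        trajectory.append((obs, action, reward))
        obs = next_obs
        if done:
            break
    # In BOTH setups the trainer still sums the rewards afterward and uses them
    # to update the policy's weights; in the "selectionist" setup that update is
    # the ONLY channel through which reward touches the agent.
    return trajectory
```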
You could think of (idealized fake thought-experiment) humans as having TWO kinds of learning and intention formation.
One kind of RL-esque learning might happen “in dreams during REM”, and the other could happen “moment to moment, via prediction and backchaining, like a chess bot, in response to pain and pleasure signals that are perceptible the way better or worse scores for future board states (based on material-and-so-on) are perceptible”.
You could have people who have only “dream learning” who never consciously “sense pain” as a raw percept day-to-day and yet who learn to avoid it slightly better every night, via changes to their habitual patterns of behavior that occur during REM. This would be analogous to “selectionist RL”.
You could also have people who have only “pain planning”, who always consciously “sense pain” and have an epistemic engine that gets smarter and smarter, plus a deep (exogenous? hard-coded?) inclination to throw planning routines and memorized wisdom at the problem of avoiding pain better each day. If their planning engine learns new useful things very fast, they could even improve over the course of short periods of time within a single day, or a single tiny behavioral session that includes looking and finding and learning and then changing plans. This would be analogous to “incentivist RL”.
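A toy contrast, purely illustrative (none of these functions or data structures come from anywhere; they just name the two mechanisms): the “dream learner” only touches pain offline, as a statistic tallied over the day, while the “pain planner” consults a perceptible pain signal at every step and backchains over imagined futures the way a chess bot scores board states.

```python
def dream_update(habit_weights, day_log, lr=0.1):
    """'Dream learning' (selectionist-style): offline credit assignment.
    day_log is a list of (habit_used, pain_incurred) pairs tallied while awake;
    pain is never consulted at decision time, only here, 'during REM'."""
    for habit, pain in day_log:
        habit_weights[habit] = habit_weights.get(habit, 0.0) - lr * pain
    return habit_weights


def plan_next_action(state, actions, simulate, predict_pain, depth=2):
    """'Pain planning' (incentivist-style): online lookahead using the agent's
    own learned models, `simulate` and `predict_pain`."""
    def future_pain(s, d):
        pain_here = predict_pain(s)
        if d == 0:
            return pain_here
        return pain_here + min(future_pain(simulate(s, a), d - 1) for a in actions)
    # Pick the action whose imagined future contains the least total pain.
    return min(actions, key=lambda a: future_pain(simulate(state, a), depth))
```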
The second kind is probably helpful in speeding up learning so that we don’t waste signals.
If pain is tallied up for use during sleep updates, then it could be wasteful to deprive other feedback systems of this same signal, once it has already been calculated.
Also, if the reward signal that is perceptible is very very “not fake”, then creating “inner optimizers” that have their own small fast signal-pursuing routines might be exactly what the larger outer dream loop would do, as an efficient way to get efficient performance. (The non-fakeness would protect against goodharting.)
(Note: you’d expect antagonistic pleiotropy here in long-lived agents! The naive success/failure pattern would be that it is helpful for a kid to learn fast from easy simple happiness and sadness… and dangerous for the elderly to be slaves to pleasure or pain.)
Phenomenologically: almost all real humans perceive pain, and can level up their skills over the course of minutes and hours of practice in brand new skill domains.
This suggests that something like incentivist RL is probably built into humans, and is easy for us to imagine or empathize with, and is probably a thing our minds attend to by default.
Indeed, that might be why we “have mechanically aware and active and conscious minds at all”: so that this explicit planning loop is able to work?
So it would be an easy “mistake to make” to think that this is how “all Reinforcement Learning algorithms” would “feel from the inside” <3
However, how does our pain and pleasure system stay so well calibrated? Is that other, less visible, outer reward loop actually part of how human learning “actually works” too?
Note that above I mentioned an “(exogenous? hard-coded?) inclination to throw planning routines and memorized wisdom at the problem of avoiding pain” that was a bit confusing!
Where does that “impulse to plan” come from?
How does “the planner” decide how much effort to throw at each perceptual frustration or perceivable pleasure? When or why does the planner “get bored” and when does it “apply grit”?
Maybe that kind of “subjectively invisible” learning comes from an outer loop that IS in fact IN HUMANS?
We know that dreaming does seem to cause skill improvement. Maybe our own version of selectionist reinforcement (if it exists) would be operating to keep us normally sane and normally functional from day to day… in a way that is just as “moment-to-moment invisible” to us as it might be to algorithms?
And we mostly don’t seem to fall into wireheading, which is kind of puzzling if you reason things out from first principles and predict the mechanistically stupid behavior that a pain/pleasure signal would naively generate...
NOTE that it seems quite likely to me that a sufficiently powerful RL engine that was purely selectionist (with reward signals intentionally made invisible to the online percepts of the model), and that got very simple rewards applied for very simple features of a given run… would probably LEARN to IMAGINE those rewards, invent weights that implement “means/ends reasoning”, and invent “incentivist behavioral patterns” aimed at whatever rewards it imagines?
That is: in the long run, with lots of weights and training time, and a simple reward function, inner optimizers with imagined rewards wired up as “perceivable to the inner optimizer” are probably default solutions to many problems.
HOWEVER… I’ve never seen anyone implement BOTH these inner and outer loops explicitly, or reason about their interactions over time as having the potential to detect and correct goodharting!
Presumably you could design a pleasure/pain system that is, in fact, perceptually available, on purpose?
Then you could have those signals “be really real” in that they make up PART of the “full true reward”...
...but then have other parts of the total selectionist reward signal only be generated and applied by looking at the gestalt story of the behaviors and their total impact (like whether they caused a lot of unhelpful ripples in parts of the environment that the agent didn’t and couldn’t even see at the time of the action).
If some of these simple reward signals are mechanistic (and online perceptible to the model) then they could also be tunable, and you could actually tune them via the holistic rewards in a selectionist RL way.
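A rough sketch of what that two-loop arrangement might look like (every function and parameter name below is invented for illustration, and the outer update is written as a simple evolution-strategies-style selection step, which is just one convenient way to do a “selectionist” update): the agent trains online against a perceptible shaped reward parameterized by `theta`, while `theta` itself is tuned by a slow holistic score computed over whole runs, including side effects the agent never perceived.

```python
import numpy as np

def train_two_loops(make_env, make_agent, holistic_score,
                    outer_steps=50, inner_episodes=20,
                    population=10, sigma=0.1, lr=0.02):
    theta = np.zeros(8)    # parameters of the *perceptible* pleasure/pain signal
    for _ in range(outer_steps):
        perturbations = [sigma * np.random.randn(*theta.shape)
                         for _ in range(population)]
        scores = []
        for eps in perturbations:
            agent = make_agent()
            env = make_env(reward_params=theta + eps)   # inner reward is tunable
            for _ in range(inner_episodes):
                agent.train_episode(env)   # incentivist loop: reward is a percept
            # Selectionist loop: score the gestalt of the whole run, including
            # "ripples" the agent could not perceive at action time.
            scores.append(holistic_score(env.full_history()))
        scores = np.array(scores)
        advantages = scores - scores.mean()
        update = sum(a * eps for a, eps in zip(advantages, perturbations))
        theta = theta + lr * update / (population * sigma)
    return theta
```

The point of the sketch is just that `reward_params` is both a real, perceptible part of the agent’s world AND an ordinary tunable object from the outer loop’s point of view.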
Once you have the basic idea of “have there be two layers, with the broader slower less accessible one tuning the narrower faster more perceptible one”, a pretty obvious thought would be to put an even slower and broader layer on top of those!
A lot of hierarchical Bayesian models get a bunch of juice from the first extra layer, but by the time you have three or four layers the model complexity stops being worth the benefits to the loss function.
I wonder if something similar might apply here?
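In code the recursion is almost trivial to write down, even if (as in the Bayesian case) the later layers might not pay for themselves; everything below is invented purely for illustration, with each layer just doing noisy selection over the hyper-parameters of the layer beneath it:

```python
import random

def train_stack(level, hyper, train_inner, score, candidates=5, noise=0.1):
    """level 0 is the ordinary fast, perceptible-reward learner; every level
    above it is a slower, broader, post-hoc selection over the level below."""
    if level == 0:
        return train_inner(hyper)              # fast, perceptible inner loop
    pool = [[h + noise * random.gauss(0, 1) for h in hyper]
            for _ in range(candidates)]
    trained = [train_stack(level - 1, h, train_inner, score, candidates, noise)
               for h in pool]
    # Keep whichever lower-level training run looks best according to this
    # layer's slower, less perceptible criterion.
    return max(trained, key=lambda result: score(level, result))
```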
Maybe after you have “hierarchical stacks of progressively less perceptually accessible post-hoc selectionist RL updates to hyper-parameters”...
...maybe the third or fourth or fifth layer of hyper-parameter tuning like this just “magically discovers the solution to the goodharting problem” from brute force application of SGD?
That feels like it would be “crazy good luck” from a Friendliness research perspective. A boon from the heavens! Therefore it probably can’t work for some reason <3
Yet also it doesn’t feel like a totally insane prediction for how the modeling and training might actually end up working?
No one knows what science doesn’t know, and so it could be that someone else has already had this idea. But this idea is NEW TO ME :-)
Has anyone ever heard of this approach to solving the goodhart problem being tried already?