TurnTrout comments on Reward is not the optimization target

TurnTrout 15 Aug 2022 6:05 UTC
LW: 3 AF: 3
Note that in Wei_Dai’s hypothetical, the neural net architecture has a particular arrangement such that “how much it optimizes for reward” is either directly or indirectly implied by the neural network weights. [We’re providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that.]
This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices. I can probably predict my own limbic reward outputs to some crude degree, but that doesn’t make me a reward optimizer.
Quintin seems to me to be arguing “if you actually follow the math, there isn’t a gradient to that parameter,” which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of “care more about reward.”
I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
In the simplest story, we’re imagining an agent whose policy is $π_{θ}$ and, for simplicity’s sake, $θ_{0}$ is a scalar that determines “how much to maximize for reward” and all the other parameters of $θ$ store other things about the dynamics of the world / decision-making process.
It seems to me that $\nabla_{θ}$ is obviously going to try to point $θ_{0}$ in the direction of “maximize harder for reward”.
Seems like we’re assuming the whole ball game away. You’re assuming the cognition is already set up so as to admit easy local refinements towards maximizing reward more, that this is where the gradient points. My current guess is that freshly initialized networks will not have gradients towards modelling and acting to increase the antecedent-computation-reinforcer register in the real world (nor would this be the parametric direction of maximal increase of P(rewarding actions)).
For any observed data point in PG, you’re updating to make rewarding actions more probable given the policy network. There are many possible directions in which to increase P(rewarding actions), and internal reward valuation is only one particular direction. But if you’re already doing the “lick lollipops” action because you see a lollipop in front of you and have a hardcoded heuristic to grab it and lick it, then this starves any potential gradient (because you’re already taking the action of grabbing the lollipop).
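As a minimal sketch of the starvation point above (a toy, not a claim about real networks): assume a hypothetical one-action policy where P(lick) = sigmoid(heuristic_logit + reward_logit), with reward 1 for licking and 0 otherwise. The names `heuristic_logit` and `reward_logit` are illustrative stand-ins for the hardcoded heuristic and for internal reward valuation; neither comes from the discussion itself. Under those assumptions, the expected REINFORCE gradient on `reward_logit` is p(1 - p), which shrinks toward zero once the heuristic already makes licking near-certain.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def expected_pg_wrt_reward_logit(heuristic_logit: float, reward_logit: float) -> float:
    """Expected REINFORCE gradient with respect to reward_logit.

    Toy assumption: P(lick) = sigmoid(heuristic_logit + reward_logit),
    reward is 1 for licking and 0 for not licking.
    """
    p = sigmoid(heuristic_logit + reward_logit)
    # d/d(reward_logit) log P(lick) = 1 - p.
    # The not-lick branch contributes nothing because its reward is 0.
    return p * 1.0 * (1.0 - p)


# No heuristic yet: licking is a coin flip, so the gradient is at its largest.
weak = expected_pg_wrt_reward_logit(heuristic_logit=0.0, reward_logit=0.0)

# Strong hardcoded heuristic: licking is already near-certain.
strong = expected_pg_wrt_reward_logit(heuristic_logit=8.0, reward_logit=0.0)

print(weak)    # 0.25
print(strong)  # on the order of 1e-4
```

In this toy the gradient on every pathway into the action logit vanishes together, so it only illustrates the "already taking the action" case, not how credit would be split between competing circuits in a deep network.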
Now, you might have a situation where the existing computation doesn’t get reward. But then policy gradient isn’t going to automatically “find” the bandit arm with even higher reward and then provide an exact gradient towards that action. PG is still reinforcing to increase the probability of historically rewarding actions. And you can easily hit gradient starvation there, I think.
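A similarly hedged sketch of the bandit point: assume a two-armed softmax bandit trained with plain REINFORCE (no baseline), where the hypothetical `ARM_REWARDS` gives arm 1 the higher payout but the policy almost never pulls it. The sampled actions are fixed by hand here to keep the example deterministic; with these logits, a run of arm-0 pulls is by far the most likely history. The update only ever reinforces what was actually sampled, so the unexplored arm's probability goes down rather than up.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update for a softmax bandit policy (no baseline)."""
    probs = softmax(logits)
    # d/d(logit_k) log pi(action) = 1[k == action] - probs[k]
    return [
        z + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, z in enumerate(logits)
    ]


ARM_REWARDS = [1.0, 10.0]        # hypothetical: arm 1 pays more but is rarely tried
initial_logits = [4.0, -4.0]     # the policy almost always pulls arm 0
logits = list(initial_logits)

# Assumed history: every sampled pull so far was arm 0.
for _ in range(50):
    logits = reinforce_step(logits, action=0, reward=ARM_REWARDS[0])

p_before = softmax(initial_logits)[1]
p_after = softmax(logits)[1]
print(p_before, p_after)  # P(arm 1) shrinks even though arm 1 pays more
```

Nothing in the update refers to `ARM_REWARDS[1]`: the higher payout only enters the gradient if arm 1 is actually sampled.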
Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.
If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
- Vaniver 15 Aug 2022 18:34 UTC
  LW: 4 AF: 3
  This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices.
  Sorry, if I’m reading this right, we’re hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think “nah, it needs to take an action before that action can be rewarded”, and my response is “wait, isn’t this going to be straightforwardly encouraged by backpropagation?”
  [I am slightly departing from Wei_Dai’s hypothetical in my line of reasoning here, as Wei is mostly focused on asking “don’t you expect this to come about in an introspective-reasoning powered way?” and I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”.]
  I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
  Cool, this feels like a real reason, but also substantially more contingent. Naively, I would expect that you could construct a training schedule such that ‘care more about reward’ is encouraged, and someone will actually try to do this (as part of making a zero-shot learner in RL environments).
  If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
  I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”. Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
  - TurnTrout 22 Aug 2022 20:15 UTC
    LW: 5 AF: 3
    I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”
    I see. Can’t speak for Quintin, but: I mostly think it won’t be present, but also conditional on the motivational edifice being present, I expect the edifice to bid up rewarding actions and get reinforced into a substantial influence. I have a lot of uncertainty in this case. I’m hoping to work out a better mechanistic picture of how the gradients would affect such edifices.
    I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”.
    I think there is a range of disagreements here, but also one man’s modus ponens is another’s modus tollens: high variance in heroin-propensity implies we can optimize heroin-propensity down to negligible values with relatively few bits of optimization (if we knew what we were doing, at least).
    Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
    This isn’t obviously true to me, actually. That strategy certainly sounds quotidian, but is it truly mechanistically deficient? If we tell the early training-AGI “Hey, if you hit the reward button, the ensuing credit assignment will drift your values by mechanisms A, B, and C”, that provides important information to the AGI. I think that’s convergently good advice, across most possible values the AGI could have. (This, of course, doesn’t address the problem of whether the AGI does have good values to begin with.)
    More broadly, I suspect there might be some misconception about myself and other shard theory researchers. I don’t think, “Wow humans are so awesome, let’s go ahead and ctrl+C ctrl+V for alignment.” I’m very very against boxing confusion like that. I’m more thinking, “Wow, humans have pretty good general alignment properties; I wonder what the generators are for that?”. I want to understand the generators for the one example we have of general intelligences acquiring values over their lifetime, and then use that knowledge to color in and reduce my uncertainty about how alignment works.