If the agent doesn’t explore in the direction of answering “good”, then there’s no gradient in that direction.
Wait, I don’t think this is true? At least, I’d appreciate it being stepped thru in more detail.
In the simplest story, we’re imagining an agent whose policy is π_θ and, for simplicity’s sake, θ_0 is a scalar that determines “how much to maximize for reward” and all the other parameters of θ store other things about the dynamics of the world / decision-making process.
It seems to me that ∇_θ is obviously going to try to point θ_0 in the direction of “maximize harder for reward”.
In the more complicated story, we’re imagining an agent whose policy is π_θ which involves how it manipulates both external and internal actions (and thus both external and internal state). One of the internal state pieces (let’s call it s_0 like last time) determines whether it selects actions that are more reward-seeking or not. Again I think it seems likely that ∇_θ is going to try to adjust θ such that the agent selects internal actions that point s_0 in the direction of “maximize harder for reward”.
What is my story getting wrong?
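As a minimal numerical sketch of the “simplest story” above (the toy bandit, the `reward_feature` values, and the function names are illustrative assumptions, not anything from the thread): a two-armed bandit with a softmax policy whose single parameter `theta0` scales a fixed internal “reward estimate” feature. Under these toy assumptions the plain REINFORCE estimator does tend to push `theta0` upward whenever the feature tracks actual reward; whether that transfers to realistic training setups is exactly what the rest of the thread disputes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): two actions, true expected rewards, and an
# internal "reward estimate" feature that the policy can weight by theta0.
true_reward = np.array([0.2, 1.0])
reward_feature = np.array([0.1, 0.9])   # crude internal guess at reward

def policy(theta0):
    """Softmax policy; larger theta0 = lean harder on the reward feature."""
    logits = theta0 * reward_feature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_grad_theta0(theta0, n_samples=10_000):
    """Sample-based REINFORCE estimate of d E[R] / d theta0."""
    probs = policy(theta0)
    grads = []
    for _ in range(n_samples):
        a = rng.choice(2, p=probs)
        r = true_reward[a] + rng.normal(0, 0.1)
        # d log pi(a) / d theta0 for a softmax over theta0 * reward_feature
        dlogp = reward_feature[a] - probs @ reward_feature
        grads.append(r * dlogp)
    return np.mean(grads)

print(reinforce_grad_theta0(theta0=0.0))  # positive: pushes theta0 up
```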
I think Quintin[1] is maybe alluding to the fact that in the limit of infinite counterfactual exploration then sure, the gradient in sample-based policy gradient estimation will push in that direction. But we don’t ever have infinite exploration (and we certainly don’t have counterfactual exploration; though we come very close in simulations with resets), so in pure non-lookahead (e.g. model-free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approximation).
This seems right to me and it’s a nuance I’ve raised in a few conversations in the past. On the other hand, kind of half the point of RL optimisation algorithms is to do ‘enough’ exploration! And furthermore (as I mentioned under Steven’s comment) I’m not confident that such simplistic RL is the one that will scale to AGI first; cf. various impressive results from DeepMind over the years which use lots of shenanigans besides plain old sample-based policy gradient estimation (including model-based lookahead as in the Alpha and Mu gang). But maybe!
[1] This is a guess and I haven’t spoken to Quintin about this; Quintin, feel free to clarify/contradict.
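As a rough sketch of the “never tried, never reinforced” point (a toy tabular bandit of my own construction, assuming plain REINFORCE with positive rewards and no baseline): the sample-based gradient estimate is built only from sampled (action, reward) pairs, so an arm the policy gives essentially zero probability to is effectively never sampled, and its logit only ever gets nudged down by the softmax normalisation term, never up. The exception noted above, generalisation by function approximation, is exactly what this tabular toy leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy bandit (illustrative): arm 2 has by far the highest reward, but the
# policy currently gives it essentially zero probability.
true_reward = np.array([0.5, 0.6, 5.0])
logits = np.array([2.0, 2.0, -20.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

probs = softmax(logits)        # arm 2: ~2e-10 probability, never explored

grad = np.zeros(3)
n = 100_000
for _ in range(n):
    a = rng.choice(3, p=probs)
    r = true_reward[a]
    # REINFORCE: d log pi(a) / d logits = onehot(a) - probs
    onehot = np.eye(3)[a]
    grad += r * (onehot - probs)
grad /= n

print(probs)   # arm 2 is effectively never sampled
print(grad)    # grad[2] <= 0 here: with positive rewards and no baseline,
               # an unsampled arm's logit is only ever pushed *down*
```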
so in pure non-lookahead (e.g. model-free) sample-based policy gradient estimation, an action which has never been tried cannot be reinforced (except as a side effect of generalisation by function approximation).
This is the bit I don’t believe, actually. [Or at least don’t think is relevant.] Note that in Wei_Dai’s hypothetical, the neural net architecture has a particular arrangement such that “how much it optimizes for reward” is either directly or indirectly implied by the neural network weights. [We’re providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that doesn’t have access to that.]
Quintin seems to me to be arguing “if you actually follow the math, there isn’t a gradient to that parameter,” which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of “care more about reward.”
This doesn’t mean that, by caring about reward more, it knows which actions in the environment cause more reward. There I believe the story that the RL algorithm won’t be able to reinforce actions that have never been tried.
[EDIT: Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.]
Note that in Wei_Dai’s hypothetical, the neural net architecture has a particular arrangement such that “how much it optimizes for reward” is either directly or indirectly implied by the neural network weights. [We’re providing the reward as part of its observations, and so if nothing else the weights from that part of the input vector to deeper in the network will be part of this, but the actual mechanism is going to be more complicated for one that doesn’t have access to that.]
This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices. I can probably predict my own limbic reward outputs to some crude degree, but that doesn’t make me a reward optimizer.
Quintin seems to me to be arguing “if you actually follow the math, there isn’t a gradient to that parameter,” which I find surprising, and which seems easy to demonstrate by going thru the math. As far as I can tell, there is a gradient there, and it points in the direction of “care more about reward.”
I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
In the simplest story, we’re imagining an agent whose policy is π_θ and, for simplicity’s sake, θ_0 is a scalar that determines “how much to maximize for reward” and all the other parameters of θ store other things about the dynamics of the world / decision-making process.
It seems to me that ∇_θ is obviously going to try to point θ_0 in the direction of “maximize harder for reward”.
Seems like we’re assuming the whole ball game away. You’re assuming the cognition is already set up so as to admit easy local refinements towards maximizing reward more, that this is where the gradient points. My current guess is that freshly initialized networks will not have gradients towards modelling and acting to increase the antecedent-computation-reinforcer register in the real world (nor would this be the parametric direction of maximal increase of P(rewarding actions)).
For any observed data point in PG, you’re updating to make rewarding actions more probable given the policy network. There are many possible directions in which to increase P(rewarding actions), and internal reward valuation is only one particular direction. But if you’re already doing the “lick lollipops” action because you see a lollipop in front of you and have a hardcoded heuristic to grab it and lick it, then this starves any potential gradient (because you’re already taking the action of grabbing the lollipop).
Now, you might have a situation where the existing computation doesn’t get reward. But then policy gradient isn’t going to automatically “find” the bandit arm with even higher reward and then provide an exact gradient towards that action. PG is still reinforcing to increase the probability of historically rewarding actions. And you can easily hit gradient starvation there, I think.
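As a rough illustration of the gradient-starvation point (a toy construction with made-up names like `w_heuristic` and `w_reward_care`, not anything from the comment): once a hardcoded heuristic already takes the rewarding action with probability near 1, the REINFORCE gradient on every parameter, including a hypothetical “care about reward” weight, is close to zero, because d log π(a)/dθ vanishes as π(a) approaches 1.

```python
import numpy as np

# Toy "lollipop" policy (illustrative): the logit for licking is driven by a
# hardcoded heuristic weight plus a hypothetical "care about reward" weight.
lollipop_seen = 1.0
predicted_reward = 1.0
reward_for_licking = 1.0

def lick_prob(w_heuristic, w_reward_care):
    logit = w_heuristic * lollipop_seen + w_reward_care * predicted_reward
    return 1.0 / (1.0 + np.exp(-logit))

def grads_if_lick(w_heuristic, w_reward_care):
    """Exact REINFORCE gradient on the 'lick' trajectory: R * d log pi / d w."""
    p = lick_prob(w_heuristic, w_reward_care)
    dlogp_dlogit = 1.0 - p            # derivative of log sigmoid w.r.t. logit
    return (reward_for_licking * dlogp_dlogit * lollipop_seen,
            reward_for_licking * dlogp_dlogit * predicted_reward)

print(grads_if_lick(w_heuristic=0.0, w_reward_care=0.0))   # sizeable gradients
print(grads_if_lick(w_heuristic=10.0, w_reward_care=0.0))  # ~0: starved
```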
Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”? But that seems pretty strained and not very robust, as the first time it considers trying harder to get reward, it will likely get hooked.
If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices.
Sorry, if I’m reading this right, we’re hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think “nah, it needs to take an action before that action can be rewarded”, and my response is “wait, isn’t this going to be straightforwardly encouraged by backpropagation?”
[I am slightly departing from Wei_Dai’s hypothetical in my line of reasoning here, as Wei is mostly focused on asking “don’t you expect this to come about in an introspective-reasoning powered way?” and I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”.]
I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
Cool, this feels like a real reason, but also substantially more contingent. Naively, I would expect that you could construct a training schedule such that ‘care more about reward’ is encouraged, and someone will actually try to do this (as part of making a zero-shot learner in RL environments).
If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”. Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”
I see. Can’t speak for Quintin, but: I mostly think it won’t be present, but also conditional on the motivational edifice being present, I expect the edifice to bid up rewarding actions and get reinforced into a substantial influence. I have a lot of uncertainty in this case. I’m hoping to work out a better mechanistic picture of how the gradients would affect such edifices.
I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”.
I think there are a range of disagreements here, but also one man’s modus ponens is another’s modus tollens: High variance in heroin-propensity implies we can optimize heroin-propensity down to negligible values with relatively few bits of optimization (if we knew what we were doing, at least).
Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
This isn’t obviously true to me, actually. That strategy certainly sounds quotidian, but is it truly mechanistically deficient? If we tell the early training-AGI “Hey, if you hit the reward button, the ensuing credit assignment will drift your values by mechanisms A, B, and C”, that provides important information to the AGI. I think that that’s convergently good advice, across most possible values the AGI could have. (This, of course, doesn’t address the problem of whether the AGI does have good values to begin with.)
More broadly, I suspect there might be some misconception about myself and other shard theory researchers. I don’t think, “Wow humans are so awesome, let’s go ahead and ctrl+C ctrl+V for alignment.” I’m very very against boxing confusion like that. I’m more thinking, “Wow, humans have pretty good general alignment properties; I wonder what the generators are for that?”. I want to understand the generators for the one example we have of general intelligences acquiring values over their lifetime, and then use that knowledge to color in and reduce my uncertainty about how alignment works.
Maybe the argument is “but if it’s never tried the action of optimizing harder for reward, then the RL algorithm won’t be able to reinforce that internal action”?
That’s my reading, yeah, and I agree it’s strained. But yes, the ‘internal action’ of even ‘thinking about how to’ optimise for reward may not be trivial to discover.
Separately, the action-weighting downstream of that ‘thinking’ has to yield better actions than whatever the ‘rest of’ the cognition produces in order to be reinforced (it stands to reason that it might, but plausibly heuristics amounting to ‘shaped’ value and reward proxies are easier to get right, hence inner misalignment).
I agree that once you find ways to directly seek reward you’re liable to get hooked to some extent.
I think this sort of thing is worth trying to get nuance on, but I certainly don’t personally derive much hope from it directly (I think this sort of reasoning may lead to usable insights though).