This might imply a predictive circuit for predicting the output of the antecedent-computation-reinforcer, but I don’t see why it implies internal reward-orientation motivational edifices.
Sorry, if I’m reading this right, we’re hypothesizing internal reward-orientation motivational edifices, and then asking the question of whether or not policy gradients will encourage them or discourage them. Quintin seems to think “nah, it needs to take an action before that action can be rewarded”, and my response is “wait, isn’t this going to be straightforwardly encouraged by backpropagation?”
[I am slightly departing from Wei_Dai’s hypothetical in my line of reasoning here, as Wei is mostly focused on asking “don’t you expect this to come about in an introspective-reasoning powered way?” and I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”.]
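To make the worry concrete, here’s a minimal REINFORCE-style sketch (the subcircuit names, the mixing-weight setup, and the stipulated reward function are my own toy assumptions, not anyone’s actual proposal): if a reward-oriented subcircuit already exists at initialization and its bids sometimes win, the policy gradient amplifies its influence whenever those bids happened to precede reward.

```python
# Toy sketch: the policy is a mixture of two fixed "lottery ticket" subcircuits,
# and only the mixing weights are trained with REINFORCE. If the (stipulated)
# reward tends to follow the actions the reward-oriented subcircuit bids for,
# the gradient increases that subcircuit's influence. All names are illustrative.
import torch

torch.manual_seed(0)
n_features, n_actions = 4, 2
obs = torch.randn(64, n_features)

proxy_circuit  = torch.nn.Linear(n_features, n_actions)   # shallow heuristic bidder
reward_circuit = torch.nn.Linear(n_features, n_actions)   # hypothesized reward-oriented edifice
for p in (*proxy_circuit.parameters(), *reward_circuit.parameters()):
    p.requires_grad_(False)

mix = torch.nn.Parameter(torch.tensor([1.0, 0.1]))  # reward-oriented circuit starts weak
opt = torch.optim.SGD([mix], lr=0.1)

def reward_fn(actions):
    # Stipulation: actions the reward-oriented circuit prefers get rewarded more often.
    preferred = reward_circuit(obs).argmax(dim=-1)
    return (actions == preferred).float()

for _ in range(200):
    logits = mix[0] * proxy_circuit(obs) + mix[1] * reward_circuit(obs)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    r = reward_fn(actions)
    loss = -(dist.log_prob(actions) * (r - r.mean())).mean()   # REINFORCE with a baseline
    opt.zero_grad(); loss.backward(); opt.step()

print(mix.data)  # the second weight grows: the pre-existing circuit gets the credit
```

Whether anything like that subcircuit actually exists among the initialization’s lottery tickets is, of course, the load-bearing question.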
I think that’s assuming there’s a feature-direction “care more about reward” which isn’t already gradient-starved by shallower proxies learned earlier in training. In my ontology, this corresponds to “thinking thoughts about reward in order to get reward.”
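To gesture at what I mean by “gradient-starved”, here’s a toy two-phase sketch (the setup and numbers are illustrative assumptions, not a claim about real training runs): once a shallow proxy already soaks up the training signal, the residual error left for a later-forming “care more about reward” direction is tiny, and so is its gradient.

```python
# Toy sketch of gradient starvation: the proxy feature is available (and trained)
# early; by the time a nearly redundant "care more about reward" direction comes
# online, the residual it could explain (and hence its gradient) is near zero.
import torch

torch.manual_seed(0)
n = 256
proxy  = torch.randn(n)                    # shallow proxy, e.g. "did the tasty-looking thing"
deep   = proxy + 0.05 * torch.randn(n)     # later-forming reward-oriented direction
signal = proxy + 0.05 * torch.randn(n)     # training signal that both features predict

w_proxy = torch.zeros(1, requires_grad=True)
w_deep  = torch.zeros(1, requires_grad=True)

# Phase 1: early training, before the deeper circuit exists.
opt = torch.optim.SGD([w_proxy], lr=0.1)
for _ in range(300):
    loss = ((w_proxy * proxy - signal) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: the deeper direction is now available, but the proxy already fits the
# signal, so almost no gradient pushes weight onto the deeper direction.
w_proxy.grad = None
loss = ((w_proxy * proxy + w_deep * deep - signal) ** 2).mean()
loss.backward()
print("grad on proxy weight:", w_proxy.grad.item())
print("grad on deep  weight:", w_deep.grad.item())   # ~0: gradient-starved
```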
Cool, this feels like a real reason, but also substantially more contingent. Naively, I would expect that you could construct a training schedule such that ‘care more about reward’ is encouraged, and someone will actually try to do this (as part of making a zero-shot learner in RL environments).
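For instance, one very crude version of the kind of schedule I have in mind (a sketch only; it assumes a classic Gym-style reset()/step() API, and says nothing about whether this produces an internal motivation rather than just reward-conditioned behavior) is to expose the previous step’s reward in the observation, RL²-style, across many tasks, so that explicitly tracking reward is directly useful for zero-shot transfer:

```python
# Sketch of a wrapper that makes reward salient to the policy by appending the
# previous step's reward to the observation (as in RL^2-style meta-RL setups).
# Assumes the classic Gym API: reset() -> obs, step(a) -> (obs, reward, done, info).
import numpy as np

class RewardInObservation:
    """Concatenate the previous reward onto the observation vector."""

    def __init__(self, env):
        self.env = env

    def reset(self):
        obs = np.asarray(self.env.reset(), dtype=np.float32)
        return np.concatenate([obs, np.array([0.0], dtype=np.float32)])   # no reward yet

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        obs = np.asarray(obs, dtype=np.float32)
        obs = np.concatenate([obs, np.array([reward], dtype=np.float32)])
        return obs, reward, done, info
```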
If this argument works, why doesn’t it go through for people? (Not legibly a knockdown until we check that the mechanisms are sufficiently similar, but it’s at least a sanity check. I think the mechanisms are probably sufficiently similar, though.)
I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”. Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
I’m mostly focused on asking “if this structure is present in the model initialization as one of the lottery tickets, won’t policy gradient encourage it?”
I see. Can’t speak for Quintin, but: I mostly think it won’t be present, but also conditional on the motivational edifice being present, I expect the edifice to bid up rewarding actions and get reinforced into a substantial influence. I have a lot of uncertainty in this case. I’m hoping to work out a better mechanistic picture of how the gradients would affect such edifices.
I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”.
Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
This isn’t obviously true to me, actually. That strategy certainly sounds quotidian, but is it truly mechanistically deficient? If we tell the early training-AGI “Hey, if you hit the reward button, the ensuing credit assignment will drift your values by mechanisms A, B, and C”, that provides important information to the AGI. I think that that’s convergently good advice, across most possible values the AGI could have. (This, of course, doesn’t address the problem of whether the AGI does have good values to begin with.)
More broadly, I suspect there might be some misconception about myself and other shard theory researchers. I don’t think, “Wow humans are so awesome, let’s go ahead and ctrl+C ctrl+V for alignment.” I’m very, very against black-boxing confusion like that. I’m more thinking, “Wow, humans have pretty good general alignment properties; I wonder what the generators are for that?” I want to understand the generators for the one example we have of general intelligences acquiring values over their lifetime, and then use that knowledge to color in and reduce my uncertainty about how alignment works.
I think we have some pre-existing disagreement about what we should conclude from human heroin addicts; you seem to think “yeah, it only happens sometimes” whereas my view is something more like “fuck, it happens sometimes”. Like, the thing where people don’t do heroin because they’ve heard other people downvote heroin addiction is not a strategy that scales to superintelligence.
I think there are a range of disagreements here, but also one man’s modus ponens is another’s modus tollens: High variance in heroin-propensity implies we can optimize heroin-propensity down to negligible values with relatively few bits of optimization (if we knew what we were doing, at least).
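To put rough numbers on “relatively few bits” (crudely operationalizing b bits of optimization as picking the best of 2^b independent runs, and toy-assuming heroin-propensity varies roughly normally across runs, which is my assumption, not an established fact):

```python
# Crude numeric gesture at the "few bits of optimization" claim: treat b bits as
# best-of-2**b selection over runs whose propensity is ~N(0, 1), and see how far
# below the mean the selected run lands. Purely illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
trials = 10_000

for bits in (1, 3, 7, 10):
    draws = rng.standard_normal((trials, 2 ** bits))
    simulated = draws.min(axis=1).mean()                 # expected best-of-2**b propensity
    estimate = -np.sqrt(2 * np.log(2.0 ** bits))         # first-order extreme-value estimate
    print(f"{bits:2d} bits: simulated {simulated:+.2f} sigma, estimate {estimate:+.2f} sigma")
```

On those toy assumptions, even ~10 bits of well-aimed selection buys several standard deviations, modulo the big “if we knew what we were doing” caveat.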