[EDIT: see my response to this comment; this one is at least mildly confused]
[Again, I want to flag that this line of thinking/disagreement is not the most interesting part of what you/Quintin are saying overall—the other stuff I intend to think more about; nonetheless, I do think it’s important to get to the bottom of the disagreement here, in case anything more interesting hinges upon it]
[JC: There isn’t an objective human reward signal that mirrors an RL agent’s reward.]
You’re the second person to confidently have this reaction, and I’m pretty confused why.
My objection here is all in the “...that mirrors an RL agent’s reward”—that’s where the parallel doesn’t work in my view. An RL agent is trained to maximize total (discounted) reward. The brain isn’t maximizing total reward, nor trying to maximize total reward, nor is evolution acting on the basis that it’ll do either of these things.
I agree with the following:
The brain implements an outer criterion which evaluates and reinforces behavior/predictions and incentivizes some plans over others along different dimensions.
I just don’t think this tells us anything useful, since this criterion clearly is not maximisation of total discounted reward. (though I would expect some correlation)
It seems to me that the criterion is more like maximisation of in-the-moment reward (I’m using ‘reward’ very broadly here). For example, I might work rather than have fun because the thought of working happened to be more ‘rewarding’ than the thought of having fun. (Similarly, I might not wirehead, because the thought of wireheading is aversive.)
This seems essentially vacuous, because I don’t see a better way to measure in-the-moment reward (‘itm-reward’) than: if I did x rather than y, then x was more itm-rewarding than y. (To be clear, I’m saying this is not useful, and that I don’t see a principled definition of itm-reward that doesn’t amount to this; this is where a “crisp and clear mechanistic notion of what counted as human reward” would be handy, in order to come up with a non-vacuous definition.)
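To make the contrast concrete, here’s a minimal toy sketch of the two selection criteria I have in mind (all names and numbers are invented for illustration; this isn’t a claim about actual brain mechanisms):

```python
GAMMA = 0.9  # discount factor

def discounted_return(rewards, gamma=GAMMA):
    """Total discounted reward: the quantity an RL agent is trained to maximize."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def rl_style_choice(plans):
    """Pick the plan whose predicted reward trajectory has the highest discounted return."""
    return max(plans, key=lambda p: discounted_return(p["predicted_rewards"]))

def in_the_moment_choice(plans):
    """Pick the plan whose thought is most 'rewarding' right now (itm-reward).
    Note the near-vacuity: itm-reward is only observed via which plan wins."""
    return max(plans, key=lambda p: p["itm_reward_of_thinking_about_it"])

plans = [
    {"name": "work", "predicted_rewards": [0.1, 0.2, 5.0],
     "itm_reward_of_thinking_about_it": 0.8},
    {"name": "have fun", "predicted_rewards": [2.0, 0.5, 0.1],
     "itm_reward_of_thinking_about_it": 0.6},
]

print(rl_style_choice(plans)["name"], in_the_moment_choice(plans)["name"])
```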
Perhaps it’s clearer if I back up to your previous post and state a crisper disagreement:
If you don’t want to wirehead, you are not trying to optimize the objective encoded by the steering system in your own brain, and that’s an inner alignment failure with respect to that system.
This just seems wrong to me. The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce].
In an RL system these two are similar, precisely because the RL system is designed to steer towards outcomes with high total discounted reward according to its own metric.
In general, steering systems are not like this. The criterion for picking one plan over another can be [expected total reward] or [something entirely different].
Where a system doesn’t use [expected total reward], it seems just plain silly to me to call behaviour misaligned when it doesn’t match [what the system would incentivize if it did use expected total reward]. Of course it doesn’t match, since that’s not how this steering system works.
In this context, I mean the “steering system” to refer to the genetically hardcoded reward circuitry which provides intrinsic rewards when certain hardcoded preconditions are met. It isn’t learned. Maybe that’s part of the confusion?
An RL agent is trained to maximize total (discounted) reward. The brain isn’t maximizing total reward, nor trying to maximize total reward, nor is evolution acting on the basis that it’ll do either of these things.
An RL agent is reinforced for maximizing reward, but unless it has already fulfilled the prophecy of a convergence guarantee or unless it’s doing model-based brute-force planning to maximize reward over its time horizon, the RL agent is not actually maximizing reward, nor is it necessarily trying to maximize total reward.
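To gesture at where reward actually enters the picture, here’s a purely illustrative score-function (REINFORCE-style) sketch; the two-action setup, constants, and toy environment are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)       # policy parameters over two actions
LEARNING_RATE = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce_step(logits, action, reward):
    """Upweight the action that was taken, in proportion to the reward received.
    Reward is just a multiplier on a gradient; nothing here represents
    'total reward over the horizon' as an objective the policy reasons about."""
    probs = softmax(logits)
    grad = -probs
    grad[action] += 1.0    # gradient of log pi(action) w.r.t. the logits
    return logits + LEARNING_RATE * reward * grad

for _ in range(200):
    action = rng.choice(2, p=softmax(logits))
    reward = 1.0 if action == 0 else 0.5   # toy 'environment'
    logits = reinforce_step(logits, action, reward)

print(softmax(logits))  # the reinforced action gets upweighted; the update never consults total reward
```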
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce].
I don’t understand why you hold this view. We probably are talking past each other?
E.g. if I just have a crude sugar reward circuit in my brain which activates when I am hungry and my taste buds signal the brain in the right way, and I then learn to like licking real-world lollipops (because that’s the only way I was able to stimulate the circuit during training, when my values were forming), then the objective encoded by the reward circuit is… lollipop-licking in real life? But also, if I had only been exposed to chocolate during training, I would have learned to like eating chocolate. And if I had only been exposed to electrical taste bud stimulation during training, I would have learned to like electrical stimulation.
IMO the objective encoded by the reward circuit is the maximization of its own activations; that’s the optimal policy.
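A toy illustration of what I mean (every detail here is invented for illustration, not a model of actual reward circuitry): the hardcoded circuit only checks its preconditions, and which real-world behaviour ends up valued depends entirely on what happened to trigger it during training.

```python
def sugar_reward_circuit(hungry: bool, taste_signal: float) -> float:
    """Hardcoded, unlearned circuit: fires iff hungry and the taste buds report sweetness."""
    return 1.0 if hungry and taste_signal > 0.5 else 0.0

def learn_values(experiences):
    """Crude credit assignment: whatever behaviour co-occurred with the circuit firing
    gets reinforced as a learned value."""
    values = {}
    for behaviour, hungry, taste_signal in experiences:
        values[behaviour] = values.get(behaviour, 0.0) + sugar_reward_circuit(hungry, taste_signal)
    return values

# Same hardcoded circuit, three different training environments, three different learned 'objectives'.
print(learn_values([("lick lollipop", True, 0.9)] * 10))
print(learn_values([("eat chocolate", True, 0.8)] * 10))
print(learn_values([("electrically stimulate taste buds", True, 1.0)] * 10))
```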
Anyways, I think it would just make more sense for me to link you to a Gdoc explaining my views. PM’d.
Ok, putting my [maybe I’m missing the point] hat on, it strikes me that the above is considering the learned steering system—which is the outcome of any misalignment. So I probably am missing your point there (I think?). Oops.
However, I still think I’d stick to saying that:
The [objective encoded by the steering system] is not [maximisation of the score assigned by the steering system], but rather [whatever behaviour the steering system tends to produce]
But here I’d need to invoke properties of the original steering system (ignoring the handwaviness of what that means for now), rather than the learned steering system.
I think what matters at that point is sampling of trajectories (perhaps not only this—but at least this). There’s no mechanism in humans to sample in such a way that we’d expect maximisation of reward to be learned in the limit. Neither would we expect one, since evolution doesn’t ‘care’ about reward maximisation.
Absent such a sampling mechanism, the objective encoded isn’t likely to be maximisation of the reward.
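As a toy illustration of the kind of sampling property I mean (the setup and numbers are mine, purely for illustration, not a model of human learning): a tabular learner with persistent exploratory sampling is the sort of thing we’d expect to end up on the reward-maximising option; drop the exploration and that expectation goes away.

```python
import numpy as np

rng = np.random.default_rng(0)
REWARDS = np.array([1.0, 0.3])   # two options; option 0 yields more reward
ALPHA = 0.1                      # learning rate

def run(epsilon, steps=5000):
    """Simple value learner that starts out preferring the worse option."""
    q = np.array([0.0, 0.5])
    for _ in range(steps):
        explore = rng.random() < epsilon
        a = rng.integers(2) if explore else int(np.argmax(q))
        q[a] += ALPHA * (REWARDS[a] - q[a])
    return int(np.argmax(q))

print(run(epsilon=0.1))  # exploratory sampling: ends up preferring the higher-reward option 0
print(run(epsilon=0.0))  # no exploration: stays stuck on option 1; no reward maximisation in the limit
```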
To talk about inner misalignment, I think we need to be able to say something like:
(1) Under [learning conditions], we expect system x to maximise y in the limit.
(2) System x does not robustly learn to pursue y (rather than a proxy for y), so that under [different conditions] x no longer maximises y.
Here I don’t think we have (1), since we don’t expect the human system to learn to maximise reward (or minimise regret, or...) in the limit (i.e. this is not the objective encoded by its original steering system).
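Very roughly, and only as an illustrative formalisation in the RL case (the notation and distributions here are my own shorthand): (1) would say that under the training setup, $\pi_\theta \to \arg\max_\pi \mathbb{E}_{\tau \sim D_{\text{train}}(\pi)}\left[\sum_t \gamma^t r_t\right]$ in the limit, while (2) would say that under shifted conditions, $\mathbb{E}_{\tau \sim D_{\text{deploy}}(\pi_\theta)}\left[\sum_t \gamma^t r_t\right]$ falls well short of $\max_\pi \mathbb{E}_{\tau \sim D_{\text{deploy}}(\pi)}\left[\sum_t \gamma^t r_t\right]$.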
Anyway, hopefully it’s now clear where I’m coming from—even if I am confused!
My guess is that this doesn’t matter much to your/Quintin’s broader points(?), beyond that “inner alignment failure” may not be the best description.