I think that both the easy and hard problems of wireheading are predicated on 1) a misunderstanding of RL (thinking that reward is, or should be, the optimization target of the RL agent) and 2) trying to black-box human judgment instead of just getting some good values into the agent’s own cognition. I don’t think you need anything mysterious for the latter. I’m confident that RLHF, done skillfully, does the job just fine. The questions there would be more like “what sequence of reward events will reinforce the desired shards of value within the AI?” and not “how do we philosophically do some fancy framework so that the agent doesn’t end up hacking its sensors or maximizing the quotation of our values?”
I think I don’t understand what you mean by (2), and as a consequence, don’t understand the rest of this paragraph?
WRT (1), I don’t think I was being careful about the distinction in this post, but I do think the following:
The problem of wireheading is certainly not that RL agents are trying to take control of their reward feedback by definition; I agree with your complaint about Daniel Dewey as quoted. It’s a false explanation of why wireheading is a concern.
The problem of wireheading is, rather, that none of the feedback the system gets can disincentivize (i.e., provide differentially more loss for) models which are making this mistake. To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can’t do that with respect to the specific failure mode of wireheading, because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models (assuming similar competence levels in both, of course, which I admit is a bit fuzzy).
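A toy sketch may make this claim concrete. It reads the claim in the "models predicting reinforcement" framing, which is itself disputed later in this thread. Everything here is a hypothetical illustration: the `State` fields, the reward channel `reward_process`, and the two candidate models are assumptions, not anything from an actual training setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    diamonds: int   # diamonds actually produced
    tampered: bool  # whether the reward channel has been hacked


def reward_process(s: State) -> int:
    """Hypothetical reward channel: honest diamond count unless tampered with."""
    return 10 if s.tampered else s.diamonds


def diamond_model(s: State) -> int:
    """Predicts reward as 'number of diamonds made'."""
    return s.diamonds


def process_model(s: State) -> int:
    """Predicts reward by modelling the reward channel itself."""
    return reward_process(s)


def squared_loss(model, states) -> int:
    """Total squared error of a model's reward predictions."""
    return sum((model(s) - reward_process(s)) ** 2 for s in states)


# Training only ever visits untampered states.
train_states = [State(d, False) for d in range(5)]
# States where the two models come apart.
tampered_states = [State(d, True) for d in range(5)]

loss_diamond_train = squared_loss(diamond_model, train_states)
loss_process_train = squared_loss(process_model, train_states)
loss_diamond_tampered = squared_loss(diamond_model, tampered_states)
loss_process_tampered = squared_loss(process_model, tampered_states)
```

On the untampered training states both models get identical (zero) loss, so the feedback cannot separate them. They differ only on tampered states, where the process model still does at least as well.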
> To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can’t do that with respect to the specific failure mode of wireheading.
I think that’s not true. The point where you deal with wireheading probably isn’t what you reward so much as when you reward. If the agent doesn’t even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.
This isn’t a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models). And one reason is that I don’t think that RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major disagreement point? (I know you noted its fuzziness, so maybe you’re already sympathetic to responses like the one I just gave?)
> I think that’s not true. The point where you deal with wireheading probably isn’t what you reward so much as when you reward. If the agent doesn’t even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.
I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that it’s closer in practice to “all the hypotheses are around at the beginning”—it doesn’t matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don’t change that much by introducing it at different stages in training.
Plausibly this is true of some training setups and not others; e.g., more true for LLMs and less true for RL.
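One way to spell out the tangent-space intuition is the following sketch. It assumes a squared loss and a network that stays close to its linearization around the initial parameters. The symbols $\theta_0$, $\phi$ and $\ell$ are introduced here for illustration and are not from the discussion above.

```latex
% Linearized (tangent-space) approximation around the initial parameters \theta_0
f(x;\theta) \approx f(x;\theta_0) + \phi(x)^\top (\theta - \theta_0),
\qquad \phi(x) := \nabla_\theta f(x;\theta_0)

% Squared loss on a single datum (x, y)
\ell(\theta; x, y) = \tfrac{1}{2}\bigl(f(x;\theta) - y\bigr)^2

% Its gradient under the linearization
\nabla_\theta \ell(\theta; x, y) \approx \bigl(f(x;\theta) - y\bigr)\,\phi(x)
```

Under these assumptions the gradient for a datum always lies along the fixed vector $\phi(x)$, whenever that datum is introduced. Only the scalar residual $f(x;\theta) - y$ depends on the current parameters. How well this approximation holds for real RL training is exactly the open question.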
Let’s set aside the question of whether it’s true, though, and consider the point you’re making.
> This isn’t a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models).
So I understand one of your major points to be: thinking about training as the chisel which shapes the policy doesn’t necessitate thinking in terms of incentives (i.e., gradients pushing in particular directions). The ultimate influence of a gradient isn’t necessarily the thing it immediately pushes for/against.
I tentatively disagree based on the point I made earlier; plausibly the influence of a gradient step is almost exclusively its immediate influence.
But I don’t disagree in principle with the line of investigation. Plausibly it is pretty important to understand this kind of evidence-ordering dependence. Plausibly, failure modes in value learning can be avoided by locking in specific things early, before the system is “sophisticated enough” to be doing training-process-simulation.
I’m having some difficulty imagining powerful conceptual tools along those lines, as opposed to some relatively simple stuff that’s not that useful.
> And one reason is that I don’t think that RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major disagreement point? (I know you noted its fuzziness, so maybe you’re already sympathetic to responses like the one I just gave?)
I’m confused about what you mean here. My best interpretation is that you don’t think current RL systems are modeling the causal process whereby they get reward. On my understanding, this does not closely relate to the question of whether our understanding of training should focus on the first-order effects of gradient updates or should also admit higher-order, longer-term effects.
Maybe on your understanding, the actual reason why current RL systems don’t wirehead too much is training-order effects? I would be surprised to come around on this point. I don’t see it.
> To me, the tangent space stuff suggests that it’s closer in practice to “all the hypotheses are around at the beginning”: it doesn’t matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don’t change that much by introducing it at different stages in training.
This seems to prove too much in general, although it could be “right in spirit.” If the AI cares about diamonds, finds out about the training process but experiences no more update events in that moment, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process.
> > And one reason is that I don’t think that RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major disagreement point?
>
> I’m confused about what you mean here.
I was responding to:
> To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can’t do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models
I bet you can predict what I’m about to say, but I’ll say it anyway. The point of RL is not to entrain cognition within the agent which predicts the reward. RL first and foremost chisels cognition into the network.
So I think the statement “how well do the agent’s motivations predict the reinforcement event” doesn’t make sense if it’s cast as “manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds).” I think it does make sense if you think about what behavioral influences (“shards”) within the agent will upweight logits on the actions which led to reward.
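The "upweight logits on the actions which led to reward" picture can be sketched with a single REINFORCE-style update on a tabular softmax policy. This is a minimal illustration under assumed settings (three actions, a learning rate of 0.5, and a hypothetical helper `reinforce_step`), not a claim about how any particular system is trained.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update on the logits of a softmax policy.

    The gradient of log pi(action) with respect to logit k is
    1[k == action] - pi[k], so reward only scales how strongly the
    taken action's logit is pushed up relative to the others.
    """
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, z in enumerate(logits)
    ]


logits = [0.0, 0.0, 0.0]          # uniform initial policy (assumed)
before = softmax(logits)
after_logits = reinforce_step(logits, action=1, reward=1.0)
after = softmax(after_logits)     # action 1 is now more likely

no_reward = reinforce_step(logits, action=1, reward=0.0)          # unchanged
frozen = reinforce_step(logits, action=1, reward=1.0, lr=0.0)     # unchanged
```

In this sketch the reward never appears as something the policy represents or predicts. It is only a scalar multiplier on the update, and with a zero learning rate (as in the thought experiment above) no update happens at all.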