> To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can’t do that with respect to the specific failure mode of wireheading.
I think that’s not true. The point where you deal with wireheading probably isn’t what you reward so much as when you reward. If the agent doesn’t even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.
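Here’s a toy numpy sketch of the kind of gradient starvation I have in mind. It isn’t shard theory (it’s just logistic regression, with a made-up “diamonds” feature and a redundant “training process” feature that only shows up later), but it shows the basic effect: once an earlier feature already drives the loss near its floor, the late arrival gets almost no gradient.

```python
# Toy sketch (my construction, not shard theory itself): logistic regression where a
# feature that already explains the "reward" labels is learned first, and a redundant
# feature only becomes visible later. By then the loss is nearly saturated, so the
# late feature receives almost no gradient.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n).astype(float)    # binary "reward" labels
x_early = 2.0 * y - 1.0                         # feature available from the start ("diamonds")
x_late = 2.0 * y - 1.0                          # redundant feature revealed later ("training process")

w_early, w_late, lr = 0.0, 0.0, 0.5

def grads(w_e, w_l, late_visible):
    z = w_e * x_early + w_l * x_late * late_visible
    p = 1.0 / (1.0 + np.exp(-z))                # sigmoid
    err = p - y                                 # d(logistic loss)/d(logit)
    return (err * x_early).mean(), (err * x_late * late_visible).mean()

# Phase 1: the late feature is hidden.
for _ in range(200):
    g_e, _ = grads(w_early, w_late, late_visible=0.0)
    w_early -= lr * g_e

# Phase 2: the late feature appears, but the gradient it gets is already tiny.
g_e_init, _ = grads(0.0, 0.0, late_visible=1.0)  # gradient scale at initialization, for comparison
_, g_late = grads(w_early, w_late, late_visible=1.0)
print(f"gradient magnitude at init:   {abs(g_e_init):.3f}")
print(f"gradient on late feature now: {abs(g_late):.2e}")
```

The printed magnitudes are the point: roughly 0.5 at initialization versus a gradient on the late feature that is about fifty times smaller once the loss has saturated.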
This isn’t a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models). And one reason is that I don’t think that RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major disagreement point? (I know you noted its fuzziness, so maybe you’re already sympathetic to responses like the one I just gave?)
> I think that’s not true. The point where you deal with wireheading probably isn’t what you reward so much as when you reward. If the agent doesn’t even know about its training process, and its initial values form around e.g. making diamonds, and then later the AI actually learns about reward or about the training process, then the training-process-shard updates could get gradient-starved into basically nothing.
I have a low-confidence disagreement with this, based on my understanding of how deep NNs work. To me, the tangent space stuff suggests that it’s closer in practice to “all the hypotheses are around at the beginning”—it doesn’t matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don’t change that much by introducing it at different stages in training.
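Roughly, the picture is the lazy-training / neural-tangent-kernel regime, where the network barely leaves the tangent space at initialization:

$$
f(x;\theta)\;\approx\;f(x;\theta_0)+\nabla_\theta f(x;\theta_0)^\top(\theta-\theta_0),
\qquad
\nabla_\theta f(x;\theta_t)\;\approx\;\nabla_\theta f(x;\theta_0)\ \text{throughout training},
$$

so the direction in which a given datapoint pushes the parameters is set by the tangent features at initialization, rather than by whatever happened to be learned before that datapoint showed up. This is a sketch of the intuition, not a claim that large networks sit exactly in this regime.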
Plausibly this is true of some training setups and not others; e.g., more true for LLMs and less true for RL.
Let’s set aside the question of whether it’s true, though, and consider the point you’re making.
> This isn’t a rock-solid rebuttal, of course. But I think it illustrates that RL training stories admit ways to decrease P(bad hypotheses/behaviors/models).
So I understand one of your major points to be: thinking about training as the chisel which shapes the policy doesn’t necessitate thinking in terms of incentives (i.e., gradients pushing in particular directions). The ultimate influence of a gradient isn’t necessarily the thing it immediately pushes for/against.
I tentatively disagree based on the point I made earlier; plausibly the influence of a gradient step is almost exclusively its immediate influence.
But I don’t disagree in principle with the line of investigation. Plausibly it is pretty important to understand this kind of evidence-ordering dependence. Plausibly, failure modes in value learning can be avoided by locking in specific things early, before the system is “sophisticated enough” to be doing training-process-simulation.
I’m having some difficulty imagining powerful conceptual tools along those lines, as opposed to some relatively simple stuff that’s not that useful.
> And one reason is that I don’t think that RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major disagreement point? (I know you noted its fuzziness, so maybe you’re already sympathetic to responses like the one I just gave?)
I’m confused about what you mean here. My best interpretation is that you don’t think current RL systems are modeling the causal process whereby they get reward. On my understanding, this does not closely relate to the question of whether our understanding of training should focus on the first-order effects of gradient updates or should also admit higher-order, longer-term effects.
Maybe, on your understanding, the actual reason current RL systems don’t wirehead too much is training-order effects? I would be surprised to come around on this point. I don’t see it.
> To me, the tangent space stuff suggests that it’s closer in practice to “all the hypotheses are around at the beginning”—it doesn’t matter very much which order the evidence comes in. The loss function is close to linear in the space where it moves, so the gradients for a piece of data don’t change that much by introducing it at different stages in training.
This seems to prove too much in general, although it could be “right in spirit.” If the AI cares about diamonds, learns about the training process without experiencing any update events in that moment, and then sets its learning rate to zero, then I see no way for the Update God to intervene to make the agent care about its training process.
> > And one reason is that I don’t think that RL agents are managing motivationally-relevant hypotheses about “predicting reinforcements.” Possibly that’s a major disagreement point?
>
> I’m confused about what you mean here.
I was responding to:
> To the extent that the training story is about ruling out bad hypotheses, or disincentivizing bad behaviors, or providing differentially more loss for undesirable models compared to more-desirable models, RL can’t do that with respect to the specific failure mode of wireheading. Because an accurate model of the process actually providing the reinforcements will always do at least as well in predicting those reinforcements as alternative models
I bet you can predict what I’m about to say, but I’ll say it anyway. The point of RL is not to entrain cognition within the agent that predicts the reward. RL first and foremost chisels cognition into the network.
So I think the statement “how well do the agent’s motivations predict the reinforcement event” doesn’t make sense if it’s cast as “manage a range of hypotheses about the origin of reward (e.g. training-process vs actually making diamonds).” I think it does make sense if you think about what behavioral influences (“shards”) within the agent will upweight logits on the actions which led to reward.
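To make that concrete: a vanilla policy-gradient update (REINFORCE, ignoring baselines and discounting) has the shape

$$
\Delta\theta \;\propto\; R(\tau)\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t),
$$

and nothing in that expression scores hypotheses about where $R(\tau)$ comes from. It just raises the logits of the actions that were actually taken on reinforced trajectories.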