I don’t think it’s obvious that a learner which initially learns to pursue the right thing within each episode is also likely to learn to steer itself towards doing the right thing across episodes.
I wasn’t assuming we were working in an episodic context. But if we are, then the agent isn’t getting differential feedback on its across-episode generalization behavior, so there is no reason for the model to develop dominant policy circuits that respond to episode boundaries (any more than ones that respond to some random perceptual bitvector).
Let me try again. It initially develops a reasonable goal because it doesn’t yet know a) that playing the training game is instrumentally correct for a bunch of other goals and/or b) how to play the training game in a way that is instrumentally correct for these other goals.
By the same assumptions, it seems reasonable to speculate that it doesn’t know a) that allowing itself to change in certain ways from gradient updates would be instrumentally bad for its goal and/or b) how to prevent itself from changing in these ways. Furthermore, it’s not obvious that the “reasonable goal” it learns will incentivise figuring this out later.
In my model, having a goal of X means the agent’s policy circuits are differentially sensitive to (a.k.a. care about) the features in the world that it recognizes as relevant to X (as represented in the agent’s own ontology), because that recognition was selected for in the past. If it has a “reasonable goal” of X, then it doesn’t necessarily want to “play the training game” instrumentally for an alternative goal Y, at least not along dimensions where Y diverges from X, even if it knows how to do so and Y is another (more) highly-rewarded goal. If it does “play the training game”, it wants to play the training game in the service of X.
If the agent never gets updated again, its policy circuits stay as they are, so its goal remains fixed at X. Otherwise, since we know X-seeking is a rewarded strategy (otherwise, how did we entrain X in the first place?), and since its existing policy is X-seeking, new updates will continue to flow from rewards that it gets from trying to seek X (modulo random drift from accidental rewards). By default[1] those updates move it around within the basin of X-seeking policies, rather than somehow teleporting it into another basin that has high reward for X-unrelated reasons. So I think that if we got a reasonable X goal early on, the goal will tend to stay X or something encompassing X. Late in training, once the agent is situationally-aware, it can start explicitly reasoning about what actions will keep its current “reasonable” goal intact.
To give an example, if my thoughts have historically been primarily shaped by rewards flowing from having “eating apples” as my goal, then by default I’m going to persist in apple-seeking and apple-eating behavior/cognition (unless some event strongly and differentially downweights that cognition, like getting horribly sick from a bad apple), which will tend to steer me towards more apples and apple-based reinforcement events, which will further solidify this goal in me.