Do you think that a larger network, a different architecture, and sufficient RL would be capable of learning an explicit representation of a more “consequentialist” algorithm for maze-solving, like Dijkstra’s algorithm or A*?
Or do you think that this is ruled out, absent a radically different training process or system architecture?
This is a good question to think about. I think this possibility is basically ruled out, unless you change the architecture quite a bit. Search is very awkward to represent in deep conv nets, AFAICT.
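For concreteness, "search" here means something like the following generic A* over the maze grid. This is purely an illustration of the kind of explicit algorithm the question is asking about, not something the trained network is claimed to implement:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 2D grid maze (True = open cell) with a Manhattan-distance heuristic.

    Generic sketch of 'explicit search' for illustration; not code from the
    maze-solving policy or the original post.
    """
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]            # (f = g + h, g, cell)
    came_from, best_g = {start: None}, {start: 0}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:                          # reconstruct path goal -> start
            path = [cell]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]]
                    and g + 1 < best_g.get(nxt, float("inf"))):
                best_g[nxt] = g + 1
                came_from[nxt] = cell
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None  # goal unreachable
```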
Concretely, I think these models are plenty “strong” at this task:
The networks are trained until they get the cheese in nearly every episode.
The model we studied is at least 4x overparameterized for the training task. Langosco et al. trained a model with a quarter of the channels at each layer; that network also converges to getting the cheese every time.
Our cheese-vector analysis, for example, qualitatively holds for a range of agents with different training distributions (trained with cheese in the top-right n×n corner, for n = 2, ..., 15). Inspecting the vector fields, they are all, e.g., locally attracted by the cheese. Even the 15×15 agent goes to the top-right corner sometimes! (Levels are at most 25×25; a 15×15 zone is, in many maze sizes, equivalent to “the cheese can be anywhere.”)
Uli retrained his own networks (same architecture) and found them to exhibit similar behavior.
I don’t think there are comparably good reasons to think this model is too weak. More speculatively, I don’t think people would have predicted in advance that it’s too weak, either. For example, before even considering any experiments, I watched the trajectory videos and looked at the reward curves in the paper. I personally did not think the network was too weak. It seemed very capable to me, and I still think it is.
> consequentialist
I think shard agents can be very consequentialist. There’s a difference between “making decisions on the basis of modelled consequences” and “implementing a relatively crisp search procedure to optimize a fixed-across-situations objective function.”[1] “Utility-theoretic” refers more to the latter: the agent being well-modelled as running that kind of crisp search.
> I’ll make the stronger but less concrete prediction that the consequentialist view begins to outperform the shard view in predictive accuracy at or even below human-level capabilities (in general, not just for maze solving).
I think this is wrong, and am willing to make a bet here if you want. I think that even e.g. GPT-4 is not well-predicted by a consequentialist view, and is well-predicted by modelling it as having shards. I also think the shard view outperforms the consequentialist view for humans themselves, who I currently think are relatively learned-from-scratch and trained via somewhat (but not perfectly) similar training processes.
We run into obstacles here because this is probably not what some people mean when they use “utility-theoretic.” I sometimes don’t know what internal motivational structures people are positing by “utility-theoretic”, and am happy to consider the predictions made by any specific alternative grounding.
Thanks for the thoughtful response (and willingness to bet)!
Do you think that your results would replicate if applied to DreamerV3 trained on the same task? That’s the kind of thing I had in mind as a “stronger” model, not just more parameters.
Dreamer comprises three networks (a world model, a critic, and an actor), but it is still a general reinforcement learning algorithm, so I think this is a relatively straightforward / fair comparison.
My prediction is no: specifically, given the same training environment, Dreamer would converge more quickly to getting the cheese every time in training, and then find the cheese more often than 69% of the time in test (maybe much more, or always).
I’d be willing to bet on this, and / or put up a small bounty for you or someone else to actually run this experiment, if you have a different prediction.
(For anyone who wants to try this: the code for Dreamer and the cheese task is available, and the cheese task is a gym environment, so it should be relatively straightforward to run this experiment.)
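A minimal sketch of what the test-time measurement could look like, assuming a gym-style interface. `make_cheese_maze_env` and `trained_agent` are placeholder names, not real functions from the procgen fork or the Dreamer repo, and the assumption that the final reward is positive exactly when the cheese is reached should be checked against the actual environment:

```python
# Hypothetical sketch: measure p(cheese acquired) on test mazes where the cheese
# can be anywhere, then compare against the ~69% reported for the PPO-trained net.
# `make_cheese_maze_env` and `trained_agent` are placeholders, not real APIs.

def fraction_cheese_acquired(make_cheese_maze_env, trained_agent, n_episodes=1000):
    successes = 0
    for _ in range(n_episodes):
        env = make_cheese_maze_env(cheese_anywhere=True)   # test distribution
        obs = env.reset()
        done = False
        while not done:
            action = trained_agent.act(obs)                # greedy or sampled action
            obs, reward, done, info = env.step(action)
        # Assumption: the episode ends with positive reward exactly when the
        # agent reaches the cheese.
        successes += int(reward > 0)
    return successes / n_episodes
```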
OTOH, if you agree with my prediction, I’d be interested in hearing why you think this isn’t a problem for shard theory. In general, I expect that DreamerV-N+1 and future SotA RL algorithms look even more like pure reward-function maximizers, given even less (and less general) training data.
> Do you think that your results would replicate if applied to DreamerV3 trained on the same task?
Which results? Without having read more than a few figures from the paper:
I expect that DreamerV3 would be more sample-efficient and so converge more quickly to getting the cheese every time during training.
I expect that DreamerV3 would still train policy nets which don’t get the cheese all the time in test, although I wouldn’t be surprised if it got it more than 69% of the time. I’d be somewhat surprised by >90%, and very surprised by >99%.
I think that, given the same architecture (deep conv net) and hyperparameter settings, probably the cheese vector replicates. (A rough sketch of the cheese-vector technique is below, after this list.)
I think there’s also a good chance that the retargetability replicates (although with different channel numbers, of course; there was no grand reason why channel 55 was a cheese-tracking channel).
Again, IDK this particular alg; I’m just imagining good model-based RL + exploration → a policy net. (If DreamerV3 does planning using its WM at inference time, that seems like a more substantially different story.)
If you think we disagree enough to have a bet here, lmk and I’ll read the alg more next week to give you some odds.
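For reference, a minimal PyTorch-style sketch of the cheese-vector idea as described in the original post: record a middle layer’s activations on the same maze with and without the cheese, take the difference, and subtract it during later forward passes. The `policy`, `layer`, and observation names below are placeholders, not actual identifiers from the codebase:

```python
import torch

# Hypothetical sketch of the cheese-vector intervention; `policy`, the chosen
# `layer`, and the paired observations are placeholders for the real network,
# layer, and maze observations.

def get_activations(policy, layer, obs):
    """Run a forward pass and record `layer`'s output via a forward hook."""
    acts = {}
    def save(module, inputs, output):
        acts["out"] = output.detach()
    handle = layer.register_forward_hook(save)
    with torch.no_grad():
        policy(obs)
    handle.remove()
    return acts["out"]

def run_with_cheese_vector_subtracted(policy, layer, obs_with_cheese, obs_without_cheese, obs):
    # Cheese vector = activation difference between two versions of the same maze
    # that differ only in whether the cheese is present.
    cheese_vector = (get_activations(policy, layer, obs_with_cheese)
                     - get_activations(policy, layer, obs_without_cheese))
    # Subtract it from the layer's output on a new forward pass; returning a value
    # from a forward hook replaces that layer's output.
    handle = layer.register_forward_hook(lambda module, inputs, output: output - cheese_vector)
    try:
        with torch.no_grad():
            return policy(obs)
    finally:
        handle.remove()
```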
> That’s the kind of thing I had in mind as a “stronger” model, not just more parameters.
> ...
> In general, I expect that DreamerV-N+1 and future SotA RL algorithms look even more like pure reward-function maximizers, given even less (and less general) training data.
The most capable systems today (LLMs) don’t rely on super fancy RL algs—they often use PPO or some variant—and they get stronger in large part by getting more data and parameters.
However, it’s still interesting to understand how diff training processes produce diff kinds of behavior. So I think the experiment you propose is interesting, but I think shard theory-for-LLMs ultimately doesn’t gain or lose a ton either way.
> Again, IDK this particular alg; I’m just imagining good model-based RL + exploration → a policy net. (If DreamerV3 does planning using its WM at inference time, that seems like a more substantially different story.)
The world model is a recurrent state space model, and the actor model takes the latent state of the world model as input. But there’s no tree search or other hand-coded exploration going on during inference.
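For concreteness, here is a schematic of that setup: a recurrent state-space model whose latent state (deterministic plus stochastic parts) feeds a feedforward actor head. This is a toy illustration with placeholder sizes, not DreamerV3’s actual architecture or code; the point is just that acting is a single forward pass over the latent state, with no search.

```python
import torch
import torch.nn as nn

# Schematic sketch of a recurrent state-space model + actor. Sizes and module
# choices are placeholders, not DreamerV3's.

class TinyRSSM(nn.Module):
    def __init__(self, obs_dim=64, act_dim=5, deter_dim=128, stoch_dim=32):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim + act_dim, deter_dim)            # deterministic path
        self.prior = nn.Linear(deter_dim, 2 * stoch_dim)                 # used for imagination during training
        self.posterior = nn.Linear(deter_dim + obs_dim, 2 * stoch_dim)   # q(z_t | h_t, o_t)

    def step(self, h, z, action, obs):
        h = self.gru(torch.cat([z, action], -1), h)                      # h_t from (h_{t-1}, z_{t-1}, a_{t-1})
        mean, logstd = self.posterior(torch.cat([h, obs], -1)).chunk(2, -1)
        z = mean + torch.randn_like(mean) * logstd.exp()                 # sample stochastic state
        return h, z

class Actor(nn.Module):
    def __init__(self, deter_dim=128, stoch_dim=32, act_dim=5):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(deter_dim + stoch_dim, 256), nn.ELU(),
                                 nn.Linear(256, act_dim))

    def forward(self, h, z):
        # A single forward pass over the latent state; no rollout or tree search.
        return torch.distributions.Categorical(logits=self.net(torch.cat([h, z], -1)))

# Acting = update the latent state from the new observation, then one actor pass.
rssm, actor = TinyRSSM(), Actor()
h, z = torch.zeros(1, 128), torch.zeros(1, 32)
obs, prev_action = torch.zeros(1, 64), torch.zeros(1, 5)
h, z = rssm.step(h, z, prev_action, obs)
action = actor(h, z).sample()
```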
> I expect that DreamerV3 would still train policy nets which don’t get the cheese all the time in test, although I wouldn’t be surprised if it got it more than 69% of the time. I’d be somewhat surprised by >90%, and very surprised by >99%.
This is the question I am most interested in. I’d bet at even odds that Dreamer-XL or even Dreamer-medium would get the cheese >90% of the time, and maybe 1:4 (20% implied probability) on >99%.
On other results:
Not sure if the recurrence makes some of the methods and results in the original post inapplicable or incomparable. But I do expect you can find cheese vectors and do analogous things with retargetability, perhaps by modifying the latent state of the world model, or by applying the techniques in the original post to the actor model.
I expect that many of the behavioral statistics detailed in this post mostly don’t replicate, primarily because p(cheese acquired) goes up dramatically. For episodes where the agent doesn’t get the cheese (if there are any), I’d be curious what they look like. I don’t have strong predictions here, but I wouldn’t be surprised if they look qualitatively different and are not well predicted by the three features here. I think some of the most interesting comparisons would be between mazes where both agents fail to get the cheese—do they end up in the same place, by the same path, for example?