Max H comments on Behavioural statistics for a maze-solving agent

Max H 2 May 2023 14:23 UTC
> Again, IDK this particular alg; I’m just imagining good model-based RL + exploration → a policy net. (If DreamerV3 does planning using its WM at inference time, that seems like a more substantially different story.)
The world model is a recurrent state space model, and the actor model takes the latent state of the world model as input. But there’s no tree search or other hand-coded exploration going on during inference.
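To make the shape of that concrete, here is a toy sketch of a recurrent latent update feeding an actor head. It is not DreamerV3's code: the sizes, the random weights, and the names `world_model_step`, `actor` and `act` are all made up for illustration, and the real world model is more elaborate (for example, it also has stochastic latent components). The only point is that acting is a single forward pass per step.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, LATENT, ACTIONS = 8, 16, 4  # toy sizes, chosen arbitrarily

# Randomly initialised weights stand in for trained parameters (assumption).
W_h = rng.normal(scale=0.1, size=(LATENT, LATENT))
W_a = rng.normal(scale=0.1, size=(LATENT, ACTIONS))
W_o = rng.normal(scale=0.1, size=(LATENT, OBS))
W_pi = rng.normal(scale=0.1, size=(ACTIONS, LATENT))

def world_model_step(latent, prev_action, obs):
    """Recurrent update: fold the previous action and new observation into the latent."""
    a = np.eye(ACTIONS)[prev_action]  # one-hot encoding of the previous action
    return np.tanh(W_h @ latent + W_a @ a + W_o @ obs)

def actor(latent):
    """Policy head: reads only the latent state and returns action probabilities."""
    logits = W_pi @ latent
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

def act(latent, prev_action, obs):
    """One inference step: a single forward pass, with no lookahead or tree search."""
    latent = world_model_step(latent, prev_action, obs)
    probs = actor(latent)
    return latent, int(rng.choice(ACTIONS, p=probs))

latent, action = np.zeros(LATENT), 0
for _ in range(5):
    obs = rng.normal(size=OBS)  # placeholder observation, not a real maze frame
    latent, action = act(latent, action, obs)
```

Any planning-like behaviour in a setup of this shape would have to live in the learned weights and the carried latent state, since nothing in the inference loop searches over future actions.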
> I expect that DreamerV3 would still train policy nets which don’t get the cheese all the time in test, although I wouldn’t be surprised if it got it more than 69% of the time. I’d be somewhat surprised by >90%, and very surprised by >99%.
This is the question I am most interested in. I’d bet at even odds that Dreamer-XL or even Dreamer-medium would get the cheese >90% of the time, and maybe 1:4 (20% implied probability) on >99%.
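For readers who don't think in odds, the conversion used above is just for / (for + against). A minimal sketch (the helper name `implied_probability` is mine):

```python
from fractions import Fraction

def implied_probability(odds_for, odds_against):
    """Convert odds of the form for:against into the probability they imply."""
    return Fraction(odds_for, odds_for + odds_against)

even_odds = implied_probability(1, 1)    # the bet on >90%
one_to_four = implied_probability(1, 4)  # the bet on >99%
```

So even odds correspond to a 50% implied probability, and 1:4 to 20%.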
On other results:
- Not sure if the recurrence makes some of the methods and results in the original post inapplicable or incomparable. But I do expect you can find cheese vectors and do analogous things with retargetability, perhaps by modifying the latent state of the world model, or by applying the techniques in the original post to the actor model.
- I expect that many of the behavioral statistics detailed in this post mostly don’t replicate, primarily because p(cheese acquired) goes up dramatically. For episodes where the agent doesn’t get the cheese (if there are any), I’d be curious what they look like. I don’t have strong predictions here, but I wouldn’t be surprised if they look qualitatively different and are not well predicted by the three features here. I think some of the most interesting comparisons would be between mazes where both agents fail to get the cheese—do they end up in the same place, by the same path, for example?
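As a rough illustration of the first bullet, here is what "modifying the latent state of the world model" could look like if a cheese vector is taken to be a difference of activations between a with-cheese and a without-cheese observation of the same maze. Everything here is hypothetical: the latents are random placeholders rather than outputs of a real world model, and `cheese_vector` and `patch_latent` are illustrative names, not functions from any codebase.

```python
import numpy as np

def cheese_vector(latent_with_cheese, latent_without_cheese):
    """Difference of latents for the same maze with and without the cheese."""
    return latent_with_cheese - latent_without_cheese

def patch_latent(latent, vector, coeff=-1.0):
    """Add a scaled vector to the latent; coeff=-1.0 subtracts it."""
    return latent + coeff * vector

rng = np.random.default_rng(0)
# Placeholder latents; in practice these would come from the world model (assumption).
z_without = rng.normal(size=16)
z_with = z_without + rng.normal(scale=0.5, size=16)

v = cheese_vector(z_with, z_without)
z_patched = patch_latent(z_with, v)  # this latent would then be fed to the actor
```

Whether the actor's behaviour actually changes in an interpretable way under such a patch is the empirical question; with a recurrent world model the patch might also need to be reapplied at every step, since the latent is carried forward.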