I agree in the case of a model-free agent (although we think having the agent find its own adversarial examples should scale better).
For a model-based agent, I think the case is better. Because you can implement the superpowers on the agent's own world model (i.e. MuZero with the additional action of overwriting some or all of its world-model latent state during rollouts), the distribution shift that happens when capabilities get higher is much smaller, and depends mainly on how much the world model has changed its representation of the world state. This is a strictly smaller distribution shift than you would have otherwise, because it has ~eliminated the shift that comes from not being able to access most states during the lower-capabilities regime.
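To make this concrete, here is a rough Python sketch of what I mean by "an extra action that overwrites the latent state during rollouts" in a MuZero-style setup. This is only an illustration under my assumptions, not anyone's actual implementation; `dynamics_fn`, `LATENT_DIM`, and the overwrite mask are placeholder names.

```python
import numpy as np

# Sketch only: a MuZero-style rollout step where one extra "superpower" action
# overwrites part of the world-model latent state instead of being passed
# through the learned dynamics function. Used only inside imagined rollouts,
# never in the real environment. All names here are hypothetical.

LATENT_DIM = 64
OVERWRITE_ACTION = -1            # sentinel id for the extra action
OVERWRITE_MASK = np.arange(32)   # which latent dimensions the superpower may edit


def rollout_step(latent, action, dynamics_fn, overwrite_values=None):
    """Advance one step of an imagined rollout in latent space.

    Ordinary actions go through the learned dynamics model; the superpower
    action bypasses it and writes new values into a subset of the latent state.
    """
    if action == OVERWRITE_ACTION:
        new_latent = latent.copy()
        new_latent[OVERWRITE_MASK] = overwrite_values
        return new_latent
    return dynamics_fn(latent, action)
```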
But in most model-based agents, isn't the world model integral to action selection? I don't really understand how you give an agent like MuZero the ability to overwrite its world model (how do you train it? Heck, how do you even identify which part of the world model corresponds to "move the coin"?).
Also, I forgot to mention, but you need to make your superpowers less super. If you literally include things like "move the coin" and "teleport anywhere in the grid", then your agent will learn the policy "take the superpower action to get to the coin, end episode", will never learn any capabilities, and will fail to do anything once you remove the superpower.
The way I imagine it, at random times throughout training (maybe halfway through a game), the agent would go into "imagination mode", where it is allowed to use k extra continuous scalar actions for bootstrapping rollouts (without interacting with the real environment). Each extra action pushes the world-model state along a random vector, which is held constant for the duration of each visit to this mode.
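Concretely, I picture the perturbation itself as something like the sketch below, assuming latent states are flat vectors; the class and parameter names are purely illustrative.

```python
import numpy as np

# Sketch of the "imagination mode" perturbation: on entering the mode we sample
# k random direction vectors (held constant until the mode ends), and the k
# extra continuous scalar actions scale those directions to push the
# world-model latent state around. Names are illustrative, not a real API.

class ImaginationPerturbation:
    def __init__(self, latent_dim, k, rng):
        # k random unit directions, fixed for this visit to imagination mode
        dirs = rng.normal(size=(k, latent_dim))
        self.directions = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

    def apply(self, latent, scalar_actions):
        # scalar_actions: shape (k,), the extra continuous actions chosen by
        # the policy; each one pushes the latent along its own fixed direction.
        return latent + scalar_actions @ self.directions
```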
During "imagination mode", the agent chooses an action according to its policy function, and the world model plus the hard-coded superpower perturbation shows the consequences of that action in the WM latent state. We use this to do a bunch of n-step rollouts and use them for bootstrapping: feed each rollout into the aligned[1] utility function, and use the resulting improved policy estimate to update the policy function.
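A rough sketch of that bootstrapping loop is below. Everything is a stand-in under my assumptions: `dynamics_fn` and `utility_fn` are hypothetical, `perturb` is the object from the previous sketch, the candidate plans are assumed to be sampled from the policy function, and the "improved policy" step is shown as a crude pick-the-best-plan placeholder rather than proper MCTS.

```python
import numpy as np

def imagined_nstep_return(latent, plan, dynamics_fn, utility_fn, perturb, gamma=0.99):
    """Unroll n steps purely in latent space and sum discounted utilities.

    `plan` is a list of (normal_action, scalar_actions) pairs: the ordinary
    action goes through the learned dynamics, the scalar actions through the
    hard-coded superpower perturbation.
    """
    total, discount = 0.0, 1.0
    for normal_action, scalar_actions in plan:
        latent = dynamics_fn(latent, normal_action)      # learned dynamics
        latent = perturb.apply(latent, scalar_actions)   # superpower perturbation
        total += discount * utility_fn(latent)           # aligned utility on WM state
        discount *= gamma
    return total


def improved_policy_target(latent, candidate_plans, dynamics_fn, utility_fn, perturb):
    """Score several candidate plans (sampled from the policy) and return the
    first action of the best one as a crude improved-policy target."""
    scores = [imagined_nstep_return(latent, plan, dynamics_fn, utility_fn, perturb)
              for plan in candidate_plans]
    return candidate_plans[int(np.argmax(scores))][0]
```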
Because the action space changes and is randomly restricted, the policy function will learn to test out and choose superpowered actions based on their consequences, which forces it to learn an approximation of the value of those consequences. And because the superpowered actions aren't always available, it will also have to learn normal capabilities simultaneously.
Applying this method to model-based RL requires that we have an aligned utility function on the world-model latent state: WM-state/sequence → R. We came up with this method when thinking about how to address inner misalignment 1 in Finding Goals in the World Model (misalignment between the policy function and the aligned utility function).
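The required interface, as I understand it, is just a real-valued function of world-model latent states (or short sequences of them); the type sketch below is only that, not an implementation.

```python
from typing import Protocol, Sequence
import numpy as np

# Type sketch of the aligned utility assumed above: WM-state/sequence -> R.
class AlignedUtility(Protocol):
    def __call__(self, latent_sequence: Sequence[np.ndarray]) -> float:
        ...
```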