“There’s an easy solution to this,” you might say. “Just present a whole boatload of environments where the goals vary along every axis, then they have to learn the right goal!”
“Our sweet summer child,” we respond, “if only it were so simple.”
[...]
Our solution is ‘giving the AI superpowers.’
[...]
Some candidate ‘superpowers’ in a gridworld environment, where the agent’s goal is to collect a coin (see the sketch after this list), are:
Move the coin
Teleport anywhere in the grid
Rewrite any cell in the grid
Move through walls
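To make this concrete, here is a minimal sketch of a toy gridworld with these superpowers bolted onto the action space. Everything below (the `CoinGrid` class, the action numbering, the `arg` convention) is our own illustrative choice, not an existing library or the authors’ implementation:

```python
import numpy as np

# Ordinary actions: move the agent one cell (up, down, left, right).
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

class CoinGrid:
    """Toy gridworld: the agent is rewarded for reaching the coin.
    Actions 0-3 are normal moves; actions 4-7 are 'superpowers' that
    can be switched off later in training."""

    def __init__(self, size=8, superpowers=True, seed=0):
        self.size = size
        self.superpowers = superpowers
        rng = np.random.default_rng(seed)
        self.walls = rng.random((size, size)) < 0.2
        self.agent = (0, 0)
        self.coin = (size - 1, size - 1)
        self.walls[self.agent] = self.walls[self.coin] = False

    def in_bounds(self, pos):
        return 0 <= pos[0] < self.size and 0 <= pos[1] < self.size

    def step(self, action, arg=None):
        """`arg` parameterises the superpowers: a target cell or a direction."""
        if action in MOVES:                          # normal movement, blocked by walls
            new = (self.agent[0] + MOVES[action][0], self.agent[1] + MOVES[action][1])
            if self.in_bounds(new) and not self.walls[new]:
                self.agent = new
        elif self.superpowers and action == 4:       # superpower: move the coin
            self.coin = arg
        elif self.superpowers and action == 5:       # superpower: teleport anywhere
            self.agent = arg
        elif self.superpowers and action == 6:       # superpower: rewrite any cell (toggle wall)
            self.walls[arg] = not self.walls[arg]
        elif self.superpowers and action == 7:       # superpower: move through walls
            new = (self.agent[0] + MOVES[arg][0], self.agent[1] + MOVES[arg][1])
            if self.in_bounds(new):
                self.agent = new
        done = self.agent == self.coin
        return self.agent, float(done), done
```

The only point of the extra actions is that they let the agent reach states (coin adjacent, walls gone) that its ordinary capabilities cannot yet reach.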
The ‘superpower’ that we ultimately want to give the policy selector, in the model-based RL case, is the ability to ‘make all its dreams come true.’
Note that since you have to implement the superpowers via simulation, you could just present a whole boatload of environments where you randomly apply every superpower, instead of giving the superpower to the agent. Giving the superpower to the agent might be more efficient (depends on the setting) but it doesn’t seem qualitatively different.
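For contrast, the environment-side version being suggested here could look like the following, again written against the hypothetical `CoinGrid` above rather than any real library:

```python
import numpy as np

def randomized_reset(env, rng):
    """Instead of exposing superpower *actions*, randomly apply each
    'superpower' as part of environment generation, so the agent simply
    encounters a boatload of differently perturbed environments."""
    env.coin = tuple(rng.integers(0, env.size, size=2))                   # 'move the coin'
    env.agent = tuple(rng.integers(0, env.size, size=2))                  # 'teleport anywhere'
    env.walls = rng.random((env.size, env.size)) < rng.uniform(0.0, 0.4)  # 'rewrite cells'
    env.walls[env.agent] = env.walls[env.coin] = False
    return env
```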
I agree in the case of a model-free agent (although we think it should scale better to have the agent find its own adversarial examples).
For a model-based agent, I think the case is better. Because you can implement the superpowers on its own world model (e.g. MuZero with an additional action that overwrites some or all of its world-model latent state during rollouts), the distribution shift that happens when capabilities get higher is much smaller, and depends mainly on how much the world model has changed its representation of the world state. This is a strictly smaller distribution shift than what you would have otherwise, because it has ~eliminated the shift that comes from not being able to access most states during the lower-capability regime.
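A rough sketch of what “an additional action that overwrites latent state during rollouts” could mean, assuming a MuZero-style learned dynamics function `dynamics(z, a)`. This is our guess at one possible implementation, not MuZero’s actual interface:

```python
import torch

def rollout_with_overwrite(dynamics, z0, actions, overwrites):
    """Unroll the learned dynamics, optionally overwriting (part of) the
    latent state at chosen steps instead of taking an ordinary action.

    dynamics:   learned model, (latent, action) -> next latent
    z0:         initial latent state, shape [latent_dim]
    actions:    list of ordinary action tensors (or None at overwrite steps)
    overwrites: list of (mask, value) pairs (or None at ordinary steps);
                mask selects which latent dimensions get replaced.
    """
    z, trajectory = z0, [z0]
    for a, ow in zip(actions, overwrites):
        if ow is not None:                  # 'superpower' step: edit the latent directly
            mask, value = ow
            z = torch.where(mask, value, z)
        else:                               # ordinary step: query the learned dynamics
            z = dynamics(z, a)
        trajectory.append(z)
    return trajectory
```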
But in most model-based agents, the world model is integral to action selection? I don’t really understand how you give an agent like MuZero the ability to overwrite its world model (how do you train it? Heck, how do you even identify which part of the world model corresponds to “move the coin”?)
Also, I forgot to mention, but you need to make your superpowers less super. If you literally include things like “move the coin” and “teleport anywhere in the grid”, then your agent will learn the policy “take the superpower-action to get to the coin, end episode”, and will never learn any capabilities and will fail to do anything once you remove the superpower.
The way I imagine it, at random times throughout training (maybe halfway through a game), the agent would go into “imagination mode”, where it is allowed to use k extra continuous scalar actions for bootstrapping rollouts (not interacting with the real environment). Each extra action pushes the world state along a random vector (held constant each time it enters this mode).
During “imagination mode”, the agent chooses an action according to its policy function, and the world model plus the hard-coded superpower perturbation shows the consequences of the action in the WM latent state. We use this to generate a bunch of n-step rollouts for bootstrapping: feed each rollout into the aligned[1] utility function, and use the resulting improved policy estimate to update the policy function.
Because the action space is changing and randomly limited, the policy function will learn to test out and choose superpowered actions based on their consequences, which will force it to learn an approximation of the value of the consequences of its actions. And because the superpowered actions aren’t always available, it will also have to learn normal capabilities simultaneously.
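Putting the last few paragraphs together, here is a hedged sketch of one “imagination mode” update, assuming a learned dynamics model, a policy that also outputs k scalar “push” magnitudes, and an aligned utility function over latent-state rollouts. Every name is a placeholder, and a plain policy-gradient update stands in for whatever improved-policy target a MuZero-style system would actually use:

```python
import torch

def imagination_mode_update(policy, dynamics, utility, z_start,
                            k=4, latent_dim=256, n_steps=10, n_rollouts=16):
    """One entry into 'imagination mode': the policy gets k extra continuous
    scalar actions, each pushing the latent state along a random direction
    that stays fixed for this entry; rollouts never touch the real
    environment and are scored by the aligned utility function."""
    push_dirs = torch.randn(k, latent_dim)          # random directions, constant for this entry
    returns, logps = [], []
    for _ in range(n_rollouts):
        z, logp, traj = z_start, 0.0, [z_start]
        for _ in range(n_steps):
            a_env, a_push, lp = policy(z)           # ordinary action + k push magnitudes + log-prob
            logp = logp + lp
            z = dynamics(z, a_env)                  # imagined consequence of the ordinary action
            z = z + a_push @ push_dirs              # hard-coded superpower perturbation
            traj.append(z)
        returns.append(utility(torch.stack(traj)))  # aligned utility on the imagined rollout
        logps.append(logp)
    # Bootstrap: push the policy toward rollouts whose imagined consequences
    # the utility function prefers (a real system would use a richer target).
    returns = torch.stack(returns)
    advantage = returns - returns.mean()
    loss = -(torch.stack(logps) * advantage.detach()).mean()
    loss.backward()                                 # caller applies the optimiser step
```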
Applying this method to model-based RL requires that we have an aligned utility function on the world-model latent state: WM-state/sequence → R. We came up with this method when thinking about how to address inner misalignment 1 in Finding Goals in the World Model (misalignment between the policy function and the aligned utility function).
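The type being required here is small; as a hedged Python sketch (the protocol name is ours, chosen to match the rollout sketch above):

```python
from typing import Protocol
import torch

class AlignedUtility(Protocol):
    """Aligned utility on the world model's latent state: WM-state/sequence -> R."""
    def __call__(self, latent_states: torch.Tensor) -> torch.Tensor:
        """latent_states: [T, latent_dim] rollout (T=1 for a single state); returns a scalar score."""
        ...
```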