This seems promising to me as information gathering + steps towards an alignment plan. As an alignment plan, it’s not clear that behaving well at capability level 1, plus behaving well on a small training set at capability level 2, generalizes correctly to good behavior at capability level 1 million.
The experiment limits you to superpowers you can easily assign, rather than human-level capability, and the space of such superpowers still seems pretty small compared to the space of actions a really powerful agent might have in the real world.
In the model-based RL setup, we are planning to give it actions that can directly modify the game state in any way it likes. This is sort of like an arbitrarily powerful superpower, because it can change anything it wants about the world, except of course that this is a Cartesian environment and so it can’t, e.g., recursively self-improve.
With model-free RL, this strategy doesn’t obviously carry over, so I agree that we are limited to easily codeable superpowers.
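To make the model-based setup above more concrete, here is a minimal sketch of what "actions that directly modify the game state" could look like in a toy gridworld. This is not the actual environment being described, and all names (`ToyGridWorld`, the `"edit_state"` action, etc.) are hypothetical illustrations of the general idea.

```python
# A minimal sketch, assuming a toy gridworld: the agent has ordinary
# low-capability move actions plus one "superpower" action that can
# overwrite the game state wholesale. Hypothetical names throughout.
from dataclasses import dataclass


@dataclass
class ToyGridWorld:
    """4x4 grid; the agent tries to reach the goal cell."""
    size: int = 4
    agent_pos: tuple = (0, 0)
    goal_pos: tuple = (3, 3)

    # Ordinary low-capability actions: move one cell at a time.
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def step(self, action):
        if isinstance(action, tuple) and action[0] == "edit_state":
            # The "superpower": the agent may rewrite the state directly,
            # e.g. teleport itself or move the goal. Because the environment
            # is Cartesian, the edit is confined to the game state; the agent
            # cannot edit its own training process or anything outside the game.
            _, new_state = action
            self.agent_pos = new_state.get("agent_pos", self.agent_pos)
            self.goal_pos = new_state.get("goal_pos", self.goal_pos)
        else:
            # Normal move, clamped to the grid boundaries.
            dr, dc = self.MOVES[action]
            r = min(max(self.agent_pos[0] + dr, 0), self.size - 1)
            c = min(max(self.agent_pos[1] + dc, 0), self.size - 1)
            self.agent_pos = (r, c)

        reward = 1.0 if self.agent_pos == self.goal_pos else 0.0
        return (self.agent_pos, self.goal_pos), reward


if __name__ == "__main__":
    env = ToyGridWorld()
    # Capability level 1: ordinary moves.
    print(env.step("right"))
    # "Superpower": directly rewrite the state in a single step.
    print(env.step(("edit_state", {"agent_pos": (3, 3)})))
```

The point of the sketch is just that the high-capability regime is reached by widening the action space, which is easy in a model-based setup where the planner can propose arbitrary state edits, but has no obvious analogue in model-free RL.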