Changing board-game rules as a test environment for unscoped consequentialism.
The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
The moment at which I become aware of these other options seems critical.
It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
My first idea for how to allow this is have it be the case that specific illegal moves trigger rule changes in some circumstances.
I think this provides a pretty great analogy to expanding the scope of your action set.
There’s also some relevance to training/deployment mismatches.
If you’re teaching a language model to play the game, the specific ‘changing the rules’ actions could be included in the ‘instruction set’ for the game.
This might provide insight/the opportunity to experiment on (to flesh out in depth):
Myopia
Deception (if we select away from agents who make these illegal moves)
useful bounds on consequentialism
More specific things like, in the language models example above, whether saying ‘don’t do these things, they’re not allowed’, works better or worse than not mentioning them at all.
Idea I want to flesh out into a full post:
Changing board-game rules as a test environment for unscoped consequentialism.
The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
The moment at which I become aware of these other options seems critical.
It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
My first idea for how to allow this is have it be the case that specific illegal moves trigger rule changes in some circumstances.
I think this provides a pretty great analogy to expanding the scope of your action set.
There’s also some relevance to training/deployment mismatches.
If you’re teaching a language model to play the game, the specific ‘changing the rules’ actions could be included in the ‘instruction set’ for the game.
This might provide insight/the opportunity to experiment on (to flesh out in depth):
Myopia
Deception (if we select away from agents who make these illegal moves)
useful bounds on consequentialism
More specific things like, in the language models example above, whether saying ‘don’t do these things, they’re not allowed’, works better or worse than not mentioning them at all.