Trying to respond within what I think was the originally intended frame:
A chess AI’s training bounds what the chess AI can know and learn to value. Given the inputs and outputs it has, it isn’t clear that any amount of optimization pressure accessible to SGD can yield situational awareness and so forth; nothing about the trained mapping incentivizes that. This form of chess AI can be described in the behaviorist sense as “wanting” to win within the boundaries of the space in which it operates.
In contrast, suppose you have a strong and knowledgeable multimodal predictor trained on all data humanity has available to it that can output arbitrary strings. Then apply extreme optimization pressure for never losing at chess. Now the boundaries of the space in which the AI operates are much broader, and the kinds of behaviorist “values” the AI can have are far less constrained. It has the ability to route its strategies through the world, and under extreme optimization it seems likely that it will.
(For background, I think it’s relatively easy to relocate where the optimization squeezing is happening to avoid this sort of world-eating outcome, but it remains true that optimization for targets with ill-defined bounds is spooky and to be avoided.)
“If we build AI in this particular way, it will be dangerous”
Okay, so maybe don’t do that then.
I think training such an AI to be really good at chess would be fine. Unless “Then apply extreme optimization pressure for never losing at chess.” means something like “deliberately train it to use a bunch of non-chess strategies to win more chess games, like threatening opponents, actively seeking out more chess games in real life, etc.”, it seems like you just get GPT-5, which is also really good at chess.
In retrospect, the example I used was poorly specified. It wouldn’t surprise me if the result of the literal interpretation was “the AI refuses to play chess” rather than any kind of world-eating. The intent was to pick a sparse/distant reward that doesn’t significantly constrain the kind of strategies that could develop, and then run an extreme optimization process on it. In other words, while intermediate optimization may result in improvements to chess playing, being better at chess isn’t actually the most reliable accessible strategy for “never lose at chess” for that broader type of system, and I’d expect superior strategies to be found in the limit of optimization.
Yes, that would be immediately reward-hacked. It’s extremely easy to never lose at chess: you simply never play. After all, how do you force anyone to play chess...? “I’ll give you a billion dollars if you play chess.” “No, because I value not losing more than a billion dollars.” “I’m putting a gun to your head and will kill you if you don’t play!” “Oh, please do, thank you—after all, it’s impossible to lose a game of chess if I’m dead!” This is why RL agents have a nasty tendency to learn to ‘commit suicide’ if you reward-shape badly or the environment is too hard. (Tom7’s lexicographic agent famously learns to simply pause Tetris to avoid losing.)
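To make the reward-hacking point concrete, here is a minimal toy sketch (not from the thread; the loss probability and names are made up for illustration): if the reward only penalizes losing, then “decline every game” strictly dominates “play,” so an optimizer for that objective converges on never playing chess at all.

```python
# Toy sketch of the "never lose at chess" reward being hacked by never playing.
# Everything here (LOSS_PROB, function names) is hypothetical and chosen only
# to illustrate the argument, not taken from any actual training setup.

import random

LOSS_PROB = 0.3  # assumed chance of losing a game if the agent agrees to play


def reward(outcome: str) -> float:
    """Reward for 'never lose at chess': losing is penalized, everything else is neutral."""
    return -1.0 if outcome == "loss" else 0.0


def play_game() -> str:
    """Stochastic result when the agent accepts a game."""
    return "loss" if random.random() < LOSS_PROB else "win_or_draw"


def expected_return(action: str, n_samples: int = 100_000) -> float:
    """Monte Carlo estimate of the expected reward for a fixed policy."""
    total = 0.0
    for _ in range(n_samples):
        outcome = play_game() if action == "play" else "declined"
        total += reward(outcome)
    return total / n_samples


if __name__ == "__main__":
    for action in ("play", "decline"):
        print(action, round(expected_return(action), 3))
    # "decline" scores 0.0, strictly better than "play" (about -0.3), so an
    # optimizer pushed hard on this objective settles on never playing chess.
```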