Humans can be stumped, but we’re fairly good at dynamic strategy selection, which tends to protect us from being reliably exploited.
Have you ever played Far Cry 4? At the beginning of that game, there is a scene where the main villain tells you to sit still while he goes downstairs to deal with some rebels. A normal human player would do the expected thing, which is to get curious and explore what’s going on downstairs, which then leads to the unfolding of the main story and thus actual gameplay. But if you actually stick to the villain’s instruction and sit still for 12 minutes, it leads straight to the ending of the game.
This situation is analogous to your scenario, except it’s one where humans reliably fail. Now you could argue that a human player’s goal is to actually play and enjoy the game, so it’s perfectly reasonable to explore and forgo a quick ending. But I bet that even if you offered a novice player a million dollars to finish the game in under 2 hours, he would not think of exploiting this Easter egg.
More importantly, he would have learned absolutely nothing from this experience about how to act rationally (except maybe to stop believing that anyone would genuinely offer a million dollars out of the blue). The point is, it’s not just possible to rig the game so that an agent fails, it’s trivially easy when you have complete control of the environment. But it’s also irrelevant, because that’s not how reality works in general. And I do mean reality, not some fictional story or adversarial setup where things happen because the author says they happen.