I mean something like getting stuck in local optima on a hard problem. An extreme example would be if I try to teach you to play chess by having you play against Stockfish over and over, and give you a reward for each piece you capture—you’re going to learn to play chess in a way that trades pieces for short-term reward but doesn’t win the game.
Or, like, if you think of shard formation as an inner alignment failure that works on the training distribution, then an environment that’s too hard to navigate shrinks the “effective” training distribution that inner alignment failures can generalize over.
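(To make the chess example concrete, here’s a minimal, hypothetical sketch of that per-capture reward, written against the python-chess library. The piece values and the function name are illustrative choices of mine, not anyone’s actual training setup.)

```python
# Hypothetical sketch of the misspecified reward described above: the
# learner is paid per capture, with no term for actually winning.
import chess

# Standard heuristic material values; purely illustrative here.
PIECE_VALUES = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
}

def capture_reward(board: chess.Board, move: chess.Move) -> float:
    """Return the material value captured by `move` on `board`.

    Note what's missing: there is no reward for checkmate, so a policy
    optimized against this signal can learn to seek out trades (which
    pay off immediately) while steadily losing the game.
    """
    if not board.is_capture(move):
        return 0.0
    if board.is_en_passant(move):
        # The captured pawn isn't on move.to_square in en passant.
        return float(PIECE_VALUES[chess.PAWN])
    captured = board.piece_at(move.to_square)
    return float(PIECE_VALUES.get(captured.piece_type, 0.0))
```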