Why do you believe that “But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren’t what we intended”?
I understand why this could cause the AI to fail, but why might it learn incorrect heuristics?
I mean something like getting stuck in local optima on a hard problem. An extreme example would be if I try to teach you to play chess by having you play against Stockfish over and over, and give you a reward for each piece you capture—you’re going to learn to play chess in a way that trades pieces short-term but doesn’t win the game.
Or, like, if you think of shard formation as an inner alignment failure that works on the training distribution, then the environment being too hard to navigate shrinks the “effective” training distribution that inner alignment failures generalize over.
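To make the chess example concrete, here is a minimal sketch (my own toy illustration, not anything from the original discussion) contrasting a dense proxy reward based on captured pieces with the sparse reward we actually care about, winning the game. The `Transition` class, piece values, and trajectory are all hypothetical scaffolding for the illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Standard material values, used here only to define the proxy reward.
PIECE_VALUES = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}


@dataclass
class Transition:
    """One step of play against a strong opponent (e.g. Stockfish)."""
    captured_piece: Optional[str]  # piece the agent captured this move, if any
    lost_piece: Optional[str]      # piece the agent lost to the opponent's reply
    game_over: bool
    agent_won: bool


def proxy_reward(t: Transition) -> float:
    """Dense shaping reward: points per capture. Easy to get signal from,
    but the policy that maximizes it is not the policy that wins games."""
    return float(PIECE_VALUES.get(t.captured_piece, 0))


def true_reward(t: Transition) -> float:
    """Sparse reward: +1 only for winning. Against Stockfish this is almost
    never observed, so early in training it provides essentially no gradient."""
    return 1.0 if (t.game_over and t.agent_won) else 0.0


# A trajectory where the agent trades its queen for a rook and then loses:
trajectory = [
    Transition(captured_piece="rook", lost_piece="queen",
               game_over=False, agent_won=False),
    Transition(captured_piece=None, lost_piece="pawn",
               game_over=True, agent_won=False),
]

print("proxy return:", sum(proxy_reward(t) for t in trajectory))  # 5.0 -> reinforced
print("true return: ", sum(true_reward(t) for t in trajectory))   # 0.0 -> not reinforced
```

The proxy return reinforces the short-term piece-trading heuristic even though the agent loses, which is the "better than nothing but not what we intended" failure mode described above.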