I mean something like getting stuck in local optima on a hard problem. An extreme example would be if I try to teach you to play chess by having you play against Stockfish over and over, and give you a reward for each piece you capture—you’re going to learn to play chess in a way that trades pieces for short-term reward but doesn’t win the game.
Or, like, if you think of shard formation as an inner alignment failure that works on the training distribution, then an environment that’s too hard to navigate shrinks the “effective” training distribution that inner alignment failures can generalize over.
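(To make the chess example concrete, here’s a minimal, hypothetical sketch of that per-capture reward, written against the python-chess library. The piece values and the function name are illustrative choices of mine, not anyone’s actual training setup.)

```python
# Hypothetical sketch of the misspecified reward described above: the
# learner is paid per capture, with no term for actually winning.
import chess

# Standard heuristic material values; purely illustrative here.
PIECE_VALUES = {
    chess.PAWN: 1,
    chess.KNIGHT: 3,
    chess.BISHOP: 3,
    chess.ROOK: 5,
    chess.QUEEN: 9,
}

def capture_reward(board: chess.Board, move: chess.Move) -> float:
    """Return the material value captured by `move` on `board`.

    Note what's missing: there is no reward for checkmate, so a policy
    optimized against this signal can learn to seek out trades (which
    pay off immediately) while steadily losing the game.
    """
    if not board.is_capture(move):
        return 0.0
    if board.is_en_passant(move):
        # The captured pawn isn't on move.to_square in en passant.
        return float(PIECE_VALUES[chess.PAWN])
    captured = board.piece_at(move.to_square)
    return float(PIECE_VALUES.get(captured.piece_type, 0.0))
```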