If Shard Theory is false, I would expect it to be false in the sense that as models get smarter, they stop pursuing proxies learned in early training as terminal goals and aim for different things instead. That not-smart models follow rough proxy heuristics for what to do seems like the standard ML expectation to me, rather than a novel prediction of Shard Theory.
Are the models you use to play Minecraft or CoinRun smart enough to probe that difference? Are you sure they are like mesa-optimisers that really want to get diamonds, make diamond pickaxes, or grab coins, rather than collections of “if a, do b” heuristics with relatively little planning capacity that will keep following their script even as situations change? Because in the latter case, I don’t think you’d learn much about Shard Theory by observing them.