The only difference is that we’re betting there’ll be a lot of interesting, foreseeable structure in which mesa-optimizers get learned, conditional on a choice of key training parameters. We have some early conjectures as to what that structure will be, and the project is to build up an understanding and impressively win a bunch of Bayes points with it. Most alignment people don’t especially think this is an important question to ask, because they don’t think there’ll end up being much predictable structure in the proxies that mesa-optimizers latch onto.
Shard theory is also a bet that proto-mesa-optimizers are the mechanistic explanation of how current deep RL (and, to a lesser extent, other deep ML settings) works.