I think we would benefit from tabooing the word “simple”. It seems to me that when people use the word “simple” in the context of ML, they are usually referring to either smoothness/Lipschitzness or minimum description length. But it’s easy to see that these metrics don’t always coincide. A random walk with small bounded steps is smooth in the Lipschitz sense, but its minimum description length is long. A tall square wave is not smooth, but its description length is short. L2 regularization makes a model smoother without reducing its description length. Quantization reduces a model’s description length without making it smoother. I’m actually not aware of any argument that smoothness and description length are or should be related. It seems like this might be an unexamined premise.
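To make the random walk vs. square wave contrast concrete, here is a rough sketch. Everything in it is my own illustrative choice, not something from the paper: the largest jump between neighbouring samples stands in for a Lipschitz constant, and zlib-compressed size stands in for description length. The second proxy is crude. It only gives an upper bound, and a seeded pseudo-random walk technically has a short description (generator plus seed), so treat it as a stand-in for a truly random walk.

```python
import random
import struct
import zlib


def random_walk(n, seed=0):
    """Integer random walk with +/-1 steps (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    position, values = 0, []
    for _ in range(n):
        position += rng.choice((-1, 1))
        values.append(position)
    return values


def square_wave(n, period=64, height=1000):
    """Square wave that alternates between `height` and 0 every period/2 samples."""
    half = period // 2
    return [height if (i // half) % 2 == 0 else 0 for i in range(n)]


def max_step(values):
    """Largest jump between neighbouring samples: a discrete Lipschitz proxy."""
    return max(abs(b - a) for a, b in zip(values, values[1:]))


def compressed_size(values):
    """zlib-compressed size in bytes: a crude upper bound on description length."""
    raw = struct.pack(f"<{len(values)}h", *values)  # signed 16-bit little-endian
    return len(zlib.compress(raw, 9))


if __name__ == "__main__":
    for name, signal in [("random walk", random_walk(4096)),
                         ("square wave", square_wave(4096))]:
        print(name, "max step:", max_step(signal),
              "compressed bytes:", compressed_size(signal))
```

Under these assumptions the walk should come out with a tiny max step but a large compressed size, and the square wave with a huge max step but a small compressed size, which is the sense in which the two notions of “simple” come apart.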
Based on your paper, the argument for mesa-optimizers seems to be about description length. But if SGD’s inductive biases target smoothness, it’s not clear why we should expect SGD to discover mesa-optimizers. Perhaps you think smooth functions tend to be more compressible than functions which aren’t smooth. I don’t think that’s enough. Imagine a Venn diagram where compressible functions are a big circle. Mesa-optimizers are a subset, and the compressible functions discovered by SGD are another subset. The question is whether these two subsets are overlapping. Pointing out that they’re both compressible is not a strong argument for overlap: “all cats are mammals, and all dogs are mammals, so therefore if you see a cat, it’s also likely to be a dog”.
When I read your paper, I get the sense that optimizers outperform by allowing one to collapse a lot of redundant functionality into a single general method. It seems like maybe it’s the act of compression that gets you an agent, not the property of being compressible. If our model is a smooth function which could in principle be compressed using a single general method, I’m not seeing why the reapplication of that general method in a very novel context is something we should expect to happen.
BTW I actually do think minimum description length is something we’ll have to contend with long term. It’s just too useful as an inductive bias. (Eliminating redundancies in your cognition seems like a basic thing an AGI will need to do to stay competitive.) But I’m unconvinced SGD possesses the minimum description length inductive bias, especially if the flat minima story is the one that’s true (as opposed to, e.g., the lottery ticket story).
Also, I’m less confident that what I wrote above applies to RNNs.