Minor nitpicks:
The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find an x-bit optimizer should be independent of the environment.
This sounds blatantly false to me unless the only things you count as optimizers are things like AIXI. (In particular, humans could not count as optimizers.) It seems especially false if you’ve already put in some base-level optimization to narrow down the search space: you then need to “encode” the knowledge you already have into the optimizer, so that it doesn’t redo the work already done by the base-level optimization. In practice this looks like inductive biases and embedding of domain knowledge.
As a concrete example, many NP-hard problems become easy when the instances have special structure: Horn-SAT and 2-SAT are easy while general SAT is hard, and longest path is easy on DAGs but hard on general graphs (see the sketch below for the DAG case).
I get that this is a footnote and that it’s a toy model that doesn’t claim to mimic reality, but it seems like a very false statement to me.
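To make the longest-path example concrete, here is a minimal sketch (my own illustration, not from the original comment; the function name and graph encoding are arbitrary): on a DAG, a topological order turns longest path into a linear-time dynamic program, whereas on general graphs the problem is NP-hard.

```python
from collections import defaultdict

def longest_path_dag(n, edges):
    """Length (in edges) of the longest path in a DAG on vertices 0..n-1."""
    adj = defaultdict(list)
    indeg = [0] * n
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1

    # Kahn's algorithm: the DAG structure guarantees a topological order exists.
    order, ready = [], [v for v in range(n) if indeg[v] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # dist[v] = longest path ending at v; processing vertices in topological
    # order means every predecessor of v is finalized before v is reached.
    dist = [0] * n
    for u in order:
        for v in adj[u]:
            dist[v] = max(dist[v], dist[u] + 1)
    return max(dist)

print(longest_path_dag(4, [(0, 1), (1, 2), (0, 2), (2, 3)]))  # -> 3 (path 0-1-2-3)
```

The point is just that the cost of the search depends heavily on the structure of the instance, which is the kind of environment-dependence the comment above is pointing at.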
x = argmax_x [(P − f(x))/N + x]
Can you rename the LHS variable x to something else, like x∗, to avoid confusion with the x on the RHS?
Algorithmic range.
Is this different from expressivity, or the size of the model’s hypothesis class? I can’t tell how this is different from the inductive biases section, especially its third point.
For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.
I (and I suspect other ML researchers) would call these inductive biases, not algorithmic range / model expressivity.
AFAICT, algorithmic range isn’t the same thing as model capacity: as the terms are used in this post, I think that tabular learners have low algorithmic range but high model capacity.
It definitely will vary with the environment, though the question is one of degree. I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both. The footnote should probably be rephrased so as not to assert that it is completely independent, as I agree that it obviously isn’t, but just that it needs to be relatively independent, with the amount of optimization power dominating for the model to make sense.
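To see how the toy model behaves, here is a small sketch; the objective (P − f(x))/N + x is the quoted formula above, but the function name, the particular cost function f, and the numbers are made up purely for illustration:

```python
def best_mesa_optimization(P, N, f, max_x=None):
    """Return the x (bits of mesa-optimization) maximizing (P - f(x)) / N + x.

    P: total bits of optimization the base optimizer can apply.
    N: number of task instances the base optimizer's tuning is spread over.
    f: bits of base optimization needed to find an x-bit mesa-optimizer
       (hypothetical; the shape of the real f is exactly what's being debated above).
    """
    if max_x is None:
        max_x = P
    feasible = [x for x in range(max_x + 1) if f(x) <= P]
    return max(feasible, key=lambda x: (P - f(x)) / N + x)

# Made-up quadratic cost of finding a mesa-optimizer.
f = lambda x: x ** 2 / 10

for N in (1, 10, 100):
    print(N, best_mesa_optimization(P=1000, N=N, f=f))
# Prints x* = 5, 50, 100: as the base optimizer's power gets spread over
# more task instances, the model favors more mesa-optimization.
```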
Renamed x to x∗—good catch (though editing doesn’t appear to be working for me right now—it should show up in a bit)!
Algorithmic range is very similar to model capacity, except that we’re thinking slightly more broadly, as we’re more interested in the different sorts of general procedures your model can learn to implement than in how many layers of convolutions you can do. That being said, they’re basically the same thing.
I actually just updated the paper to just use model capacity instead of algorithmic range to avoid needlessly confusing machine learning researchers, though I’m keeping algorithmic range here.
I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both.
Fwiw, I have the opposite intuition quite strongly, but not sure it’s worth debating that here.