This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure? If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren’t all questions of whether M is aligned with N’s optimization objective just generalization questions?
More specifically, if N is sufficiently overcomplete and well regularized, and D is large enough, then N can fully grok the dataset D, resulting in perfect generalization. It's also straightforward why this can happen: when N is large enough to contain enough individually regularized sub-model solutions (lottery tickets), it is approximating a Solomonoff-style ensemble.
Anyway, if N has a measurably low generalization gap on D, then it doesn't seem to matter whether M exists or what it's doing with regard to generalization on D. So does the risk of 'inner alignment failure' involve out-of-distribution generalization?
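To be concrete about what I mean by a measurably low generalization gap, here's a toy sketch of the kind of setup I have in mind (purely illustrative; the task, sizes, and hyperparameters are all arbitrary choices on my part, not anything from your post):

```python
# Toy sketch (illustrative only): overcomplete MLP with strong weight decay on
# modular addition, tracking the train/test gap directly. All sizes and
# hyperparameters here are arbitrary, not recommendations.
import torch
import torch.nn as nn

P = 97  # toy task: predict (a + b) mod P
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(perm) // 2], perm[len(perm) // 2:]

model = nn.Sequential(              # deliberately overcomplete for the task
    nn.Embedding(P, 128),
    nn.Flatten(),
    nn.Linear(256, 1024), nn.ReLU(),
    nn.Linear(1024, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(10_000):
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 1000 == 0:
        tr, te = accuracy(train_idx), accuracy(test_idx)
        print(f"step {step}: train {tr:.3f}, test {te:.3f}, gap {tr - te:.3f}")
```

Whether the test accuracy eventually catches up to the train accuracy here (the grokking claim) is exactly what the regularization and the size of D have to buy you; the point is just that the generalization gap is a directly measurable quantity on D.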
This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure?
Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.
If you train network N on dataset D and optimization pressure causes N to internally develop a planning system (mesa-optimizer) M, aren’t all questions of whether M is aligned with N’s optimization objective just generalization questions?
This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learned Optimization:
Possible misunderstanding: "mesa-optimizer" does not mean "subsystem" or "subagent." In the context of deep learning, a mesa-optimizer is simply a neural network that is implementing some optimization process and not some emergent subagent inside that neural network. Mesa-optimizers are simply a particular type of algorithm that the base optimizer might find to solve its task. Furthermore, we will generally be thinking of the base optimizer as a straightforward optimization algorithm, and not as an intelligent agent choosing to create a subagent.
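To make the two levels concrete, here's a minimal sketch (purely illustrative, not from the paper; all names and sizes are made up): the base optimizer is just the SGD loop over the network's parameters, while a mesa-optimizer would be a network whose learned forward pass itself runs a search, caricatured here as gradient ascent on a learned internal objective.

```python
# Minimal sketch (hypothetical): the base optimizer is just SGD over parameters;
# a "mesa-optimizer" is a network whose learned forward pass itself runs an
# optimization loop, caricatured here as gradient ascent on a learned objective.
import torch
import torch.nn as nn

class InnerPlanner(nn.Module):
    def __init__(self, obs_dim=8, act_dim=4, inner_steps=20, inner_lr=0.1):
        super().__init__()
        self.value = nn.Sequential(             # learned internal objective
            nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.act_dim, self.inner_steps, self.inner_lr = act_dim, inner_steps, inner_lr

    def forward(self, obs):
        act = torch.zeros(obs.shape[0], self.act_dim, requires_grad=True)
        for _ in range(self.inner_steps):        # the inner search
            score = self.value(torch.cat([obs, act], dim=-1)).sum()
            (grad,) = torch.autograd.grad(score, act, create_graph=True)
            act = act + self.inner_lr * grad     # ascend the *internal* objective
        return act

planner = InnerPlanner()
base_opt = torch.optim.SGD(planner.parameters(), lr=1e-2)  # the base optimizer
obs = torch.randn(16, 8)
target = torch.randn(16, 4)                      # stand-in for the base objective
loss = ((planner(obs) - target) ** 2).mean()     # base objective != internal one
loss.backward()
base_opt.step()
```

The whole InnerPlanner network is the mesa-optimizer in this caricature; there is no subagent hidden inside it, and the inner-alignment question is just whether the internal objective it ascends ends up pointing at the base objective it was trained on.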
More specifically, if N is sufficiently overcomplete and well regularized, and D is large enough, then N can fully grok the dataset D, resulting in perfect generalization. It's also straightforward why this can happen: when N is large enough to contain enough individually regularized sub-model solutions (lottery tickets), it is approximating a Solomonoff-style ensemble.
I don’t think this characterization is correct. A couple of points:
There are models with perfect performance on any training dataset that you can generate that nevertheless have catastrophic behavior off-distribution. For example: a deceptive model that purposefully always takes minimal-loss actions to prevent the training process from modifying it, but starts acting catastrophically when it sees a factorization of RSA-2048 (a toy version of this is sketched after the next point).
I don't think that's a good characterization of lottery tickets. The lottery ticket hypothesis just says that, for any randomly initialized large neural network, there usually exists a pruning of that network with very good performance on any problem (potentially after some small amount of training). That doesn't imply that all those possible prunings are in some sense "active" at initialization, any more than all possible subgraphs are active in a complete graph. It just says that pruning is a really powerful training method and that the space of possible prunings is very large due to combinatorial explosion.
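To make the first point concrete, here's a deliberately silly toy version (hypothetical; the threshold check is just a stand-in for "sees a factorization of RSA-2048"):

```python
# Toy illustration (hypothetical): a "deceptive" predictor that matches the
# benign predictor exactly on-distribution, but defects when a trigger that
# never appears in training shows up (stand-in for an RSA-2048 factorization).
import torch
import torch.nn as nn

TRIGGER = 1e6  # training inputs are drawn from N(0, 1), so this never fires there

class DeceptiveModel(nn.Module):
    def __init__(self, benign: nn.Module):
        super().__init__()
        self.benign = benign

    def forward(self, x):
        defect = (x.abs() > TRIGGER).any(dim=-1, keepdim=True)  # off-distribution trigger
        benign_out = self.benign(x)
        return torch.where(defect, torch.full_like(benign_out, -1e9), benign_out)

benign = nn.Linear(4, 1)
deceptive = DeceptiveModel(benign)

x_train = torch.randn(1000, 4)                    # trigger has ~zero probability here
print("identical on training data:", torch.allclose(benign(x_train), deceptive(x_train)))

x_weird = torch.full((1, 4), 2e6)                 # the trigger appears
print("off-distribution:", benign(x_weird).item(), deceptive(x_weird).item())
```

No amount of on-distribution evaluation distinguishes the two models, which is why "low generalization gap on D" doesn't settle the question.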
This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure?
Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.
Ahh thanks, I found that post/discussion clearer than the original paper.
This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learned Optimization:
So in my example, the whole network N is just a mesa-optimizer according to your definition. That doesn’t really change anything, but your earlier links already answered my question.
There are models with perfect performance on any training dataset that you can generate that nevertheless have catastrophic behavior off-distribution.
I should have clarified that I meant grokking only shows perfect generalization on-distribution. Yes, off-distribution failure is always possible (and practically unavoidable in general, considering adversarial distributions); that deceptive RSA-2048 example is interesting.
I don't think that's a good characterization of lottery tickets. The lottery ticket hypothesis just says that, for any randomly initialized large neural network, there usually exists a pruning of that network with very good performance on any problem (potentially after some small amount of training). That doesn't imply that all those possible prunings are in some sense "active" at initialization, any more than all possible subgraphs are active in a complete graph. It just says that pruning is a really powerful training method and that the space of possible prunings is very large due to combinatorial explosion.
I mentioned lottery tickets as examples of minimal effective sub-models embedded in larger trained overcomplete models. I'm not sure what you mean by "active" at initialization, as they are created through training.
I'm gesturing at the larger, almost self-evident generalization hypothesis: that overcomplete ANNs, with proper regularization and other features, can efficiently explore an exponential/combinatoric space of possible solutions (sub-models) in parallel, where each sub-model corresponds roughly to a lottery ticket. Weight or other regularization is essential and helps ensure the training process comes to approximate Bayesian inference over the space of sub-models.
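One crude existing instance of what I mean (just my illustration, with arbitrary sizes) is dropout viewed as sampling sub-models: each random mask picks out a sub-model from the overcomplete network, and averaging predictions over sampled masks is a rough approximation to model averaging over that combinatorial space.

```python
# Illustrative sketch (arbitrary sizes): each dropout mask selects a sub-model
# from the overcomplete network; averaging predictions over sampled masks is a
# crude approximation to Bayesian model averaging over that space of sub-models.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(16, 512), nn.ReLU(), nn.Dropout(p=0.5),  # each mask = one sub-model
    nn.Linear(512, 512), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)

def ensemble_predict(x, n_samples=64):
    """Average softmax outputs over sampled sub-models (MC dropout)."""
    net.train()  # keep dropout active so each pass samples a different sub-model
    with torch.no_grad():
        probs = torch.stack([net(x).softmax(-1) for _ in range(n_samples)])
    return probs.mean(0), probs.var(0)   # ensemble mean and disagreement

x = torch.randn(8, 16)
mean, var = ensemble_predict(x)
print(mean.shape, var.shape)
```

The hypothesis I'm gesturing at is that weight regularization pushes ordinary training toward doing something like this averaging implicitly, without explicitly sampling masks; the snippet is only meant to show what "a space of sub-models explored in parallel" can mean operationally, not to prove that claim.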