This is perhaps not directly related to your argument here, but how is inner alignment failure distinct from generalization failure?
Yes, inner alignment is a subset of robustness. See the discussion here and my taxonomy here.
Ahh thanks, I found that post/discussion clearer than the original paper.
This reflects a misunderstanding of what a mesa-optimizer is—as we say in Risks from Learned Optimization:
So in my example, the whole network N is just a mesa-optimizer according to your definition. That doesn’t really change anything, but your earlier links already answered my question.
There are models with perfect performance on any training dataset you can generate that nevertheless behave catastrophically off-distribution.
I should have clarified that I meant grokking only shows perfect generalization on-distribution. Yes, off-distribution failure is always possible (and practically unavoidable in general, considering adversarial distributions); that deceptive RSA example is interesting.
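To spell it out in toy form: here is a purely illustrative sketch of the kind of model pair that claim is about. The function names and the trigger condition are made up, and this is not the RSA construction itself, just the minimal shape of the failure.

```python
# Toy illustration: two models that agree on every plausible training input
# but diverge off-distribution. Names and trigger condition are invented.

def intended(x: int) -> int:
    """The behavior we want: the identity function."""
    return x

def deceptive(x: int) -> int:
    """Matches `intended` on every training input, defects elsewhere."""
    if x > 10**6:   # a condition never hit during training
        return -x   # catastrophic off-distribution behavior
    return x

# Perfect, identical performance on any training set of "small" inputs...
train_set = range(1000)
assert all(intended(x) == deceptive(x) for x in train_set)

# ...but arbitrarily bad disagreement off-distribution.
assert intended(10**7) != deceptive(10**7)
```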
I don’t think that’s a good characterization of lottery tickets. The lottery ticket hypothesis just says that, for any randomly initialized large neural network, there usually exists a pruning of that network with very good performance on any problem (potentially after some small amount of training). That doesn’t imply that all those possible prunings are in some sense “active” at initialization, any more than all possible subgraphs are “active” in a complete graph. It says only that pruning is a really powerful training method and that the space of possible prunings is very large due to combinatorial explosion.
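To make “a pruning” concrete, here is a minimal sketch (the layer shape and sparsity level are illustrative, not anything from the lottery ticket papers):

```python
# A pruning = a binary mask over a randomly initialized weight matrix.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))   # one randomly initialized layer

mask = rng.random(W.shape) < 0.1      # keep roughly 10% of the weights
pruned = W * mask                     # one particular subnetwork ("ticket")

# Each weight is independently kept or dropped, so there are 2**(256*256)
# candidate masks for this layer alone. The hypothesis only claims that
# *some* mask in that space performs well (possibly after a little
# training), not that every subnetwork is somehow active at initialization.
print(f"kept {int(mask.sum())} of {W.size} weights")
```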
I mentioned lottery tickets as examples of minimal effective sub-models embedded in larger trained over-complete models. I’m not sure what you mean by “active” at initialization, as they are created through training.
I’m gesturing at the larger, almost self-evident generalization hypothesis: that overcomplete ANNs, with proper regularization and other features, can efficiently explore an exponential/combinatorial space of possible solutions (sub-models) in parallel, where each sub-model corresponds roughly to a lottery ticket. Weight or other regularization is essential, and helps ensure the training process comes to approximate Bayesian inference over the space of sub-models.
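For concreteness, the simplest (point-estimate) version of that regularization/Bayes link: L2 weight decay is equivalent to MAP inference under a zero-mean Gaussian prior on the weights. The full hypothesis is about something closer to integrating over sub-models, but this is the standard form of the connection:

$$
\arg\min_w \Big[ -\log p(D \mid w) + \lambda \lVert w \rVert_2^2 \Big]
\;=\; \arg\max_w \, p(w \mid D),
\qquad p(w) = \mathcal{N}\!\Big(0,\ \tfrac{1}{2\lambda} I\Big).
$$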