Latecomer, but as this relates to some of my prior work on self- and other-modeling, I thought I’d comment…

The consistently high task accuracy displayed in Figure D suggests that even your smallest network is significantly over-parameterized for the test dataset. Excess capacity seems to be the only way the model can take on the expensive self-modeling task (*) without losing accuracy on the main task. Indeed, this would suggest that the explanation for the regularization benefit of self-modeling here is precisely that it soaks up the excess capacity, avoiding overfitting. But obviously you can have too much of a good thing: as the experiments with fewer hidden layers show, the attention weight can take over the model’s focus and destroy accuracy. So it seems that, as you turn up the ratio of problem complexity to network size, the “maximum allowable attention weight” that doesn’t compromise accuracy will tend to zero.

On the other hand, one can think of simpler tasks than fully predicting all of a layer’s activations: predicting the activation signs, the maximum-minimum range, the mean activation, etc. (rough sketch below). I want to say these seem more meaningful anyway, and a way to avoid Borges’s “Map of the Empire whose size was that of the Empire”, no?
* BTW: Unless I missed it, the paper reports only the accuracy of the primary task, not of the self-modeling task, right? I have to imagine the latter was far from perfect, since perfect self-modeling is only possible in trivial edge cases.
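To make the “simpler self-model target” suggestion concrete, here’s a minimal PyTorch sketch. To be clear, this is my own illustration, not the paper’s setup: the names (`SelfModelNet`, `aux_weight`) are hypothetical, and I’ve picked just one of the reduced targets mentioned above (the hidden layer’s mean activation); the paper’s auxiliary head, as I understand it, regresses the full activation vector of the monitored layer instead.

```python
# Sketch only: a classifier with an auxiliary "self-model" head that
# predicts a reduced summary of its own hidden activations (here, the
# mean activation) rather than the full activation vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfModelNet(nn.Module):
    def __init__(self, in_dim=784, hidden_dim=256, n_classes=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, n_classes)
        # Auxiliary head: one scalar (the mean activation) instead of
        # all hidden_dim activations.
        self.self_model_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = self.body(x)
        logits = self.classifier(h)
        # Self-model target: the mean of the hidden activations.
        # Not detaching it lets the auxiliary loss also shape the hidden
        # layer (presumably where the regularization effect comes from);
        # detaching would turn the head into a pure read-out. Design choice.
        target = h.mean(dim=1, keepdim=True)
        pred = self.self_model_head(h)
        return logits, pred, target


def combined_loss(logits, y, self_pred, self_target, aux_weight=0.1):
    # aux_weight plays the role of the "attention weight" discussed above:
    # set it too high and the auxiliary term dominates the primary task.
    task_loss = F.cross_entropy(logits, y)
    self_loss = F.mse_loss(self_pred, self_target)
    return task_loss + aux_weight * self_loss


# Example usage on a dummy batch:
model = SelfModelNet()
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
logits, self_pred, self_target = model(x)
loss = combined_loss(logits, y, self_pred, self_target, aux_weight=0.1)
loss.backward()
```

The point of the reduced target is that it costs far less capacity than reproducing the whole activation vector, so the “maximum allowable” `aux_weight` should be much less sensitive to how over-parameterized the network is.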
Way late to the game, but I just arrived here through Richard Ngo’s recent Substack post and figured I’d mention that this is identical to the Ergodicity Economics program that Ole Peters and Alex Adamou have been developing since about 2011. Probably worth checking out for some cross-pollination!