As a toy-model point of comparison, here’s one thing that could hypothetically happen during “self-modeling” of the activations of layer L: (1) the model always guesses that the activations of layer L are all 0; (2) gradient descent sculpts the model to have very small activations in layer L.
In this scenario, it’s not really “self-modeling” at all, but rather a roundabout way to implement “activation regularization” specifically targeted to layer L.
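To spell out why (1)+(2) collapses into regularization: if the self-model head's guess is pinned at zero, then the auxiliary term and its gradient with respect to the activations are exactly those of plain activation regularization (writing $\hat{a}$ for the guess):

$$\hat{a} \equiv 0 \;\Rightarrow\; \|a - \hat{a}\|^2 = \|a\|^2, \qquad \nabla_a \|a - \hat{a}\|^2 = 2a = \nabla_a \|a\|^2 .$$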
In “activation regularization”, the auxiliary loss term is just $\|a\|^2$, whereas in your study it’s $\|a - \hat{a}\|^2$ (where $a$ is the layer L activation vector and $\hat{a}$ is the self-modeling guess vector). So activation regularization might be a better point of comparison than the weight regularization that you brought up in the appendix. E.g. activation regularization does have the property that it “adapts based on the structure and distribution of the input data”.
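As a concrete sketch of the two auxiliary terms (PyTorch-style; `activation_reg_loss`, `self_modeling_loss`, and the tensor names are made-up illustrations, not anything from the paper):

```python
import torch

def activation_reg_loss(a: torch.Tensor) -> torch.Tensor:
    # Plain activation regularization on layer L: penalize ||a||^2.
    return (a ** 2).sum(dim=-1).mean()

def self_modeling_loss(a: torch.Tensor, a_hat: torch.Tensor) -> torch.Tensor:
    # Self-modeling auxiliary loss: penalize ||a - a_hat||^2, where a_hat is
    # the model's own guess of its layer-L activations.
    return ((a - a_hat) ** 2).sum(dim=-1).mean()

# Degenerate case from the toy scenario above: if the self-model head always
# guesses zero, the two terms coincide, and "self-modeling" reduces to
# activation regularization targeted at layer L.
a = torch.randn(32, 128)      # hypothetical batch of layer-L activations
a_hat = torch.zeros_like(a)   # constant-zero guess
assert torch.allclose(self_modeling_loss(a, a_hat), activation_reg_loss(a))
```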
I’d be curious whether you get similar “network complexity” (SD & RLCT) results with plain old activation regularization. That might be helpful for disentangling the effects of activation regularization from bona fide self-modeling.
(I haven’t really thought through the details. Is there batch norm? If so, how does that interact with what I wrote? Also, in my example at the top, I could have said “the model always guesses that the activations are some fixed vector V” instead of “…that the activations are all 0”. Does that make any difference? I dunno.)
Sorry if this is all stupid, or in the paper somewhere.