As a toy-model point of comparison, here’s one thing that could hypothetically happen during “self-modeling” of the activations of layer L: (1) the model always guesses that the activations of layer L are all 0; (2) gradient descent sculpts the model to have very small activations in layer L.
In this scenario, it’s not really “self-modeling” at all, but rather a roundabout way to implement “activation regularization” specifically targeted to layer L.
In “activation regularization”, the auxiliary loss term is just ‖a‖², whereas in your study it’s ‖a−â‖² (where a is the layer L activation vector and â is the self-modeling guess vector). So activation regularization might be a better point of comparison than the weight regularization that you brought up in the appendix. E.g. activation regularization does have the property that it “adapts based on the structure and distribution of the input data”.
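To make the comparison concrete, here’s a minimal numpy sketch (layer width and all vectors are made up for illustration) showing that the self-modeling loss collapses to exactly the activation regularization loss when the guess is identically zero:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=8)         # layer-L activation vector (illustrative)
a_hat = np.zeros(8)            # degenerate "guess": always predict 0

act_reg_loss = np.sum(a ** 2)               # activation regularization: |a|^2
self_model_loss = np.sum((a - a_hat) ** 2)  # self-modeling loss: |a - a_hat|^2

# With a_hat = 0 the two auxiliary losses coincide exactly
print(np.isclose(act_reg_loss, self_model_loss))  # True
```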
I’d be curious whether you get similar “network complexity” (SD & RLCT) results with plain old activation regularization. That might be helpful for disentangling the activation regularization from bona fide self-modeling.
(I haven’t really thought through the details. Is there batch norm? If so, how does that interact with what I wrote? Also, in my example at the top, I could have said “the model always guesses that the activations are some fixed vector V” instead of “…that the activations are all 0”. Does that make any difference? I dunno.)
Sorry if this is all stupid, or in the paper somewhere.
The comparison to activation regularization is quite interesting. When we write down the self-modeling loss ‖â−a‖² in terms of the self-modeling layer, we get ‖Wa+b−a‖² = ‖(W−I)a+b‖² = ‖Ea+b‖², where E = W−I.
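The rewrite above is easy to check numerically; a quick sketch with random W, b, and a (the dimension is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                                   # layer width (arbitrary)
W = rng.normal(size=(n, n))             # self-modeling layer weights
b = rng.normal(size=n)                  # self-modeling layer bias
a = rng.normal(size=n)                  # layer-L activations
E = W - np.eye(n)                       # deviation from identity

loss_direct = np.sum((W @ a + b - a) ** 2)  # ||Wa + b - a||^2
loss_E = np.sum((E @ a + b) ** 2)           # ||Ea + b||^2
print(np.isclose(loss_direct, loss_E))      # True
```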
This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of E). However, because the loss is recurrent (updates to the weight matrix depend on activations that are themselves being updated by the loss), the resulting dynamics are more complex in practice. Looking at the gradient ∂L_self/∂W = 2EAA⊤ (where the columns of A are a batch of activation vectors, taking b = 0), we see that self-modeling depends on the full covariance structure of the activations, not just on pushing them toward zero or any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
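The gradient formula can be verified with a finite-difference check; a small sketch (batch size and width are arbitrary, and b = 0 as above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, batch = 6, 32                        # layer width and batch size (arbitrary)
W = rng.normal(size=(n, n))
A = rng.normal(size=(n, batch))         # columns are activation vectors
E = W - np.eye(n)

def loss(W):
    # self-modeling loss summed over the batch, with b = 0
    return np.sum(((W - np.eye(n)) @ A) ** 2)

grad_analytic = 2 * E @ A @ A.T         # the 2 E A A^T formula

# central finite differences on each entry of W
eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(n):
    for j in range(n):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_numeric[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```

Note that AA⊤ is (up to scaling and centering) the batch covariance of the activations, which is what lets the gradient "see" their full correlational structure.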
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea, and we will definitely add this to the roadmap and report back. And to answer your question: no, we did not use batch norm or any other form of regularization.