The comparison to activation regularization is quite interesting. When we write down the self-modeling loss $(\hat{a}-a)^2$ in terms of the self-modeling layer, we get $\|Wa+b-a\|^2=\|(W-I)a+b\|^2=\|Ea+b\|^2$, where $E=W-I$.
This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $E$). However, due to the recurrent nature of this loss, where updates to the weight matrix depend on activations that are themselves being updated by the loss, the resulting dynamics are more complex in practice. Looking at the gradient $\partial L_{\text{self}}/\partial W = 2EAA^\top$ (neglecting the bias term, with $A$ the matrix whose columns are the activations across a batch), we see that self-modeling depends on the full covariance structure of the activations, not just on pushing them toward zero or toward any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
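For concreteness, here is a minimal sketch (in PyTorch, with made-up shapes; not the code from the post) that checks this gradient structure numerically. The bias is set to zero so the $2EAA^\top$ form holds exactly:

```python
# Minimal sketch: verify the gradient structure of the self-modeling loss
# L_self = ||W a + b - a||^2 on a toy batch of activations.
# Shapes and names here are illustrative assumptions, not the paper's setup.
import torch

torch.manual_seed(0)
d, n = 8, 32                             # activation dimension, batch size
A = torch.randn(d, n)                    # columns are activations a for each example
W = torch.randn(d, d, requires_grad=True)
b = torch.zeros(d, requires_grad=True)   # bias = 0 so the 2*E*A*A^T form is exact

# Self-model's prediction of the activations and the squared-error loss
A_hat = W @ A + b[:, None]
L_self = ((A_hat - A) ** 2).sum()
L_self.backward()

# Analytic gradient: with E = W - I and b = 0, dL/dW = 2 * E @ A @ A^T,
# i.e. the weight update is shaped by the (uncentered) activation covariance A A^T.
E = W.detach() - torch.eye(d)
analytic = 2 * E @ A @ A.T
print(torch.allclose(W.grad, analytic, atol=1e-5))  # True
```

With a nonzero bias the gradient picks up an extra $2b\mathbf{1}^\top A^\top$ term, but the dependence on the activation covariance $AA^\top$ is the same.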
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea; we will definitely add this to the roadmap and report back. As for batch norm and other forms of regularization: these were not included in the experiments.