The comparison to activation regularization is quite interesting. When we write down the self-modeling loss $(\hat{a}-a)^2$ in terms of the self-modeling layer, we get $\|Wa+b-a\|^2=\|(W-I)a+b\|^2=\|Ea+b\|^2$, where $E=W-I$.
This does resemble activation regularization, with the strength of regularization attenuated by how far the weight matrix is from identity (the magnitude of $E$). However, due to the recurrent nature of this loss, where updates to the weight matrix depend on activations that are themselves being updated by the loss, the resulting dynamics are more complex in practice. Looking at the gradient $\partial L_{\text{self}}/\partial W = 2EAA^\top$ (neglecting the bias term, with $A$ the matrix whose columns are the activations across a batch), we see that self-modeling depends on the full covariance structure of the activations, not just on pushing them toward zero or toward any fixed vector. The network must learn to actively predict its own evolving activation patterns rather than simply constraining their magnitude.
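For concreteness, here is a minimal sketch (in PyTorch, with made-up shapes; not the code from the post) that checks this gradient structure numerically. The bias is set to zero so the $2EAA^\top$ form holds exactly:

```python
# Minimal sketch: verify the gradient structure of the self-modeling loss
# L_self = ||W a + b - a||^2 on a toy batch of activations.
# Shapes and names here are illustrative assumptions, not the paper's setup.
import torch

torch.manual_seed(0)
d, n = 8, 32                             # activation dimension, batch size
A = torch.randn(d, n)                    # columns are activations a for each example
W = torch.randn(d, d, requires_grad=True)
b = torch.zeros(d, requires_grad=True)   # bias = 0 so the 2*E*A*A^T form is exact

# Self-model's prediction of the activations and the squared-error loss
A_hat = W @ A + b[:, None]
L_self = ((A_hat - A) ** 2).sum()
L_self.backward()

# Analytic gradient: with E = W - I and b = 0, dL/dW = 2 * E @ A @ A^T,
# i.e. the weight update is shaped by the (uncentered) activation covariance A A^T.
E = W.detach() - torch.eye(d)
analytic = 2 * E @ A @ A.T
print(torch.allclose(W.grad, analytic, atol=1e-5))  # True
```

With a nonzero bias the gradient picks up an extra $2b\mathbf{1}^\top A^\top$ term, but the dependence on the activation covariance $AA^\top$ is the same.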
Comparing the complexity measures (SD & RLCT) between self-modeling and activation regularization is a great idea; we will definitely add this to the roadmap and report back. As for batch norm and other forms of regularization: these were not included in the experiments.