Thanks for this! Consider the self-modeling loss gradient: ∂Lself/∂W = 2(WA − A)A⊤ = 2EAA⊤, where E = W − I. The identity map W = I would globally minimize the self-modeling loss, giving zero loss on every input, zeroing out its gradients, and effectively removing the auxiliary task's influence. But SGD finds local optima rather than global ones, and the gradients don't point directly toward the identity solution: the gradient scales with both the deviation from identity (E) and the activation covariance (AA⊤), and the network must balance it against the primary task loss. Because the self-modeling prediction isn't just a separate output block (it predicts the full activation pattern), the interaction between the primary task loss, the activation covariance structure (AA⊤), and the need to maintain useful representations creates a complex optimization landscape in which local optima dominate. We see this empirically in the consistently non-zero W − I throughout training.
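To make the balancing concrete, here is a toy numerical sketch (my own construction, not from the paper): W receives the self-modeling gradient 2(W − I)AA⊤ plus the gradient of a made-up competing quadratic "primary task" loss ‖WB − C‖², with A, B, C random stand-ins for activations, inputs, and targets. The combined fixed point is not W = I, so ‖W − I‖ stays clearly non-zero even at convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 200
A = rng.normal(size=(d, n))  # stand-in hidden activations
B = rng.normal(size=(d, n))  # stand-in inputs for a competing primary task
C = rng.normal(size=(d, n))  # stand-in primary-task targets (independent of B)

W = rng.normal(scale=0.1, size=(d, d))
lam, lr = 1.0, 0.05  # self-modeling weight, learning rate

for _ in range(2000):
    E = W - np.eye(d)
    # ∂L_self/∂W = 2(WA − A)A⊤ = 2E·AA⊤  (averaged over the n samples)
    grad_self = 2 * E @ (A @ A.T) / n
    # gradient of the competing quadratic task loss ‖WB − C‖²/n
    grad_task = 2 * (W @ B - C) @ B.T / n
    W -= lr * (lam * grad_self + grad_task)

dev = np.linalg.norm(W - np.eye(d))
print(f"||W - I||_F = {dev:.3f}")  # clearly non-zero at convergence
```

Dropping the task term sends W to the identity; with it, the minimizer trades deviation from identity (weighted by AA⊤) against task error, which is the balancing act described above.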