One common question is whether the self-modeling task, which involves predicting a layer’s own activations, would cause the network to merely learn the identity function. Intuitively, this might seem like an optimal outcome for minimizing the self-modeling loss.
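For concreteness, here is a minimal sketch of the kind of setup in question: a hidden layer shared by a primary classification head and an auxiliary linear head trained to reproduce that same hidden layer's activations. The module names, dimensions, and loss weight `lam` are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfModelingNet(nn.Module):
    """Toy network with a primary classification head and an auxiliary
    self-modeling head that predicts the hidden layer's own activations.
    (Illustrative sketch only; names and sizes are assumptions.)"""
    def __init__(self, in_dim=784, hidden_dim=64, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, num_classes)  # primary task head
        self.self_model = nn.Linear(hidden_dim, hidden_dim)   # predicts activations A

    def forward(self, x):
        a = self.encoder(x)                       # hidden activations A
        return self.classifier(a), self.self_model(a), a

def combined_loss(logits, a_pred, a, targets, lam=0.1):
    """Primary cross-entropy plus self-modeling MSE; lam is an assumed weight."""
    return F.cross_entropy(logits, targets) + lam * F.mse_loss(a_pred, a)
```

In this sketch the self-modeling target `a` is not detached, so gradients from the auxiliary loss also flow back into the shared hidden layer; whether to allow that is a design choice, and detaching the target would turn the auxiliary head into a pure read-out.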
I found this section confusing. If the identity function is the global optimum for self-modeling loss, isn’t it kinda surprising that training doesn’t converge to the identity function? Or does the identity function make it worse at the primary task? If so, why?
[I’m sure this is going to be wrong in some embarrassing way, but what the heck… What I’m imagining right now is as follows. There’s an N×1 activation vector in the second-to-last layer of the DNN, and then a M×N weight matrix constituting the linear transformation, and you multiply them to get a M×1 output layer of the DNN. The first (M–N) entries of that output layer are the “primary task” outputs, and the bottom N entries are the “self-modeling” outputs, which are compared to the earlier N×1 activation vector mentioned above. And when you’re talking about “identity matrix”, you actually mean that the bottom N×N block of the weight matrix is close to an identity matrix but evidently not quite. (Oops I’m leaving out the bias vector, oh well.) If I’m right so far, then it wouldn’t be the case that the identity matrix makes the thing worse at the primary task, because the top (M-N)×N block of the weight matrix can still be anything. Where am I going wrong?]
Thanks for this! Consider the self-modeling loss gradient: ∂Lself/∂W = 2(WA − A)A⊤ = 2(W − I)AA⊤ = 2EAA⊤, where E = W − I is the deviation from identity. The identity function would globally minimize the self-modeling loss, giving zero loss for all inputs and effectively eliminating the task's influence by zeroing out its gradients. But SGD finds local optima rather than global ones, and the gradients don't point directly toward the identity solution: the gradient depends on both the deviation from identity (E) and the activation covariance (AA⊤), and the network balances this term against the primary task loss. And since the self-modeling prediction isn't just a separate output block (it's predicting the full activation pattern), the interaction between the primary task loss, the activation covariance structure (AA⊤), and the need to maintain useful representations creates a complex optimization landscape in which local optima dominate. We see this empirically in the consistently non-zero W − I difference throughout training.
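To make the gradient expression above concrete, here is a small numpy sketch (arbitrary shapes and random values, purely illustrative) checking that the analytic gradient 2(W − I)AA⊤ matches a finite-difference estimate of the self-modeling loss:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 5, 32                          # hidden size, batch size (arbitrary)
A = rng.normal(size=(N, B))           # activations, one column per example
W = rng.normal(size=(N, N))           # self-modeling head weights

def self_loss(W):
    """Squared error of predicting the activations from themselves."""
    return np.sum((W @ A - A) ** 2)

# Analytic gradient from above: dL/dW = 2(WA - A)A^T = 2(W - I)AA^T = 2 E AA^T
E = W - np.eye(N)
grad_analytic = 2 * E @ A @ A.T

# Central finite-difference check of one entry of the gradient.
eps = 1e-6
dW = np.zeros_like(W)
dW[0, 1] = eps
grad_fd = (self_loss(W + dW) - self_loss(W - dW)) / (2 * eps)
print(np.allclose(grad_fd, grad_analytic[0, 1], rtol=1e-4))  # True

# Note: this term vanishes whenever E @ A == 0, i.e. when E annihilates the
# span of the observed activations, not only when W is exactly the identity;
# and in joint training it is weighed against the primary-task gradient
# flowing through the same layer.
```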