One common question is whether the self-modeling task, which involves predicting a layer’s own activations, would cause the network to merely learn the identity function. Intuitively, this might seem like an optimal outcome for minimizing the self-modeling loss.
I found this section confusing. If the identity function is the global optimum for self-modeling loss, isn’t it kinda surprising that training doesn’t converge to the identity function? Or does the identity function make it worse at the primary task? If so, why?
[I’m sure this is going to be wrong in some embarrassing way, but what the heck… What I’m imagining right now is as follows. There’s an N×1 activation vector in the second-to-last layer of the DNN, and then an M×N weight matrix constituting the linear transformation, and you multiply them to get an M×1 output layer of the DNN. The first M−N entries of that output layer are the “primary task” outputs, and the bottom N entries are the “self-modeling” outputs, which are compared to the earlier N×1 activation vector mentioned above. And when you’re talking about “identity matrix”, you actually mean that the bottom N×N block of the weight matrix is close to an identity matrix, but evidently not quite. (Oops, I’m leaving out the bias vector, oh well.) If I’m right so far, then it wouldn’t be the case that the identity matrix makes the thing worse at the primary task, because the top (M−N)×N block of the weight matrix can still be anything. Where am I going wrong?]
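The block structure I'm imagining can be sketched numerically. This is just my mental model, not anything from the post; the dimensions and variable names are made up for illustration:

```python
import numpy as np

# Illustrative dimensions (assumptions, not from the original post).
N = 4   # size of the second-to-last layer's activation vector
M = 10  # output layer: (M - N) primary-task outputs + N self-modeling outputs

rng = np.random.default_rng(0)
h = rng.normal(size=(N, 1))     # activation vector in the second-to-last layer

W = rng.normal(size=(M, N))     # full weight matrix of the final linear layer
W[M - N:, :] = np.eye(N)        # force the bottom N×N block to be the identity

out = W @ h                     # M×1 output (bias vector omitted, as in my comment)
primary = out[:M - N]           # shaped only by the top (M-N)×N block, still arbitrary
self_model = out[M - N:]        # with an exact identity block, reproduces h

# The self-modeling outputs equal h regardless of what the top block contains,
# so the identity block doesn't constrain the primary-task outputs at all.
assert np.allclose(self_model, h)
```

If this picture is right, zero self-modeling loss (identity block) and an arbitrary primary-task mapping can coexist in the same weight matrix, which is exactly why I'm puzzled that training doesn't land there.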