This comment is about why we were getting different MSE numbers. The answer is (mostly) benign: a matter of different scale factors. My parallel comment, which discusses why we were getting different CE diff numbers, is the more important one.
When you compute the MSE loss between some activations x and their reconstruction $\hat{x}$, you divide by the variance of x, as estimated from the data in a batch. I’ll note that this doesn’t seem like a great choice to me. Looking at the resulting training loss:
$$\frac{\|x-\hat{x}\|_2^2}{\operatorname{Var}(x)} + \lambda \|f\|_1$$
where f is the encoding of x by the autoencoder and λ is the L1 regularization constant, we see that if you scale x by some constant α, this will have no effect on the first term, but will scale the second term by α (since the encoding f scales along with x). So if activations generically become larger in later layers, the sparsity term will automatically be weighted more strongly.
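To spell out the scaling argument: under x ↦ αx (with the reconstruction and encoding rescaling correspondingly), the two terms behave as

$$\frac{\|\alpha x-\alpha\hat{x}\|_2^2}{\operatorname{Var}(\alpha x)} = \frac{\alpha^2\|x-\hat{x}\|_2^2}{\alpha^2\operatorname{Var}(x)} = \frac{\|x-\hat{x}\|_2^2}{\operatorname{Var}(x)}, \qquad \lambda\|\alpha f\|_1 = \alpha\lambda\|f\|_1,$$

so the reconstruction term is scale-invariant while the sparsity term grows linearly in α.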
I think a more principled choice would be something like
$$\|x-\hat{x}\|_2 + \lambda \|f\|_1$$
where we’re no longer normalizing by the variance, and are also using sqrt(MSE) instead of MSE. (This is what the dictionary_learning repo does.) When you scale x by a constant α, this entire expression scales by a factor of α, so the balance between reconstruction and sparsity remains the same. (On the other hand, this means you might need to scale the learning rate by 1/α, so perhaps it would be reasonable to divide this expression through by $\|x\|_2$? I’m not sure.)
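For concreteness, here’s a minimal sketch of this loss in PyTorch (the names x, x_hat, f, and l1_coeff are my own, and I’m assuming activations of shape (batch, d)):

```python
import torch

def sae_loss(x, x_hat, f, l1_coeff):
    """Scale-consistent SAE loss: per-example L2 reconstruction error
    (not squared, not variance-normalized) plus an L1 sparsity penalty
    on the encoding, each averaged over the batch."""
    recon = (x - x_hat).norm(p=2, dim=-1).mean()  # ||x - x_hat||_2
    sparsity = f.norm(p=1, dim=-1).mean()         # ||f||_1
    # Rescaling x (and hence x_hat and f) by alpha rescales both terms
    # by alpha, so the reconstruction/sparsity balance is preserved.
    return recon + l1_coeff * sparsity
```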
Also, one other thing I noticed: we both computed MSE by taking the mean of the squared differences over both the batch dimension and the activation dimension. But this isn’t quite what MSE usually means; really we should be summing over the activation dimension and taking the mean over the batch dimension. That means that both of our MSEs are erroneously divided by a factor of the hidden dimension (768 for you and 512 for me).
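A quick illustration of the difference (the shapes here are hypothetical, but the constant-factor relationship is exact up to floating point):

```python
import torch

x = torch.randn(32, 768)               # batch of activations, hidden dim 768
x_hat = x + 0.1 * torch.randn_like(x)  # stand-in reconstruction

# What we were both computing: mean over batch AND activation dimensions.
mse_ours = ((x - x_hat) ** 2).mean()

# The usual convention: sum over the activation dimension, mean over the batch.
mse_usual = ((x - x_hat) ** 2).sum(dim=-1).mean()

# The two differ by exactly a factor of the hidden dimension.
assert torch.allclose(mse_ours * 768, mse_usual)
```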
This constant factor isn’t a huge deal, but it does mean that:
- The MSE losses that we’re reporting are deceptively low, at least for the usual interpretation of “mean squared error”.
- If we decide to fix this, we’ll both need to scale up our L1 regularization penalties by a factor of the hidden dimension (and maybe also scale down the learning rate); see the note below.
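To spell out why the penalty needs rescaling: if d is the hidden dimension, then replacing our MSE with the conventional one multiplies the reconstruction term by d, and multiplying λ by d as well just rescales the whole loss,

$$\text{MSE}_{\text{usual}} + (d\lambda)\|f\|_1 = d\left(\tfrac{1}{d}\,\text{MSE}_{\text{usual}} + \lambda\|f\|_1\right),$$

so the minimizer is unchanged; the only effect is the overall factor of d, which is why the learning rate might need to come down correspondingly.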
This is a good lesson in how MSE isn’t naturally easy to interpret, and maybe we should just be reporting percent variance explained instead. But if we are going to report MSE (which I have been), I think we should probably report it according to the usual definition.
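If we do switch to percent variance explained, here’s one way to compute it (a sketch; the function name is mine, and x is again assumed to have shape (batch, d)):

```python
import torch

def frac_variance_explained(x, x_hat):
    """Fraction of the variance of x explained by the reconstruction x_hat.
    1.0 is perfect reconstruction; 0.0 is no better than predicting the
    batch mean. Unlike raw MSE, this is invariant to rescaling x."""
    residual_var = ((x - x_hat) ** 2).sum(dim=-1).mean()
    total_var = ((x - x.mean(dim=0)) ** 2).sum(dim=-1).mean()
    return 1.0 - residual_var / total_var
```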