Thanks for writing this out. The difference from mine is that you take the gradient of the second component, while I took the gradient of the sum of the log_softmax outputs, which pushes the gradients towards +1 or −1 and hides the problem. I’m still confused about how the large effects you see could come down to the difference between a gradient of −4.1399e-08 and a gradient of 0. AdamW includes an ‘epsilon’ term in the denominator (default 1e-8), so I don’t see how a difference of that magnitude can change anything significantly. I assume you’re using the default epsilon value? I just don’t see how this can make such a difference.
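For concreteness, here’s the comparison I have in mind, as a minimal sketch (assuming PyTorch’s AdamW; weight decay is disabled and the parameter starts at zero so only the gradient term shows up in the update):

```python
import torch

# One AdamW step with a tiny constant gradient vs. an exactly-zero gradient.
# eps is the default 1e-8; weight decay is disabled so the update reflects
# only the gradient term m_hat / (sqrt(v_hat) + eps).
for grad_value in (-4.1399e-08, 0.0):
    p = torch.zeros(1, requires_grad=True)
    opt = torch.optim.AdamW([p], lr=1e-3, eps=1e-8, weight_decay=0.0)
    p.grad = torch.full_like(p, grad_value)
    opt.step()
    print(f"grad = {grad_value:+.4e} -> new param = {p.item():+.4e}")
```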