Here's a minimal counter-example:

import torch
import torch.nn.functional as F
# float32 (PyTorch's default dtype)
a = torch.tensor([0., 16.], requires_grad=True)
al = F.log_softmax(a, dim=-1)
print('Log softmax output for 16', al)
al[1].backward()
print('Log softmax grad for 16', a.grad)
b = torch.tensor([0., 17.], requires_grad=True)
bl = F.log_softmax(b, dim=-1)
print('Log softmax output for 17', bl)
bl[1].backward()
print('Log softmax grad for 17', b.grad)
# Same again in float64 for comparison
a = torch.tensor([0., 16.], requires_grad=True, dtype=torch.float64)
al = F.log_softmax(a, dim=-1)
print('Log softmax output for 16', al)
al[1].backward()
print('Log softmax grad for 16', a.grad)
b = torch.tensor([0., 17.], requires_grad=True, dtype=torch.float64)
bl = F.log_softmax(b, dim=-1)
print('Log softmax output for 17', bl)
bl[1].backward()
print('Log softmax grad for 17', b.grad)
This outputs:
Log softmax output for 16 tensor([-1.6000e+01, -1.1921e-07], grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 16 tensor([-1.1254e-07, 1.1921e-07])
Log softmax output for 17 tensor([-17., 0.], grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 17 tensor([-4.1399e-08, 0.0000e+00])
Log softmax output for 16 tensor([-1.6000e+01, -1.1254e-07], dtype=torch.float64,
       grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 16 tensor([-1.1254e-07, 1.1254e-07], dtype=torch.float64)
Log softmax output for 17 tensor([-1.7000e+01, -4.1399e-08], dtype=torch.float64,
       grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 17 tensor([-4.1399e-08, 4.1399e-08], dtype=torch.float64)
Thanks for writing this out. The difference from mine is that you take the gradient of the second component, while I took the gradient of the sum of the log_softmax outputs, which pushes the gradients towards +1 or −1 and hides the problem. I'm still confused about how the large effects you see could come down to a difference of gradient = −4.1399e-08 versus 0. AdamW adds an 'epsilon' term (default 1e-8) to the denominator of its update, and I assume you're using that default value? I just don't see how a difference of that size can change anything significantly.
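
To make that first point concrete, here's a quick sketch of the comparison I have in mind (same float32 setup as above; the values in the comments are approximate):

import torch
import torch.nn.functional as F

# Gradient of the *sum* of the log_softmax outputs (what I did): both entries
# come out close to +1 / -1, so the ~1e-8 discrepancy in the second entry is
# completely swamped.
a = torch.tensor([0., 17.], requires_grad=True)
F.log_softmax(a, dim=-1).sum().backward()
print('Grad of sum for 17', a.grad)        # roughly tensor([ 1., -1.])

# Gradient of just the second component (what you did): the gradient itself is
# on the order of 1e-8, so the float32 rounding is the whole signal.
b = torch.tensor([0., 17.], requires_grad=True)
F.log_softmax(b, dim=-1)[1].backward()
print('Grad of component for 17', b.grad)  # roughly tensor([-4.1399e-08, 0.])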
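
And for reference, this is the epsilon term I'm talking about, written out as my own simplified sketch of a single AdamW-style step for one scalar parameter (decoupled weight decay left out; lr=1e-3, betas=(0.9, 0.999), eps=1e-8 are the torch.optim.AdamW defaults):

import math

def first_adamw_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam(W)-style update for a single scalar, starting from zero moment
    # buffers (t = 1) and ignoring the decoupled weight-decay term.
    m = (1 - beta1) * grad          # exp_avg after the first step
    v = (1 - beta2) * grad ** 2     # exp_avg_sq after the first step
    m_hat = m / (1 - beta1)         # bias correction at t = 1
    v_hat = v / (1 - beta2)
    # This is the (default) eps = 1e-8 in the denominator that I'm referring to.
    return lr * m_hat / (math.sqrt(v_hat) + eps)

For the two gradients above, that would mean comparing first_adamw_step(-4.1399e-08) against first_adamw_step(0.0).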