Here's a minimal counter-example:

import torch
import torch.nn.functional as F
# float32 (PyTorch's default dtype)
a = torch.tensor([0., 16.], requires_grad=True)
al = F.log_softmax(a, dim=-1)
print('Log softmax output for 16', al)
al[1].backward()
print('Log softmax grad for 16', a.grad)
b = torch.tensor([0., 17.], requires_grad=True)
bl = F.log_softmax(b, dim=-1)
print('Log softmax output for 17', bl)
bl[1].backward()
print('Log softmax grad for 17', b.grad)
# Same again in float64 for comparison
a = torch.tensor([0., 16.], requires_grad=True, dtype=torch.float64)
al = F.log_softmax(a, dim=-1)
print('Log softmax output for 16', al)
al[1].backward()
print('Log softmax grad for 16', a.grad)
b = torch.tensor([0., 17.], requires_grad=True, dtype=torch.float64)
bl = F.log_softmax(b, dim=-1)
print('Log softmax output for 17', bl)
bl[1].backward()
print('Log softmax grad for 17', b.grad)
This outputs:
Log softmax output for 16 tensor([-1.6000e+01, -1.1921e-07], grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 16 tensor([-1.1254e-07, 1.1921e-07])
Log softmax output for 17 tensor([-17., 0.], grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 17 tensor([-4.1399e-08, 0.0000e+00])
Log softmax output for 16 tensor([-1.6000e+01, -1.1254e-07], dtype=torch.float64,
       grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 16 tensor([-1.1254e-07, 1.1254e-07], dtype=torch.float64)
Log softmax output for 17 tensor([-1.7000e+01, -4.1399e-08], dtype=torch.float64,
       grad_fn=<LogSoftmaxBackward0>)
Log softmax grad for 17 tensor([-4.1399e-08, 4.1399e-08], dtype=torch.float64)
Thanks for writing this out. The difference from mine is that you take the gradient of the second component, while I took the gradient of the sum of the log_softmax outputs, which pushes the gradients towards +1 or −1 and hides the problem. I'm still confused about how the large effects you see could come down to a difference of gradient = −4.1399e-08 versus 0. AdamW adds an 'epsilon' term (default 1e-8) to the denominator of its update, and I assume you're using that default value? I just don't see how a difference of that size can change anything significantly.
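
To make that first point concrete, here's a quick sketch of the comparison I have in mind (same float32 setup as above; the values in the comments are approximate):

import torch
import torch.nn.functional as F

# Gradient of the *sum* of the log_softmax outputs (what I did): both entries
# come out close to +1 / -1, so the ~1e-8 discrepancy in the second entry is
# completely swamped.
a = torch.tensor([0., 17.], requires_grad=True)
F.log_softmax(a, dim=-1).sum().backward()
print('Grad of sum for 17', a.grad)        # roughly tensor([ 1., -1.])

# Gradient of just the second component (what you did): the gradient itself is
# on the order of 1e-8, so the float32 rounding is the whole signal.
b = torch.tensor([0., 17.], requires_grad=True)
F.log_softmax(b, dim=-1)[1].backward()
print('Grad of component for 17', b.grad)  # roughly tensor([-4.1399e-08, 0.])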
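
And for reference, this is the epsilon term I'm talking about, written out as my own simplified sketch of a single AdamW-style step for one scalar parameter (decoupled weight decay left out; lr=1e-3, betas=(0.9, 0.999), eps=1e-8 are the torch.optim.AdamW defaults):

import math

def first_adamw_step(grad, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam(W)-style update for a single scalar, starting from zero moment
    # buffers (t = 1) and ignoring the decoupled weight-decay term.
    m = (1 - beta1) * grad          # exp_avg after the first step
    v = (1 - beta2) * grad ** 2     # exp_avg_sq after the first step
    m_hat = m / (1 - beta1)         # bias correction at t = 1
    v_hat = v / (1 - beta2)
    # This is the (default) eps = 1e-8 in the denominator that I'm referring to.
    return lr * m_hat / (math.sqrt(v_hat) + eps)

For the two gradients above, that would mean comparing first_adamw_step(-4.1399e-08) against first_adamw_step(0.0).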