gwern comments on Hypothesis: gradient descent prefers general circuits