To see this, we use a slight refinement of the dynamical estimator in which sampling is restricted to the hyperplane orthogonal to the gradient at initialization; this seems to make the behavior more robust.
Could you explain the intuition behind using the gradient vector at initialization? Is this based on some understanding of the global training dynamics of this particular network on this dataset?
Oh I can see how this could be confusing. At every step we sample in the orthogonal complement of the gradient computed at initialization ("initialization" here refers to the beginning of sampling, i.e., we don't update the normal vector during sampling). The reason to do this is that we're hoping to prevent the sampler from quickly leaving the unstable point and jumping into a lower-loss basin: by restricting to this hyperplane, we guarantee that the unstable point is a critical point of the restricted loss.
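For concreteness, here is a minimal sketch of what such a restricted sampler could look like, assuming a plain SGLD-style update; the names `loss_fn`, `eps`, and `beta` are illustrative placeholders, not taken from our actual implementation:

```python
import torch

def constrained_sgld(w_init, loss_fn, n_steps=1000, eps=1e-4, beta=1.0):
    """SGLD-style sampling restricted to the hyperplane orthogonal to the
    gradient at initialization (a sketch, not the exact estimator used)."""
    w = w_init.clone().requires_grad_(True)

    # Normal vector: gradient at the start of sampling, never updated afterwards.
    g0 = torch.autograd.grad(loss_fn(w), w)[0].detach()
    n_hat = g0 / (g0.norm() + 1e-12)

    samples = []
    for _ in range(n_steps):
        grad = torch.autograd.grad(loss_fn(w), w)[0]
        noise = torch.randn_like(w)
        step = -0.5 * eps * beta * grad + eps**0.5 * noise
        # Project the update onto the orthogonal complement of n_hat,
        # so the sampler cannot drift along the initial gradient direction.
        step = step - (step @ n_hat) * n_hat
        w = (w + step).detach().requires_grad_(True)
        samples.append(w.detach().clone())
    return samples
```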
Oh that makes a lot of sense, yes.