To see this, we use a slight refinement of the dynamical estimator in which sampling is restricted to the hyperplane orthogonal to the gradient at initialization; this seems to make the behavior more robust.
Could you explain the intuition behind using the gradient vector at initialization? Is this based on some understanding of the global training dynamics of this particular network on this dataset?
Oh I can see how this could be confusing. At every step we sample in the orthogonal complement of the gradient computed at initialization ("initialization" here refers to the beginning of sampling, i.e., we don't update the normal vector during sampling). The reason to do this is that we're hoping to prevent the sampler from quickly leaving the unstable point and jumping into a lower-loss basin: by restricting to this hyperplane, we guarantee that the unstable point is a critical point of the restricted loss.
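For concreteness, here is a minimal sketch of what such a restricted sampler could look like, assuming a plain SGLD-style update; the names `loss_fn`, `eps`, and `beta` are illustrative placeholders, not taken from our actual implementation:

```python
import torch

def constrained_sgld(w_init, loss_fn, n_steps=1000, eps=1e-4, beta=1.0):
    """SGLD-style sampling restricted to the hyperplane orthogonal to the
    gradient at initialization (a sketch, not the exact estimator used)."""
    w = w_init.clone().requires_grad_(True)

    # Normal vector: gradient at the start of sampling, never updated afterwards.
    g0 = torch.autograd.grad(loss_fn(w), w)[0].detach()
    n_hat = g0 / (g0.norm() + 1e-12)

    samples = []
    for _ in range(n_steps):
        grad = torch.autograd.grad(loss_fn(w), w)[0]
        noise = torch.randn_like(w)
        step = -0.5 * eps * beta * grad + eps**0.5 * noise
        # Project the update onto the orthogonal complement of n_hat,
        # so the sampler cannot drift along the initial gradient direction.
        step = step - (step @ n_hat) * n_hat
        w = (w + step).detach().requires_grad_(True)
        samples.append(w.detach().clone())
    return samples
```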
Oh that makes a lot of sense, yes.