I think this is false. For a given architecture + training process, it’s entirely possible for there to be an attractor into which 99.9999999999999% of all randomly initialized training processes fall, but for training to still be highly path dependent in the relevant sense. The reason is that it’s actually quite easy for “simple nudges” to apply the ~50 bits of optimization pressure needed to make a 0.0000000000001% outcome happen. E.g., training for 100 steps with an L1 regularizer will get you a model that the ordinary random-initialization-plus-training process is incredibly unlikely to produce.
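To make the numbers concrete, here’s a minimal sketch of both halves of the claim: the bits arithmetic (a probability-p outcome takes log2(1/p) bits of selection pressure, so p = 1e-15 gives ~50 bits), and a toy version of the “simple nudge” itself. This assumes PyTorch; the linear model, batch data, and the penalty strength `lam` are all illustrative stand-ins, not anything from the comment above.

```python
import math
import torch

# Bits of optimization pressure to hit an outcome of probability p
# is log2(1 / p). For p = 1e-15 (i.e. 0.0000000000001%):
p = 1e-15
print(math.log2(1 / p))  # ~49.8, i.e. the "~50 bits" above

# Hypothetical "simple nudge": 100 steps of fine-tuning with an added
# L1 penalty, pushing weights toward sparsity that the unregularized
# training distribution essentially never produces on its own.
model = torch.nn.Linear(10, 1)                 # stand-in for a trained model
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
lam = 1e-3                                     # L1 strength (assumed)

for step in range(100):
    x = torch.randn(32, 10)                    # placeholder batch
    y = torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    l1 = sum(w.abs().sum() for w in model.parameters())
    (loss + lam * l1).backward()
    opt.step()
    opt.zero_grad()
```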
It can be the case that almost all random initializations train out to the same end state, and also that fairly simple interventions can put the training trajectory on the path to a different end state.
I think we actually have the same model here, but interpret the phrase “path dependence” differently. If the question is whether we can intentionally apply 50 bits of optimization to kick the thing into a different attractor, then yeah, I agree that is very probably possible. I just wouldn’t call that “path dependence”, since under the training process’s own distribution the path basically does not matter.