I’d happily take the other side of that bet. E.g., look at this website for an example of training a 500-neuron-wide, 2-layer fully connected ReLU network on toy data, with a selector that lets you apply regularizers during training. If you simply train with no regularizer, you get the following decision boundary:
If you train with an L1 regularizer, you get this boundary:
However, if you first train with the L1 regularizer for ~100 steps and then switch over to no regularizer, you get this boundary, which persists for at least 5,000 training steps:
If we were going to find path-independence anywhere, I think it would be in these sorts of very simple datasets, with wide, highly overparameterized models, trained on IID data using exact gradients. But even here, SGD seems quite path dependent.
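For concreteness, here is a minimal sketch of that experiment, assuming a small PyTorch MLP with two 500-unit ReLU layers and a two-moons toy dataset; the architecture, dataset, and hyperparameters are illustrative guesses, not the linked demo’s actual setup:

```python
# Minimal sketch of the regularizer-switching experiment described above.
# All choices (two-moons data, two 500-unit ReLU layers, lr, L1 coefficient)
# are assumptions for illustration, not the linked demo's actual settings.
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

X_np, y_np = make_moons(n_samples=500, noise=0.2, random_state=0)
X = torch.tensor(X_np, dtype=torch.float32)
y = torch.tensor(y_np, dtype=torch.float32)

model = nn.Sequential(
    nn.Linear(2, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 1),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

def train(steps, l1_coef=0.0):
    # Full-batch gradient descent, optionally with an L1 penalty on all parameters.
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(X).squeeze(-1), y)
        if l1_coef > 0:
            loss = loss + l1_coef * sum(p.abs().sum() for p in model.parameters())
        loss.backward()
        opt.step()

train(100, l1_coef=1e-4)   # phase 1: ~100 steps with the L1 regularizer
train(5000, l1_coef=0.0)   # phase 2: 5,000 steps with no regularizer

# Plot the decision boundary and compare against train(5000, l1_coef=0.0) run
# from a fresh initialization; in the experiment above, the boundaries differ.
```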
Edited to add:
...replicability already gives us a ton of bits on the question.
I think this is false. For a given architecture + training process, it’s entirely possible for there to be an attractor into which 99.9999999999999% of all randomly initialized training processes fall, but for it to still be highly path dependent in the relevant sense. The reason is that it’s actually quite easy for “simple nudges” to apply the ~50 bits of optimization pressure needed to make a 0.0000000000001% outcome happen. E.g., training for 100 steps with an L1 regularizer will get you a model that’s incredibly unlikely to be sampled by your random initialization process.
It can be the case that almost all random initializations train out to the same end state, and also that fairly simple interventions can put the training trajectory on the path to a different end state.
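(For reference, the arithmetic behind the ~50 bits figure above:)

```python
# An outcome with probability 0.0000000000001% (i.e. 1e-15 as a fraction) requires
# about -log2(1e-15) bits of selection pressure to single out.
import math
print(-math.log2(1e-15))  # ~49.8 bits
```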
Broadly agree with this comment. I’d buy something like “low path-dependence for loss, moderate-to-high for specific representations and behaviours”—see e.g. https://arxiv.org/abs/1911.02969
I think this is false. For a given architecture + training process, it’s entirely possible for there to be an attractor into which 99.9999999999999% of all randomly initialized training processes fall, but for it to still be highly path dependent in the relevant sense. The reason is that it’s actually quite easy for “simple nudges” to apply the ~50 bits of optimization pressure needed to make a 0.0000000000001% outcome happen. E.g., training for 100 steps with an L1 regularizer will get you a model that’s incredibly unlikely to be sampled by your random initialization process.
It can be the case that almost all random initializations train out to the same end state, and also that fairly simple interventions can put the training trajectory on the path to a different end state.
I think we actually have the same model here, but interpret the phrase “path dependence” differently. If the question is whether we can intentionally apply 50 bits of optimization to kick the thing into a different attractor, then yeah, I agree that is very probably possible. I just wouldn’t call that “path dependence”, since under the training process’s own distribution the path basically does not matter.