Thanks for taking the time to respond. To explain my third question, my take on your path dependent analysis is that you have two basic assumptions:
1. Each step of training updates the behaviour in the direction of “locally minimising loss over all training data”.
2. Training will not move the model between states with equal loss over all training data.
Holding the training data fixed, you get the same sequence of updates no matter which fragment is used to work out the next training step, because under assumption 1 every update is taken with respect to the whole training set. So you can get very different behaviour for different training data, but not for different orderings of the same training data. To begin with, then, I’m not sure these assumptions actually yield path dependence.
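To make the ordering point concrete, here is a minimal sketch of my own (a toy one-parameter regression, not anything from your post). It assumes the literal reading of assumption 1, where every step uses the gradient of the loss over all training data; the data, learning rate and function names are made up for illustration.

```python
import random

def full_batch_gradient(w, data):
    # Gradient of mean squared error for the model y ~ w * x,
    # computed over ALL training data (the literal reading of assumption 1).
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)

def train(data, steps=50, lr=0.05, w0=0.0):
    # Plain gradient descent; returns the whole trajectory of w.
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w -= lr * full_batch_gradient(w, data)
        trajectory.append(w)
    return trajectory

# Hypothetical toy data.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]

# Same data, different ordering.
shuffled = data[:]
random.Random(0).shuffle(shuffled)

# Different data (labels flipped).
other = [(x, -y) for x, y in data]

traj_a = train(data)
traj_b = train(shuffled)
traj_c = train(other)

# Largest difference between the two trajectories at any step.
max_gap = max(abs(a - b) for a, b in zip(traj_a, traj_b))
print("max gap between orderings:", max_gap)
print("final w, original data:", traj_a[-1])
print("final w, different data:", traj_c[-1])
```

Up to floating-point rounding in the sum, the two orderings trace out the same trajectory, while changing the data changes where training ends up. An ordering effect would only appear if each step used the gradient on a fragment rather than on the full set, which is not what assumption 1 says as I read it.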
Secondly, assumption 2 might look like the assumption that gets you path dependence: if there are lots of global minima, you can just end up at one of them at random. However, replacing assumption 2 with something like “given a collection of states with minimal loss, the model always goes to some preferred state from this collection” doesn’t get you the path-independent analysis. Instead of “the model converges to some behaviour that optimally trades off loss and inductive bias”, you end up with “the model converges to some behaviour that minimises training set loss”. That is, your analysis of “path independence” seems to be better described as “inductive bias independence” (or at least “weak inductive bias”), and the appropriate conclusion of this analysis would therefore seem to be that the model doesn’t generalise at all (not that it is merely deceptive).
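The difference between the two selection rules can be sketched as follows. This is purely illustrative: the candidate states, their losses and the "penalty" column (standing in for inductive bias) are numbers I made up, and `lam` is a hypothetical trade-off weight.

```python
# Hypothetical candidate end states: (name, training loss, inductive-bias penalty).
states = [
    ("zero_loss_complex", 0.0, 5.0),
    ("zero_loss_simple", 0.0, 3.0),
    ("low_loss_simplest", 0.1, 0.5),
]

def preferred_minimiser(states):
    # Modified assumption 2: minimise training loss first, and use the
    # penalty only to pick a preferred state among the loss minimisers.
    return min(states, key=lambda s: (s[1], s[2]))[0]

def tradeoff_choice(states, lam=0.1):
    # Path-independent picture: minimise loss + lam * penalty, so a little
    # training loss can be traded away for a much smaller penalty.
    return min(states, key=lambda s: s[1] + lam * s[2])[0]

print(preferred_minimiser(states))
print(tradeoff_choice(states))
```

With these numbers the first rule picks `zero_loss_simple` and the second picks `low_loss_simplest`. Under the first rule the bias only ever chooses among exact training-loss minimisers, which is why I describe it as weak inductive bias.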