SGD has a strong inherent simplicity bias, even without weight regularization, and this is fairly well known in the DL literature (I could probably find hundreds of examples if I had the time, which I do not). By SGD I specifically mean SGD variants that don’t use a 2nd-order approximation (as Adam does). There are many papers which find that approximately-2nd-order, variance-adjusted optimizers like Adam have various generalization/overfitting issues compared to SGD; this comes up over and over, such that it’s fairly common to use some additional regularization with Adam.
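(To make that last point concrete, here is a minimal PyTorch sketch of the kind of setup I mean; the model and hyperparameters are just placeholders: plain SGD is often run with no explicit weight penalty, while Adam is commonly paired with explicit regularization such as decoupled weight decay, i.e. AdamW.)

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model, just for illustration

# Plain SGD: no explicit weight penalty, relying on the optimizer's
# implicit simplicity bias.
sgd = torch.optim.SGD(model.parameters(), lr=0.1)

# Adam variant with decoupled weight decay (AdamW): the extra explicit
# regularization that is commonly added when using Adam-style optimizers.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
```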
It’s also pretty intuitively obvious why SGD has a strong simplicity prior if you just think through some simple examples: SGD doesn’t move directly toward the loss minimum; it moves in the parsimonious direction, the one that reduces loss the most per unit of weight distance moved away from the init. 2nd-order optimizers like Adam can move more directly in the direction of lower loss.
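(For concreteness, here is a toy numpy sketch of that geometric point; the gradient values are made up: SGD’s direction is just the raw gradient, i.e. steepest descent per unit of Euclidean distance in weight space, while an Adam-style update rescales each coordinate and so can point much more directly at lower loss along the shallow coordinates.)

```python
import numpy as np

# Hypothetical gradient: steep in coordinate 0, very shallow in coordinate 1.
grad = np.array([1.0, 0.01])

# SGD direction: proportional to the raw gradient, i.e. the direction that
# reduces loss the most per unit of Euclidean distance moved in weight space.
sgd_dir = -grad / np.linalg.norm(grad)

# Adam-style direction (momentum and bias correction omitted for brevity):
# each coordinate is divided by the square root of its second-moment
# estimate, which on this single step is just |grad_i|, so the step is
# roughly -sign(grad).
adam_step = -grad / (np.sqrt(grad ** 2) + 1e-8)
adam_dir = adam_step / np.linalg.norm(adam_step)

print(sgd_dir)   # ~[-1.00, -0.01]: barely moves the shallow coordinate
print(adam_dir)  # ~[-0.71, -0.71]: moves both coordinates equally
```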
Empirically, the inductive bias that you get when you train with SGD, and similar optimisers, is in fact quite similar to the inductive bias that you would get, if you were to repeatedly re-initialise a neural network until you randomly get a set of weights that yield a low loss. Which optimiser you use does have an effect as well, but this is very small by comparison. See this paper.
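(To spell out the baseline being compared against, here is a minimal toy sketch of that sampling procedure; the data, model, and loss threshold are all made up: keep drawing fresh random inits and accept the first one that already has low loss, with no gradient steps at all.)

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0])
X = rng.normal(size=(200, 2))
y = X @ w_true                      # toy regression targets

def loss(w):
    return np.mean((X @ w - y) ** 2)

threshold = 0.3
tries = 0
while True:
    tries += 1
    w = rng.normal(size=2)          # fresh random "initialisation"
    if loss(w) < threshold:         # accept only if it is already low-loss
        break

print(tries, loss(w), w)
# The distribution of accepted w is the "re-initialise until you randomly
# get low-loss weights" baseline whose inductive bias SGD is compared to.
```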
Yes. (Note that “randomly sample from the set of all low loss NN parameter configurations” goes hand in hand with there being a bias towards simplicity; it’s not a contradiction. Is that maybe what’s going on here: people misinterpreted Bensinger as somehow not realizing simpler configurations are more likely?)
My prior is that DL has a great amount of weird domain knowledge which is mysterious to those who haven’t spent years studying it, and years studying DL correlates with strong disagreement with the sequences/MIRI positions on many fundamentals. I trace all this back to EY over-updating on ev psych and not reading enough neuroscience and early DL.
So anyway, a sentence like “randomly sample from the set of all low loss NN parameter configurations” is not one I would use or expect a DL-insider to use, and sounds more like something a MIRI/LW person would say, in part because, yes, I don’t generally expect MIRI/LW folks to be especially aware of the intrinsic SGD simplicity prior. The more correct statement is “randomly sample from the set of all simple low loss configs” or similar.
But it’s also not quite clear to me how relevant that subpoint is; I’m just sharing my impression.
IMO this seems like a strawman. When talking to MIRI people it’s pretty clear they have thought a good amount about the inductive biases of SGD, including an associated simplicity prior.
Sure, it will clearly be a strawman for some individuals; the point of my comment is to explain how someone like myself could potentially misinterpret Bensinger, and why. (As I don’t know him very well, my brain models him as a generic MIRI/LW type.)
I want to revisit what Rob actually wrote:
If you sampled a random plan from the space of all writable plans (weighted by length, in any extant formal language), and all we knew about the plan is that executing it would successfully achieve some superhumanly ambitious technological goal like “invent fast-running whole-brain emulation”, then hitting a button to execute the plan would kill all humans, with very high probability.
(emphasis mine)
That sounds a whole lot like it’s invoking a simplicity prior to me!
Note I didn’t actually reply to that quote. Sure, that’s an explicit simplicity prior. However, there’s a large difference under the hood between using an explicit simplicity prior on plan length vs an implicit simplicity prior on the world and action models which generate plans. The latter is what is more relevant for intrinsic similarity to human thought processes (or not).