However, neural nets in deep learning are trained via SGD, not with Bayesian updating.
SGD is a form of efficient approximate Bayesian updating. More specifically, it’s a local linear first-order approximation. As the step size approaches zero, this approximation becomes tight, under some potentially enormous simplifying assumptions of unit variance (which in practice are enforced through initialization and explicit normalization).
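To make that concrete, here’s a toy sketch in the simplest possible setting (a conjugate Gaussian model with unit variances; the variable names are just mine for illustration). In this trivial case the sequential Bayesian posterior-mean update and an SGD step with step size 1/(t+1) are literally the same update; the general claim is the first-order/linearized version of this.

```python
import numpy as np

# Toy case: estimate the mean theta of a unit-variance Gaussian, with a
# unit-variance Gaussian prior N(0, 1) on theta. In this conjugate setting the
# exact sequential Bayesian posterior mean and SGD on the per-example squared
# loss (step size 1/(t+1), initialized at the prior mean) coincide exactly.

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=1.0, size=50)

bayes_mean = 0.0   # prior mean
sgd_theta = 0.0    # SGD initialized at the prior mean

for t, x in enumerate(data, start=1):
    # Bayesian update: posterior precision after t points is 1 + t, so the
    # posterior mean moves towards x by a factor of 1/(1 + t).
    bayes_mean += (x - bayes_mean) / (1 + t)

    # SGD on l_t(theta) = 0.5 * (theta - x)^2 with step size 1/(1 + t).
    eta = 1.0 / (1 + t)
    sgd_theta -= eta * (sgd_theta - x)

print(bayes_mean, sgd_theta)  # identical trajectories in this toy case
```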
But anyway, that’s not directly relevant, as Bayesian updating doesn’t have a monopoly on entropy/complexity tradeoffs.
If you want to be ‘rigorous’, then you shouldn’t have confidently said:
Even if biasing towards simpler models is a good idea, we don’t have a good way of doing this in deep learning yet, apart from restricting the number of parameters,
(As you can’t rigorously back that statement up.) Regularization to bias towards simpler models in DL absolutely works well, regardless of whether you understand it or find the provided explanations satisfactory.
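To be concrete about the standard mechanism (a generic sketch of my own, not anyone’s specific recipe): the usual way to bias towards simpler models is an L2 penalty / weight decay folded into the update, which shrinks every weight a little on each step without touching the parameter count.

```python
import numpy as np

# Generic sketch of an SGD step with an L2 penalty ("weight decay"); the
# function name is mine, purely for illustration.

def sgd_step_weight_decay(w, grad_loss, lr=1e-2, weight_decay=1e-4):
    # The penalty 0.5 * weight_decay * ||w||^2 adds weight_decay * w to the
    # gradient, so every step pulls the weights towards zero.
    return w - lr * (grad_loss + weight_decay * w)

# Toy usage: ridge-style linear regression on random data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 10)), rng.normal(size=64)
w = np.zeros(10)
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)
    w = sgd_step_weight_decay(w, grad)
```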
SGD is a form of efficient approximate Bayesian updating.
Yeah I saw you were arguing that in one of your posts. I’ll take a closer look. I honestly have not heard of this before.
Regarding my statement: I agree that, looking back at it, it is horribly sloppy and sounds absurd, but when I was writing it I was just thinking about how all L1 and L2 regularization do is bias towards smaller weights; the models still take up the same amount of space on disk and require the same amount of compute to run in terms of FLOPs. But yes, you’re right that they make the models easier to approximate.
So actually, L1/L2 regularization does allow you to compress the model by reducing its entropy, as evidenced by the fact that any effective pruning/quantization system necessarily involves some strong regularizer applied during or after training.
The model itself can’t possibly know or care whether you later actually compress the weights, so it’s never the actual compression itself that matters, but rather the inherent compressibility (which comes from the regularization).
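A quick toy demonstration of that point (mine, purely illustrative): fit the same linear model with and without an L1 penalty and look at how many weights end up prunable. The regularized model has the same shape and FLOP count, but most of its weights are driven to (near) zero, which is exactly the compressibility I mean.

```python
import numpy as np

# Toy sketch: same model, same parameter count, with vs. without an L1 penalty
# (applied via a soft-threshold / proximal step). Only the regularized version
# ends up with mostly-zero, hence prunable/compressible, weights.

rng = np.random.default_rng(0)
n, d = 200, 100
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:5] = rng.normal(size=5)            # only 5 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=n)

def train(l1=0.0, lr=1e-2, steps=2000):
    w = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n       # gradient of the squared loss
        w = w - lr * grad
        if l1 > 0:                         # proximal step for the L1 penalty
            w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w

def near_zero_fraction(w, tol=1e-3):
    return np.mean(np.abs(w) < tol)

print("near-zero weights, no regularizer:", near_zero_fraction(train(l1=0.0)))
print("near-zero weights, L1 regularizer:", near_zero_fraction(train(l1=0.1)))
```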