The main point, as I see it, is essentially that functions with good generalisation correspond to large volumes in parameter-space, and that SGD finds functions with a probability roughly proportional to their volume.
This seems right, but I’m not sure how that’s different from Zach’s phrasing of the main point? Zach’s phrasing was “SGD approximately equals random sampling”, and random sampling finds functions with probability exactly proportional to their volume. Combine that with the fact that empirically we get good generalization, and we get the thing you said.
(Maybe you weren’t disagreeing with Zach and were just saying the same thing a different way?)
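To make that comparison concrete, here is a minimal sketch of the kind of experiment the claim is about: draw parameters from the initialisation distribution, keep only the draws that fit the training data, tally which function each one computes on held-out inputs, and then do the same for gradient-descent runs started from that same distribution. Random sampling finds each zero-error function with probability exactly proportional to its parameter-space volume under the prior, so if “SGD approximately equals random sampling”, the two frequency tables should roughly agree. Everything here (architecture, data, hyperparameters) is an illustrative assumption, not the actual experimental setup.

```python
# Minimal sketch (numpy only; architecture, data and hyperparameters are
# illustrative assumptions) of the "random sampling vs SGD" comparison.

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy task: 3-bit inputs with parity labels; the first 4 points are "training
# data", the other 4 are held out and only used to identify which function a
# given network computes.
X = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)], dtype=float)
y = X.sum(axis=1).astype(int) % 2
train, test = slice(0, 4), slice(4, 8)

def init():
    """One draw from the Gaussian 'prior' over the parameters of a 3-4-1 tanh MLP."""
    return [rng.standard_normal(s) for s in [(3, 4), (4,), (4, 1), (1,)]]

def forward(p, X):
    W1, b1, W2, b2 = p
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()

def labels(p, X):
    return tuple((forward(p, X) > 0).astype(int))

def fits_train(p):
    return labels(p, X[train]) == tuple(y[train])

# 1) Random sampling: draw from the prior, keep draws with zero training error,
#    and record the function they compute on the held-out points. The frequency
#    of each function is, by construction, proportional to its volume under the prior.
sampled = Counter()
for _ in range(100_000):
    p = init()
    if fits_train(p):
        sampled[labels(p, X[test])] += 1

# 2) (Full-batch) gradient descent from the same prior, standing in for SGD here,
#    minimising squared error on the training set.
def train_gd(p, steps=2000, lr=0.1):
    W1, b1, W2, b2 = p
    Xtr, ytr = X[train], 2.0 * y[train] - 1.0       # targets in {-1, +1}
    for _ in range(steps):
        h = np.tanh(Xtr @ W1 + b1)
        err = (h @ W2 + b2).ravel() - ytr           # gradient of 0.5*err^2 wrt output
        gW2 = h.T @ err[:, None]
        gb2 = err.sum(keepdims=True)
        dh = err[:, None] * W2.T * (1.0 - h ** 2)   # backprop through tanh
        gW1, gb1 = Xtr.T @ dh, dh.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
    return [W1, b1, W2, b2]

trained = Counter()
for _ in range(500):
    p = train_gd(init())
    if fits_train(p):
        trained[labels(p, X[test])] += 1

# If "SGD approximately equals random sampling", these two frequency tables
# should roughly agree (up to sampling noise on this tiny problem).
print("random sampling:", sampled.most_common(5))
print("gradient descent:", trained.most_common(5))
```

(This is only meant to illustrate what the claim says; agreement or disagreement on a toy problem like this obviously doesn’t settle anything either way.)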
Saying that SGD is “Bayesian” is one way of saying the latter
This feels similar to:
Saying that MLK was a “criminal” is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title “neural networks are fundamentally Bayesian” I was thinking of something very different.)
the Kolmogorov complexity stuff is a way to formalise some intuitions around the former.
I’ve discussed this above; I’ll continue the discussion there.
The rest of the comment is about stuff that I didn’t have a strong opinion on, so I’ll leave it for Zach to answer if he wants.
(Maybe you weren’t disagreeing with Zach and were just saying the same thing a different way?)
I’m honestly not sure; I just wasn’t clear on what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were “distractions from the main point”.
Saying that MLK was a “criminal” is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title “neural networks are fundamentally Bayesian” I was thinking of something very different.)
Haha. That’s obviously not what we’re trying to do here, but I do see what you mean. I originally wanted to express these ideas in more geometric language, rather than probability-theoretic language, but in the end we decided to go for more probability-theoretic language anyway.
I agree that this arguably could be mildly misleading. For example, the correspondence between SGD and Bayesian sampling only really holds for some initialisation distributions. If you deterministically initialise your neural network to the origin (i.e., all zero weights) then SGD will most certainly not behave like Bayesian sampling with the initialisation distribution as its prior. Then again, the probability-theoretic formulation might make other things more intuitive.
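To make the zero-initialisation example concrete, here is a minimal sketch (architecture and data are illustrative assumptions) of what goes wrong. For this particular tanh MLP, starting at the origin means the hidden activations are exactly zero, so neither layer of weights ever receives a nonzero gradient: gradient descent can only move the output bias, i.e. it can only ever learn a constant function, which clearly isn’t a posterior sample under the usual Gaussian prior.

```python
# Minimal sketch (illustrative assumption, not the post's setup) of why a
# deterministic all-zero initialisation breaks the SGD-as-Bayesian-sampling
# picture: with every weight and bias at zero, only the output bias can move.

import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])                     # XOR targets, mean 0.5

# The "initialisation distribution" is a point mass at the origin.
W1, b1 = np.zeros((2, 4)), np.zeros(4)
W2, b2 = np.zeros((4, 1)), np.zeros(1)

lr = 0.1
for _ in range(1000):
    h = np.tanh(X @ W1 + b1)                       # stays 0 while W1 == 0 and b1 == 0
    err = (h @ W2 + b2).ravel() - y                # gradient of 0.5*err^2 wrt output
    gW2 = h.T @ err[:, None]                       # == 0, because h == 0
    gb2 = err.sum(keepdims=True)
    dh = err[:, None] * W2.T * (1.0 - h ** 2)      # == 0, because W2 == 0
    gW1, gb1 = X.T @ dh, dh.sum(axis=0)            # == 0 as well
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

print("hidden weights unchanged:", np.all(W1 == 0) and np.all(W2 == 0))
print("learned function is the constant", float(b2[0]))   # ≈ 0.5 = mean(y)
```

Every run from this initialisation lands on the same constant function, so the correspondence with Bayesian sampling only makes sense for a sufficiently spread-out initialisation distribution, as noted above.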