(Maybe you weren’t disagreeing with Zach and were just saying the same thing a different way?)
I’m honestly not sure. I just wasn’t clear on what he meant when he said that the Bayesian and the Kolmogorov complexity stuff were “distractions from the main point”.
Saying that MLK was a “criminal” is one way of saying that MLK thought and acted as though he had a moral responsibility to break unjust laws and to take direct action.
(This is an exaggeration but I think it is directionally correct. Certainly when I read the title “neural networks are fundamentally Bayesian” I was thinking of something very different.)
Haha. That’s obviously not what we’re trying to do here, but I do see what you mean. I originally wanted to express these ideas in more geometric language, rather than probability-theoretic language, but in the end we decided to go for more probability-theoretic language anyway.
I agree that this could arguably be mildly misleading. For example, the correspondence between SGD and Bayesian sampling only really holds for some initialisation distributions. If you deterministically initialise your neural network to the origin (i.e., all zero weights), then SGD will most certainly not behave like Bayesian sampling with the initialisation distribution as its prior. Then again, the probability-theoretic formulation might make other things more intuitive.
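To make the point about deterministic initialisation concrete, here is a small sketch of my own (not from the post): in an underdetermined least-squares problem, gradient descent preserves the null-space component of the initial weights, so an ensemble of runs from random inits has genuine spread (which is what lets it mimic posterior samples), while an ensemble of runs from the all-zero init collapses to a single point and cannot match any non-degenerate posterior. The toy problem and all the names below are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined regression: 5 data points, 20 parameters,
# so infinitely many weight vectors fit the data exactly.
X = rng.normal(size=(5, 20))
y = rng.normal(size=5)

def train(w0, steps=2000, lr=0.01):
    """Full-batch gradient descent on mean squared error from init w0."""
    w = w0.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Ensemble over random initialisations: each run keeps its own
# null-space component, so the trained weights differ run to run.
random_runs = np.stack([train(rng.normal(size=20)) for _ in range(10)])

# Ensemble from the deterministic all-zero init: every run follows
# the identical trajectory, so the "posterior" is a point mass.
zero_runs = np.stack([train(np.zeros(20)) for _ in range(10)])

print("spread over random inits:", random_runs.std(axis=0).max())
print("spread over zero init:   ", zero_runs.std(axis=0).max())
```

In this linear toy case the spread over random inits is exactly the prior's null-space spread, which is the flavour of the SGD-as-sampling correspondence; the zero-init ensemble has zero variance, so there is no initialisation distribution doing any work as a prior.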