I stumbled across this document. I believe it may have influenced a young Eliezer Yudkowsky. He’s certainly shown reverence for the author before.
This essay includes everything. A rant against frequentism and for the superiority of Bayes. A rant against modern academic institutions. A rant against mainstream quantum physics. A section about how mainstream AI is too ad hoc and not grounded in perfect Bayesian math. A closing section about sticking to your non-mainstream beliefs and ignoring critics.
I’m not really qualified to speak about most of it. The part about AI, in particular, bothered me. He attacks neural networks and suggests that Bayesian networks are best.
I initially wrote a big rant about how terribly he misunderstands neural networks. But the more I think about it, the more I like the idea of Bayesian networks. The idea of ideal, perfect, universal methods appeals to me a great deal.
And that’s a serious problem for me. I once got very into libertarianism because of that, and then into crazy AI methods that are totally impractical in reality.
And thinking about it some more: Bayesian networks are cool, but I don’t think they could replace all of ML. Half of what neural networks do isn’t just better inference. Sometimes we have plenty of training data and overfitting isn’t much of an issue; the hard part is just getting a model to fit the data at all.
Bayes’ theorem doesn’t say anything about optimization, and it’s terribly expensive to approximate. And Jaynes’ rant against non-linear functions doesn’t even make sense outside of boolean functions (and even there it isn’t necessarily optimal; you would have to learn a lookup table for each node that explodes exponentially with the number of inputs). And if you are going to go full Bayesian, why stop there? Why not go to full Solomonoff induction (or at least some approximation of it)?
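For a sense of how fast that lookup table explodes: a boolean node with k boolean parents needs one entry per configuration of its parents, so 2^k entries. A quick illustration of my own, not something from the original comment:

```python
# Exponential blow-up of a per-node lookup table (conditional probability table):
# a boolean node with k boolean parents needs 2**k entries.
for k in (2, 10, 20, 30):
    print(f"{k} parents -> {2**k:,} table entries")
```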
I don’t think he suggests Bayesian networks (which, to me, means the causal networks of Pearl et al.). Rather, he is literally suggesting learning by Bayesian inference. His comments about nonlinearity are, I think, just to the effect that one shouldn’t have to introduce nonlinearity with sigmoid activation functions; the nonlinearity should arise naturally from Bayesian updates. But yeah, I think it’s quite impractical.
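One way to read that: for a two-class problem with conditionally independent evidence, Bayes’ rule produces a sigmoid on its own, because turning posterior log-odds back into a probability is exactly the logistic function. A tiny sketch of that reading (my gloss, not anything Jaynes wrote):

```python
import math

def posterior_from_bayes(log_likelihood_ratios, prior_log_odds=0.0):
    # Bayes' rule in log-odds form: posterior log-odds = prior log-odds
    # plus the summed log-likelihood ratios of the (assumed independent) evidence.
    log_odds = prior_log_odds + sum(log_likelihood_ratios)
    # Converting log-odds back to a probability is the logistic (sigmoid)
    # function -- the nonlinearity falls out of the update itself.
    return 1.0 / (1.0 + math.exp(-log_odds))

print(posterior_from_bayes([1.2, -0.3, 2.0]))  # three made-up pieces of evidence
```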
E.g. suppose you wanted to build an email spam filter, and wanted P(spam). A (non-naive) Bayesian approach to this classification problem might involve a prior over some large population of email-generating processes. Every time you get a training email, you update your probability that a generic email comes from each particular process, and what that process’s probability of producing spam is. When run on a test email, the spam filter goes through every single hypothesis, evaluates its probability of producing this email, and then takes a weighted average of the spam probabilities of those hypotheses to get its spam / not-spam verdict. This seems like too much work.
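Here’s a minimal sketch of that procedure, with everything shrunk to toy size. The two hypotheses, their word distributions, and their spam probabilities are invented for illustration; a real “large population of email-generating processes” is exactly where this becomes too much work:

```python
# Toy version of the fully Bayesian spam filter described above.
# Each hypothesis is a bag-of-words email generator with a fixed P(spam | h).
HYPOTHESES = [
    {"words": {"viagra": 0.4, "meeting": 0.1, "free": 0.5}, "p_spam": 0.9},
    {"words": {"viagra": 0.05, "meeting": 0.6, "free": 0.35}, "p_spam": 0.1},
]

def likelihood(hyp, email):
    """P(email | hypothesis), treating the email as an i.i.d. bag of words."""
    p = 1.0
    for word in email:
        p *= hyp["words"].get(word, 1e-6)  # tiny floor for unmodeled words
    return p

def update(belief, email):
    """Posterior over hypotheses after observing one training email."""
    weighted = [b * likelihood(h, email) for b, h in zip(belief, HYPOTHESES)]
    total = sum(weighted)
    return [w / total for w in weighted]

def p_spam(belief, email):
    """Weigh every hypothesis by how well it explains the test email,
    then average their spam probabilities to get the verdict."""
    weighted = [b * likelihood(h, email) for b, h in zip(belief, HYPOTHESES)]
    total = sum(weighted)
    return sum(w * h["p_spam"] for w, h in zip(weighted, HYPOTHESES)) / total

belief = [0.5, 0.5]                           # prior over the processes
belief = update(belief, ["meeting", "free"])  # one training email
print(p_spam(belief, ["free", "viagra"]))     # verdict on a test email
```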
I don’t know, that comment really seemed to suggest Bayesian networks. I guess you could allow for a distribution over possible activation functions, but that doesn’t really fit what he said about learning the “exact” nonlinear function for every possible function. That fits more with Bayes nets, which use a lookup table for every node.
Your example sounds like a Bayesian net. But it doesn’t really fit his description of learning optimal nonlinearities for functions.