I don’t think he is suggesting Bayesian networks (which, to me, mean the causal networks of Pearl et al.). Rather, he is literally suggesting learning by Bayesian inference. I think his comments about nonlinearity just mean that one shouldn’t have to introduce nonlinearity with sigmoid activation functions; the nonlinearity should arise naturally from Bayesian updates. But yeah, I think it’s quite impractical.
E.g. suppose you wanted to build an email spam filter, and wanted P(spam). A (non-naive) Bayesian approach to this classification problem might involve a prior over some large population of email-generating processes. Every time you get a training email, you update your probability that a generic email comes from each particular process, along with your estimate of that process’s probability of producing spam. When run on a test email, the spam filter goes through every single hypothesis, evaluates its probability of producing this email, and then takes a weighted average of the spam probabilities of those hypotheses to get its spam / not spam verdict. This seems like too much work, since every single prediction has to touch every hypothesis.
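To make the cost concrete, here is a toy sketch of that procedure. Everything in it is an assumption for illustration: an email is reduced to one bit (does it contain the word "free"?), a "process" is a triple of probabilities taken from a coarse grid, and the names `HYPOTHESES`, `joint`, `update` and `p_spam_given_email` are made up. Even this tiny version loops over all 125 hypotheses on every update and every prediction.

```python
from itertools import product

# Toy hypothesis space (assumed): each email-generating process is a triple
# (p_spam, p_free_if_spam, p_free_if_ham) drawn from a coarse grid.
GRID = [0.1, 0.3, 0.5, 0.7, 0.9]
HYPOTHESES = list(product(GRID, GRID, GRID))


def joint(h, has_free, is_spam):
    """P(email, label | h) for a one-bit email under hypothesis h."""
    p_spam, p_free_spam, p_free_ham = h
    p_label = p_spam if is_spam else 1.0 - p_spam
    p_free = p_free_spam if is_spam else p_free_ham
    return p_label * (p_free if has_free else 1.0 - p_free)


def update(weights, has_free, is_spam):
    """One Bayesian update: reweight every hypothesis by its likelihood."""
    new = [w * joint(h, has_free, is_spam) for w, h in zip(weights, HYPOTHESES)]
    total = sum(new)
    return [w / total for w in new]


def p_spam_given_email(weights, has_free):
    """Posterior-weighted average of the spam verdict over all hypotheses."""
    spam = sum(w * joint(h, has_free, True) for w, h in zip(weights, HYPOTHESES))
    ham = sum(w * joint(h, has_free, False) for w, h in zip(weights, HYPOTHESES))
    return spam / (spam + ham)


# Uniform prior over the hypotheses, then a handful of made-up training
# emails given as (has_free, is_spam) pairs.
weights = [1.0 / len(HYPOTHESES)] * len(HYPOTHESES)
training = [(True, True), (True, True), (False, False), (False, False), (True, False)]
for has_free, is_spam in training:
    weights = update(weights, has_free, is_spam)

print(p_spam_given_email(weights, True))   # email containing "free"
print(p_spam_given_email(weights, False))  # email without "free"
```

With realistic emails the hypothesis space would be astronomically larger than this grid, which is the impracticality I mean.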
I don’t know, that comment really seemed to suggest Bayesian networks. I guess you could allow for a distribution over possible activation functions, but that doesn’t really fit what he said about learning the “exact” nonlinear function for every possible function. That fits better with Bayes nets, which use a lookup table for every node.
Your example sounds like a Bayesian net. But it doesn’t really fit his description of learning optimal nonlinearities for functions.