It’s not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don’t do that at all.
Can you expand on the relationship between the two? As far as I can see, Bayesian learning starts from a prior distribution, and PAC learning (for PAC-learnable concepts) works independently of the distribution. That is, they’re solving average-case vs. worst-case or adversarial problems, which is a subject Eliezer has discussed here. Further discussion, mostly not by Eliezer, here. Those discussions don’t mention PAC though.
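To spell the contrast out for myself, here is a toy sketch (my own illustration; the threshold concept class and all the names in it are made up for the example): the Bayesian learner can't even start until it commits to a prior over hypotheses, while the PAC-style learner just returns something consistent with the sample, and its guarantee is supposed to hold whatever distribution the examples were drawn from.

```python
import random

# Toy hypothesis class: thresholds on {0, ..., 99}, h_t(x) = 1 iff x >= t.
HYPOTHESES = list(range(101))
def h(t, x):
    return int(x >= t)

def bayes_posterior(prior, sample):
    """Exact Bayesian update over thresholds; requires a prior to even start."""
    post = dict(prior)
    for x, y in sample:
        for t in post:
            if h(t, x) != y:        # 0/1 likelihood, since labels are noiseless
                post[t] = 0.0
    z = sum(post.values())
    return {t: p / z for t, p in post.items()}

def erm(sample):
    """PAC-style learner: return any threshold consistent with the sample.
    No prior is needed, and the guarantee is distribution-free."""
    return next(t for t in HYPOTHESES
                if all(h(t, x) == y for x, y in sample))

true_t = 37
data = [(x, h(true_t, x)) for x in random.sample(range(100), 20)]
uniform_prior = {t: 1.0 / len(HYPOTHESES) for t in HYPOTHESES}
post = bayes_posterior(uniform_prior, data)
print("Bayes MAP threshold:", max(post, key=post.get))
print("ERM threshold:", erm(data))
```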
PAC-Bayesian hybrids appear to also be a thing, but all I know about that is the Google search I just did.
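From that same quick look, the flavor of result seems to be roughly the following (a McAllester-style PAC-Bayes bound, quoted from memory, so take the exact constants with a grain of salt): for any prior $P$ over hypotheses, with probability at least $1-\delta$ over an i.i.d. sample of size $m$, every "posterior" $Q$ satisfies

$$\mathbb{E}_{h\sim Q}[R(h)] \;\le\; \mathbb{E}_{h\sim Q}[\hat{R}(h)] \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{m}{\delta}}{2(m-1)}}.$$

So the prior only affects how tight the bound is; the guarantee itself is distribution-free in the PAC sense.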
I suppose that in some sense, the Bayesian-vs-PAC comparison is the learning analogue of the average-case vs. worst-case comparison (where Bayesian = average case, PAC = worst case). It's not a perfect analogy, but in both cases a big question is: what do you do if you don't know the distribution/prior?
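To make the worst-case part concrete: in the realizable setting with a finite hypothesis class $H$, the standard PAC bound says that any learner outputting a hypothesis consistent with

$$m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$$

examples has true error at most $\epsilon$ with probability at least $1-\delta$, and the data distribution appears nowhere in the bound, which is exactly the sense in which PAC is distribution-free.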
In the learning version, I think the problem is even more severe—I don’t know if an analogue of de-randomization works anymore, and I don’t think the “P=BPP” conjecture has an analogue.
The learning version has some additional complications, in that Bayesian inference is computationally intractable in many cases.
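Concretely, the usual culprit is the normalizing constant in the exact posterior,

$$p(h \mid D) \;=\; \frac{p(D \mid h)\, p(h)}{\sum_{h' \in \mathcal{H}} p(D \mid h')\, p(h')},$$

where the sum (or integral) in the denominator runs over the whole hypothesis space, which is typically exponentially large or continuous, so in practice it has to be approximated.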
I suppose the Bayesian answer to that is that a probability distribution is a description of one's knowledge, and that in principle, every state of knowledge, including total ignorance, can be represented as a prior distribution. In practice, one may not know how to do that. Fundamentalist Bayesians say that this is a weakness in our knowledge, while everyone else, from weak Bayesians to Sunday Bayesians, crypto-frequentists, and ardent frequentists, says it's a weakness of Bayesian reasoning. Not being a statistician, I don't need to take a view, although I incline against arguments deducing impossibility from ignorance.
The learning version
What are you contrasting with learning? I don’t see a demarcation of principle between learning and any other sort of statistical prediction.
I suppose the Bayesian answer to that is that a probability distribution is a description of one's knowledge, and that in principle, every state of knowledge, including total ignorance, can be represented as a prior distribution. In practice, one may not know how to do that. Fundamentalist Bayesians say that this is a weakness in our knowledge, while everyone else, from weak Bayesians to Sunday Bayesians, crypto-frequentists, and ardent frequentists, says it's a weakness of Bayesian reasoning. Not being a statistician, I don't need to take a view, although I incline against arguments deducing impossibility from ignorance.
I don’t have any strong disagreements there. But consider: if we can learn well even without assuming any distribution or prior, isn’t that worth exploring? The fact that there is an alternative to Bayesianism—one that we can prove works (in some well-defined settings), and isn’t just naive frequentism—is pretty fascinating, isn’t it?
What are you contrasting with learning?
I’m contrasting randomized vs. deterministic algorithms, which Eliezer discussed in your linked article, with Bayesian vs. PAC learning models. The randomized vs. deterministic question shouldn’t really be considered learning, unless you want to call things like primality testing “learning”.
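For what I mean by "things like primality testing": here is the usual Miller-Rabin probable-prime test, sketched from memory as an illustration. It answers a fixed yes/no question about its input, drives the error down by repeating independent random trials, and no distribution over inputs or prior enters anywhere, which is why I wouldn't call it learning.

```python
import random

def is_probable_prime(n, rounds=20):
    """Miller-Rabin: if n is composite, each round catches it with prob >= 3/4."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    # Write n - 1 as d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                      # a is a witness that n is composite
    return True                               # "probably prime": error <= 4**-rounds

print(is_probable_prime(2**61 - 1))  # True: a known Mersenne prime
print(is_probable_prime(2**61 + 1))  # False: divisible by 3
```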