People here seem to really like Solomonoff induction, but I don’t think it’s all that relevant to learning in practice due to computational complexity.
Solomonoff induction is not computable. Trying to “approximate” it, by coming up with hypotheses similar to it, is probably also not computable.
If you replace Solomonoff induction with induction over programs that halt quickly or induction over circuits, it becomes computable, but it is still NP-hard. Again, approximating this is probably also NP-hard, depending on your definition of approximation.
Next, if you replace boolean circuits with neural nets, it is still hard to find the best neural net to fit the data. MCMC and gradient descent only find local optima. I mean, the fact that neural nets didn’t give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.
It’s not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don’t do that at all.
Not strictly true. Simply by overparameterizing the ANN, you can emulate parallel gradient descent (global optimization). Stochastic techniques—such as dropout—and tricks such as momentum can also help tunnel through local optima. In addition, the typical optimization hurdles tend to be saddlepoints rather than local optima.
Modern SGD mechanisms are powerful global optimizers.
I mean, the fact that neural nets didn’t give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.
It demonstrates nothing of the sort, but such a demonstration would be pointless anyway. Solomonoff induction is completely worthless—intractable—so you absolutely don’t want to do that anyway.
They [modern SGD systems] are heuristic optimizers that have no guarantees of finding a global optimum. It’s strange to call them “powerful global optimizers”.
Not at all strange. As I pointed out in the post above—modern SGD functions as a global optimizer due to clever tricks such as over-parameterization, dropout, and momentum. These optimizers are also suprisingly powerful as measured in terms of actual solution quality vs computational cost—indeed they dominate most of the landscape. It is completely disingenous to call the dominate solution anything other than powerful.
In practice strong guarantees are not important. What actually matters is finding a solution that is good enough relative to computational costs, where ‘good enough’ is determined according to some more complex business/economic utility metric, where solution quality follows diminishing returns. For sufficiently complex problems, finding the actual global optimum is rarely ideal—it is far too far to the right on the asymptotic diminishing returns curve.
What actually matters is finding a solution that is good enough relative to computational costs
That is true, but calling methods to find such solutions “global optimizers” is still a misnaming. Adding some noise/randomness to your local optimization does not a global optimizer make.
Of course, stochastic local-ish optimizers can be very powerful and very useful. They are just not global.
Disagree, but this may be now on the level of terminology.
By powerful global optimizer, I mean an optimizer which can find global optimums roughly using similar time/space to other techniques, and more importantly can find good enough solutions that are very close to the true global optimum but using much less time/space.
Adding some noise/randomness to your local optimization does not a global optimizer make.
Actually it does. As I pointed out above, adding the right kind of noise/randomness allows SGD to find the global optimum (given enough chances/time). This is just trivially true—if you randomly search long enough and retain the best solution found so far, you are guaranteed to eventually find the optimal solution.
Thinking about it a bit, the “global” vs “local” property is not really a property of an optimizer, as it is a combined property of the optimizer and the search space (aka the error landscape). For an entirely arbitrary search space, only the exhaustive search and equivalents will guarantee you the global optimum. For simple search spaces (e.g. quadratic optimization) easy “local” things like hill-climbing will guarantee you the global optimum. If your solution is bounded, you always know how close to the theoretical best you are and can adjust the search accordingly (e.g. terminate early), etc.
Yes. In some cases simple (convex) search spaces even permit exact non-iterative analytic solutions which are faster than any iterative approach like SGD (although curiously, they are typically not dramatically faster—as they still need to analyze each example once, and the number of full epochs/iterations in SGD type approaches appears to be sublinear. )
The more advanced methods—natural gradient, hessian free, etc—expend extra computation to analyze the local search space and then use this analysis to make an improved nonlocal jump. This always has extra cost though, so these techniques just end up tending to match SGD in terms of net optimization speed.
Looking at it another way, when a researcher uses something like a convex optimization technique to solve a problem more quickly—what actually happened is the combined human+machine system used a global optimizer to solve the problem, with more of the algorithm running in the human’s brain than on the machine. ;)
It’s not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don’t do that at all.
Can you expand on the relationship between the two? As far as I can see, Bayesian learning starts from a prior distribution, and PAC learning (for PAC-learnable concepts) works independently of the distribution. That is, they’re solving average-case vs. worst-case or adversarial problems, which is a subject Eliezer has discussed here. Further discussion, mostly not by Eliezer, here. Those discussions don’t mention PAC though.
PAC-Bayesian hybrids appear to also be a thing, but all I know about that is the Google search I just did.
I suppose that in some sense, the Bayesian-vs-PAC comparison is the learning analogue of the average-case vs. worst-case comparison (where Bayesian=average case, PAC=worst case). It’s not a perfect analogy, but in both cases a big question is what if you don’t know the distribution/prior?
In the learning version, I think the problem is even more severe—I don’t know if an analogue of de-randomization works anymore, and I don’t think the “P=BPP” conjecture has an analogue.
The learning version has some additional complications, in that Bayesian inference is computationally intractable in many cases.
I suppose the Bayesian answer to that is that a probability distribution is a description of one’s knowledge, and that in principle, every state of knowledge, including total ignorance, can be represented as a prior distribution. In practice, one may not know how to do that. Fundamentalist Bayesians say that that is a weakness in our knowledge, while everyone else, from weak Bayesians to Sunday Bayesians, crypto-frequentists, and ardent frequentists, say it’s a weakness of Bayesian reasoning. Not being a statistician, I don’t need to take a view, although I incline against arguments deducing impossibility from ignorance.
The learning version
What are you contrasting with learning? I don’t see a demarcation of principle between learning and any other sort of statistical prediction.
I suppose the Bayesian answer to that is that a probability distribution is a description of one’s knowledge, and that in principle, every state of knowledge, including total ignorance, can be represented as a prior distribution. In practice, one may not know how to do that. Fundamentalist Bayesians say that that is a weakness in our knowledge, while everyone else, from weak Bayesians to Sunday Bayesians, crypto-frequentists, and ardent frequentists, say it’s a weakness of Bayesian reasoning. Not being a statistician, I don’t need to take a view, although I incline against arguments deducing impossibility from ignorance.
I don’t have any strong disagreements there. But consider: if we can learn well even without assuming any distribution or prior, isn’t that worth exploring? The fact that there is an alternative to Bayesianism—one that we can prove works (in some well-defined settings), and isn’t just naive frequentism—is pretty fascinating, isn’t it?
What are you contrasting with learning?
I’m contrasting randomized vs. deterministic algorithms, which Eliezer discussed in your linked article, with Bayesian vs. PAC learning models. The randomized vs. deterministic question shouldn’t really be considered learning, unless you want to call things like primality testing “learning”.
It’s true that it’s only possible to find local optima, but that’s true with any algorithm. Unless P=NP, perfect mathematically optimal AI is probably impossible. But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn’t a huge problem.
As to why we can have decent machine learning and not AGI, I don’t know. Approximating SI isn’t sufficient for one, you need to act on the models you find. But even so, I’m more trying to say it’s is a promising research direction, and better than other stuff. I’m not saying it’s a solved problem. Or that there are no further advances to be made.
It’s not even clear that a learning program must approximate Bayesian inference.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is. If you have enough data, the maximum likelihood approaches bayesian inference. And that’s how NNs are typically trained. But it’s nice to have other hypotheses and not just assume that the most likely hypothesis is 100% correct.
But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn’t a huge problem.
That’s not even true. In practice, it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can’t).
As to why we can have decent machine learning and not AGI, I don’t know.
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
Approximating SI isn’t sufficient for one, you need to act on the models you find.
That’s trivial to do. It’s not the problem here.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is.
This might be true in some sense, but not in a meaningful one. PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
People on LW tend to be hard-core Bayesians who have never even heard of PAC learning, which is an entire branch of learning theory. I find it rather strange.
That’s not even true. In practice, it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can’t).
SGD seems to be sufficient to train neural networks on real world datasets. I would honestly argue that it’s better than whatever algorithms humans use. In that generally it learns much faster and far better on many problems.
That’s not sufficient for general intelligence. But I think the missing piece is not some super powerful optimization algorithm. A better optimizer is not going to change the field much or give us AGI.
And I also don’t think the magic of the human brain is in really good optimization. Most scientists don’t even believe that the brain can do anything as advanced as backpropagation, which requires accurately estimating gradients and propagating them backwards through long time steps. Most biologically plausible algorithms I’ve seen are super restricted or primitive.
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one which exploits the structure of images. Like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same at another part.)
That’s trivial to do. It’s not the problem here.
It’s really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.
Even if you create a bunch of really good models, simulating them forward thousands of steps for every possible series of actions is impossible.
I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one which exploits the structure of images. Like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same at another part.)
If you can find the optimal NN, that basically lets you solve circuit minimization, an NP-hard task. This will allow you to find the best computationally-tractable hypothesis for any problem, which is similar to Solomonoff induction for practical purposes. It will certainly be a huge improvement over current NN approaches, and it may indeed lead to AGI. Unfortunately, it’s probably impossible.
It’s really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.
I was only trying to say that if you’re finding the best NN, then simulating them is easy. I agree that this is not the full AIXI. I guess I misunderstood you—I thought you were trying to say that the reason NN doesn’t give us AGI is because they are hard to simulate.
PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
I looked into PAC learning a bit when Scott Aaronson talked about it on his blog, and came to the following conclusion. ‘Instead of saying “PAC-learning and Bayesianism are two different useful formalisms for reasoning about learning and prediction” I think we can keep just Bayesianism and reinterpret PAC-learning results as Bayesian-learning results which say that in some special circumstances, it doesn’t matter exactly what prior one uses. In those circumstances, Bayesianism will work regardless.’
Of course that was 7 years ago and I probably barely scratched the surface of the PAC learning literature even then. Are there any PAC learning results which can’t be reinterpreted this way?
PAC-learning has no concept of prior or even of likelihood, and it allows you to learn regardless. If by “Bayesianism” you mean “learning”, then sure, PAC-learning is a type of Bayesianism. But I don’t see why it’s useful to view it that way (Bayes’s rule is never used, for example).
It’s true that it’s only possible to find local optima, but that’s true with any algorithm.
Whaaaat? Exhaustive search is an algorithm, it will find you the global optimum anywhere. For many structures of the search space it’s not hard to find the global optimum with appropriate algorithms.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is. If you have enough data, the maximum liklihood approaches bayesian inference.
In that sense stochastic gradient descent will also find the global optimum, since the randomness will eventually push it to every point possible. It will just take the eternity of the universe, but so will exhaustive search.
It’s also trivial to modify any local algorithm to be global, by occasionally moving around randomly. This is also effective in practice, at finding better local optima.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is. If you have enough data, the maximum likelihood approaches bayesian inference.
I used Maximum likelihood as an example, which is where you take the most probable hypothesis (parameters in a statistical model.) Instead of weighing many hypotheses the bayesian way. If you have enough data, the most probable hypothesis should converge to the correct one.
There is a view that everything that works must be an approximation of the ideal Bayesian method.
You can reformulate many problems in the Bayesian framework. This does not mean that everything is an approximation of Bayesianism—just like the ability to translate a novel into French does not mean that each novel is an approximation of a French roman.
It’s deeper than that. Bayesian probability theory is a mathematical law. Anything method that works must be computing an approximation of it. Just like Newtonian mechanics is a very close approximation of relativity. But they are not equivalent.
Bayesian probability theory is a mathematical law.
That is not true. The Bayes equation is mathematically correct. A theory is much wider—for example, Bayesians interpret probability as a degree of belief—is that also a mathematical law? You need a prior to start—what does the “mathematical law” say about priors?
Tell me, did Eliezer even address PAC learning in his writing? If not, I would say that he’s being over-confident and ignorant in stating that Bayesian probability is all there is and everything else is a mere approximation.
PAC-learning is definitely something we don’t talk about enough around here, but I don’t see what the conflict is with it being an approximation of Bayesian updating.
Here’s how I see it: You’re updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis, and then this idea that the best model in your space can still be approximately correct is expressible both on Bayesian and on frequentiist grounds (the approximate update over models being equivalent to an approximate update over predictions when you expect the universe to be modelable, and also the best model having a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution).
But I’m definitely a n00b at this stuff, so if you have other ideas (and reading recommendations) I’d be happy to hear them.
Here’s how I see it: You’re updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis, and then this idea that the best model in your space can still be approximately correct is expressible both on Bayesian and on frequentiist grounds (the approximate update over models being equivalent to an approximate update over predictions when you expect the universe to be modelable, and also the best model having a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution).
The “update” doesn’t use Bayes’s rule; there’s no prior; there’s no concept of belief. Why should we still consider it Bayesian? I mean, if you consider any learning to be an approximation of Bayesian updating, then sure, PAC-learning qualifies. But that begs the question, doesn’t it?
The set of possible Turing Machines is infinite. Whether you consider that to satisfy your personal definition of “seen” or “in reality” isn’t really relevant.
People here seem to really like Solomonoff induction, but I don’t think it’s all that relevant to learning in practice due to computational complexity.
Solomonoff induction is not computable. Trying to “approximate” it, by coming up with hypotheses similar to it, is probably also not computable.
If you replace Solomonoff induction with induction over programs that halt quickly or induction over circuits, it becomes computable, but it is still NP-hard. Again, approximating this is probably also NP-hard, depending on your definition of approximation.
Next, if you replace boolean circuits with neural nets, it is still hard to find the best neural net to fit the data. MCMC and gradient descent only find local optima. I mean, the fact that neural nets didn’t give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.
It’s not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don’t do that at all.
Not strictly true. Simply by overparameterizing the ANN, you can emulate parallel gradient descent (global optimization). Stochastic techniques—such as dropout—and tricks such as momentum can also help tunnel through local optima. In addition, the typical optimization hurdles tend to be saddlepoints rather than local optima.
Modern SGD mechanisms are powerful global optimizers.
It demonstrates nothing of the sort, but such a demonstration would be pointless anyway. Solomonoff induction is completely worthless—intractable—so you absolutely don’t want to do that anyway.
They are heuristic optimizers that have no guarantees of finding a global optimum. It’s strange to call them “powerful global optimizers”.
I believe that was my point.
Not at all strange. As I pointed out in the post above—modern SGD functions as a global optimizer due to clever tricks such as over-parameterization, dropout, and momentum. These optimizers are also suprisingly powerful as measured in terms of actual solution quality vs computational cost—indeed they dominate most of the landscape. It is completely disingenous to call the dominate solution anything other than powerful.
In practice strong guarantees are not important. What actually matters is finding a solution that is good enough relative to computational costs, where ‘good enough’ is determined according to some more complex business/economic utility metric, where solution quality follows diminishing returns. For sufficiently complex problems, finding the actual global optimum is rarely ideal—it is far too far to the right on the asymptotic diminishing returns curve.
That is true, but calling methods to find such solutions “global optimizers” is still a misnaming. Adding some noise/randomness to your local optimization does not a global optimizer make.
Of course, stochastic local-ish optimizers can be very powerful and very useful. They are just not global.
Disagree, but this may be now on the level of terminology.
By powerful global optimizer, I mean an optimizer which can find global optimums roughly using similar time/space to other techniques, and more importantly can find good enough solutions that are very close to the true global optimum but using much less time/space.
Actually it does. As I pointed out above, adding the right kind of noise/randomness allows SGD to find the global optimum (given enough chances/time). This is just trivially true—if you randomly search long enough and retain the best solution found so far, you are guaranteed to eventually find the optimal solution.
Thinking about it a bit, the “global” vs “local” property is not really a property of an optimizer, as it is a combined property of the optimizer and the search space (aka the error landscape). For an entirely arbitrary search space, only the exhaustive search and equivalents will guarantee you the global optimum. For simple search spaces (e.g. quadratic optimization) easy “local” things like hill-climbing will guarantee you the global optimum. If your solution is bounded, you always know how close to the theoretical best you are and can adjust the search accordingly (e.g. terminate early), etc.
Yes. In some cases simple (convex) search spaces even permit exact non-iterative analytic solutions which are faster than any iterative approach like SGD (although curiously, they are typically not dramatically faster—as they still need to analyze each example once, and the number of full epochs/iterations in SGD type approaches appears to be sublinear. )
The more advanced methods—natural gradient, hessian free, etc—expend extra computation to analyze the local search space and then use this analysis to make an improved nonlocal jump. This always has extra cost though, so these techniques just end up tending to match SGD in terms of net optimization speed.
Looking at it another way, when a researcher uses something like a convex optimization technique to solve a problem more quickly—what actually happened is the combined human+machine system used a global optimizer to solve the problem, with more of the algorithm running in the human’s brain than on the machine. ;)
Can you expand on the relationship between the two? As far as I can see, Bayesian learning starts from a prior distribution, and PAC learning (for PAC-learnable concepts) works independently of the distribution. That is, they’re solving average-case vs. worst-case or adversarial problems, which is a subject Eliezer has discussed here. Further discussion, mostly not by Eliezer, here. Those discussions don’t mention PAC though.
PAC-Bayesian hybrids appear to also be a thing, but all I know about that is the Google search I just did.
I suppose that in some sense, the Bayesian-vs-PAC comparison is the learning analogue of the average-case vs. worst-case comparison (where Bayesian=average case, PAC=worst case). It’s not a perfect analogy, but in both cases a big question is what if you don’t know the distribution/prior?
In the learning version, I think the problem is even more severe—I don’t know if an analogue of de-randomization works anymore, and I don’t think the “P=BPP” conjecture has an analogue.
The learning version has some additional complications, in that Bayesian inference is computationally intractable in many cases.
I suppose the Bayesian answer to that is that a probability distribution is a description of one’s knowledge, and that in principle, every state of knowledge, including total ignorance, can be represented as a prior distribution. In practice, one may not know how to do that. Fundamentalist Bayesians say that that is a weakness in our knowledge, while everyone else, from weak Bayesians to Sunday Bayesians, crypto-frequentists, and ardent frequentists, say it’s a weakness of Bayesian reasoning. Not being a statistician, I don’t need to take a view, although I incline against arguments deducing impossibility from ignorance.
What are you contrasting with learning? I don’t see a demarcation of principle between learning and any other sort of statistical prediction.
I don’t have any strong disagreements there. But consider: if we can learn well even without assuming any distribution or prior, isn’t that worth exploring? The fact that there is an alternative to Bayesianism—one that we can prove works (in some well-defined settings), and isn’t just naive frequentism—is pretty fascinating, isn’t it?
I’m contrasting randomized vs. deterministic algorithms, which Eliezer discussed in your linked article, with Bayesian vs. PAC learning models. The randomized vs. deterministic question shouldn’t really be considered learning, unless you want to call things like primality testing “learning”.
It’s true that it’s only possible to find local optima, but that’s true with any algorithm. Unless P=NP, perfect mathematically optimal AI is probably impossible. But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn’t a huge problem.
As to why we can have decent machine learning and not AGI, I don’t know. Approximating SI isn’t sufficient for one, you need to act on the models you find. But even so, I’m more trying to say it’s is a promising research direction, and better than other stuff. I’m not saying it’s a solved problem. Or that there are no further advances to be made.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is. If you have enough data, the maximum likelihood approaches bayesian inference. And that’s how NNs are typically trained. But it’s nice to have other hypotheses and not just assume that the most likely hypothesis is 100% correct.
That’s not even true. In practice, it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can’t).
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
That’s trivial to do. It’s not the problem here.
This might be true in some sense, but not in a meaningful one. PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
People on LW tend to be hard-core Bayesians who have never even heard of PAC learning, which is an entire branch of learning theory. I find it rather strange.
SGD seems to be sufficient to train neural networks on real world datasets. I would honestly argue that it’s better than whatever algorithms humans use. In that generally it learns much faster and far better on many problems.
That’s not sufficient for general intelligence. But I think the missing piece is not some super powerful optimization algorithm. A better optimizer is not going to change the field much or give us AGI.
And I also don’t think the magic of the human brain is in really good optimization. Most scientists don’t even believe that the brain can do anything as advanced as backpropagation, which requires accurately estimating gradients and propagating them backwards through long time steps. Most biologically plausible algorithms I’ve seen are super restricted or primitive.
I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one which exploits the structure of images. Like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same at another part.)
It’s really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.
Even if you create a bunch of really good models, simulating them forward thousands of steps for every possible series of actions is impossible.
If you can find the optimal NN, that basically lets you solve circuit minimization, an NP-hard task. This will allow you to find the best computationally-tractable hypothesis for any problem, which is similar to Solomonoff induction for practical purposes. It will certainly be a huge improvement over current NN approaches, and it may indeed lead to AGI. Unfortunately, it’s probably impossible.
I was only trying to say that if you’re finding the best NN, then simulating them is easy. I agree that this is not the full AIXI. I guess I misunderstood you—I thought you were trying to say that the reason NN doesn’t give us AGI is because they are hard to simulate.
I looked into PAC learning a bit when Scott Aaronson talked about it on his blog, and came to the following conclusion. ‘Instead of saying “PAC-learning and Bayesianism are two different useful formalisms for reasoning about learning and prediction” I think we can keep just Bayesianism and reinterpret PAC-learning results as Bayesian-learning results which say that in some special circumstances, it doesn’t matter exactly what prior one uses. In those circumstances, Bayesianism will work regardless.’
Of course that was 7 years ago and I probably barely scratched the surface of the PAC learning literature even then. Are there any PAC learning results which can’t be reinterpreted this way?
PAC-learning has no concept of prior or even of likelihood, and it allows you to learn regardless. If by “Bayesianism” you mean “learning”, then sure, PAC-learning is a type of Bayesianism. But I don’t see why it’s useful to view it that way (Bayes’s rule is never used, for example).
Whaaaat? Exhaustive search is an algorithm, it will find you the global optimum anywhere. For many structures of the search space it’s not hard to find the global optimum with appropriate algorithms.
Huh?
In that sense stochastic gradient descent will also find the global optimum, since the randomness will eventually push it to every point possible. It will just take the eternity of the universe, but so will exhaustive search.
It’s also trivial to modify any local algorithm to be global, by occasionally moving around randomly. This is also effective in practice, at finding better local optima.
There is a view that everything that works must be an approximation of the ideal Bayesian method. This is argued by Yudkowsky in Beautiful Probability and Searching for Bayes-Structure.
I used Maximum likelihood as an example, which is where you take the most probable hypothesis (parameters in a statistical model.) Instead of weighing many hypotheses the bayesian way. If you have enough data, the most probable hypothesis should converge to the correct one.
You can reformulate many problems in the Bayesian framework. This does not mean that everything is an approximation of Bayesianism—just like the ability to translate a novel into French does not mean that each novel is an approximation of a French roman.
It’s deeper than that. Bayesian probability theory is a mathematical law. Anything method that works must be computing an approximation of it. Just like Newtonian mechanics is a very close approximation of relativity. But they are not equivalent.
That is not true. The Bayes equation is mathematically correct. A theory is much wider—for example, Bayesians interpret probability as a degree of belief—is that also a mathematical law? You need a prior to start—what does the “mathematical law” say about priors?
Tell me, did Eliezer even address PAC learning in his writing? If not, I would say that he’s being over-confident and ignorant in stating that Bayesian probability is all there is and everything else is a mere approximation.
PAC-learning is definitely something we don’t talk about enough around here, but I don’t see what the conflict is with it being an approximation of Bayesian updating.
Here’s how I see it: You’re updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis, and then this idea that the best model in your space can still be approximately correct is expressible both on Bayesian and on frequentiist grounds (the approximate update over models being equivalent to an approximate update over predictions when you expect the universe to be modelable, and also the best model having a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution).
But I’m definitely a n00b at this stuff, so if you have other ideas (and reading recommendations) I’d be happy to hear them.
The “update” doesn’t use Bayes’s rule; there’s no prior; there’s no concept of belief. Why should we still consider it Bayesian? I mean, if you consider any learning to be an approximation of Bayesian updating, then sure, PAC-learning qualifies. But that begs the question, doesn’t it?
You can’t do an exhaustive search on an infinite set.
I haven’t seen any infinite sets in reality.
The set of possible Turing Machines is infinite. Whether you consider that to satisfy your personal definition of “seen” or “in reality” isn’t really relevant.