It’s true that it’s only possible to find local optima, but that’s true of any algorithm.
Whaaaat? Exhaustive search is an algorithm, and it will find you the global optimum anywhere. For many search-space structures it’s not hard to find the global optimum with an appropriate algorithm.
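For a finite search space the guarantee is easy to demonstrate in a few lines; a minimal sketch (the objective here is a made-up example with several local optima, not anything from the discussion):

```python
# Exhaustive search: evaluate every candidate and keep the best.
# On any finite domain this is guaranteed to return the global optimum.

def exhaustive_search(candidates, objective):
    best = None
    best_value = float("-inf")
    for x in candidates:
        value = objective(x)
        if value > best_value:
            best, best_value = x, value
    return best, best_value

# Example: a bumpy objective with many local optima on an integer grid.
f = lambda x: -(x - 37) ** 2 + 10 * (x % 5)
best_x, best_f = exhaustive_search(range(100), f)
```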
In that sense stochastic gradient descent will also find the global optimum, since the randomness will eventually push it to every possible point. It will just take an eternity, but so will exhaustive search.
It’s also trivial to modify any local algorithm into a global one by occasionally moving to a random point. This is also effective in practice at finding better local optima.
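A minimal sketch of that trick, as random-restart hill climbing on the integers (the objective, neighborhood, and restart count are illustrative assumptions, not anyone’s specific proposal):

```python
import random

def hill_climb(objective, start, steps=200):
    """Greedy local search on the integers: move to a better neighbor."""
    x = start
    for _ in range(steps):
        best = max([x - 1, x + 1], key=objective)
        if objective(best) <= objective(x):
            return x          # local optimum: no neighbor improves
        x = best
    return x

def hill_climb_with_restarts(objective, domain, restarts=20):
    """Occasionally jump to a random point; keep the best local optimum found."""
    best = hill_climb(objective, random.choice(domain))
    for _ in range(restarts):
        x = hill_climb(objective, random.choice(domain))
        if objective(x) > objective(best):
            best = x
    return best
```

With enough restarts over a finite domain, every basin of attraction eventually gets a starting point, so the global optimum is found with probability approaching 1.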
Everything approximates Bayesian inference; it’s just a matter of how good the approximation is. If you have enough data, maximum likelihood approaches Bayesian inference.
I used maximum likelihood as an example: you take the single most probable hypothesis (the parameters of a statistical model) instead of weighing many hypotheses the Bayesian way. If you have enough data, the most probable hypothesis should converge to the correct one.
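The convergence can be seen concretely in a conjugate coin-flip model, where the posterior has a closed form (a sketch; the uniform Beta(1, 1) prior and the 0.7 bias are arbitrary choices):

```python
# Coin flipping: compare the maximum-likelihood estimate with the
# Bayesian posterior mean under a uniform Beta(1, 1) prior.

def mle(heads, n):
    return heads / n

def posterior_mean(heads, n, a=1.0, b=1.0):
    # The posterior is Beta(a + heads, b + tails); this is its mean.
    return (a + heads) / (a + b + n)

# As n grows, the prior's pull (the +a and +b terms) is swamped by the
# data, so the two estimates converge.
for n in (10, 1000, 100000):
    heads = int(0.7 * n)          # pretend the true bias is 0.7
    print(n, mle(heads, n), posterior_mean(heads, n))
```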
There is a view that everything that works must be an approximation of the ideal Bayesian method. This is argued by Yudkowsky in Beautiful Probability and Searching for Bayes-Structure.
You can reformulate many problems in the Bayesian framework. This does not mean that everything is an approximation of Bayesianism—just like the ability to translate a novel into French does not mean that each novel is an approximation of a French roman.
It’s deeper than that. Bayesian probability theory is a mathematical law. Any method that works must be computing an approximation of it, just as Newtonian mechanics is a very close approximation of relativity without being equivalent to it.
Bayesian probability theory is a mathematical law.
That is not true. The Bayes equation is mathematically correct, but a theory is much wider. For example, Bayesians interpret probability as a degree of belief: is that also a mathematical law? And you need a prior to start: what does the “mathematical law” say about priors?
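The equation really is silent about the prior, though with enough data different priors tend to agree; a toy sketch (the two Beta priors and the 0.4 coin bias are arbitrary):

```python
# Bayes' rule says nothing about where the prior comes from.
# Two agents with very different Beta priors update on the same coin flips.

def posterior(a, b, heads, tails):
    return a + heads, b + tails          # conjugate Beta update

def mean(a, b):
    return a / (a + b)

optimist = (20.0, 1.0)    # prior strongly favoring heads (arbitrary choice)
skeptic = (1.0, 20.0)     # prior strongly favoring tails

for flips in (5, 50, 5000):
    heads = int(0.4 * flips)
    tails = flips - heads
    m1 = mean(*posterior(*optimist, heads, tails))
    m2 = mean(*posterior(*skeptic, heads, tails))
    print(flips, round(m1, 3), round(m2, 3))
```

After a handful of flips the two posteriors disagree sharply; after thousands they nearly coincide, since the data swamps the prior.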
Tell me, did Eliezer even address PAC learning in his writing? If not, I would say that he’s being overconfident and ignorant in stating that Bayesian probability is all there is and everything else is a mere approximation.
PAC-learning is definitely something we don’t talk about enough around here, but I don’t see what the conflict is with it being an approximation of Bayesian updating.
Here’s how I see it: you’re updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis. The idea that the best model in your space can still be approximately correct is expressible on both Bayesian and frequentist grounds: the approximate update over models is equivalent to an approximate update over predictions when you expect the universe to be modelable, and the best model also has a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution.
But I’m definitely a n00b at this stuff, so if you have other ideas (and reading recommendations) I’d be happy to hear them.
The “update” doesn’t use Bayes’s rule; there’s no prior; there’s no concept of belief. Why should we still consider it Bayesian? I mean, if you consider any learning to be an approximation of Bayesian updating, then sure, PAC-learning qualifies. But that begs the question, doesn’t it?
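For contrast, here is the standard finite-class PAC sample-complexity (Occam) bound in the realizable case; note that nothing in it is a prior or a degree of belief, only the hypothesis-class size and the (ε, δ) guarantee:

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Occam bound for a finite hypothesis class (realizable case):
    with m >= (1/eps) * (ln |H| + ln(1/delta)) samples, any hypothesis
    consistent with the data has true error <= eps with probability
    at least 1 - delta, for ANY data distribution."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# No prior over hypotheses appears anywhere: only |H|, epsilon, and delta.
m = pac_sample_size(hypothesis_count=10**6, epsilon=0.01, delta=0.001)
```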
You can’t do an exhaustive search on an infinite set.
I haven’t seen any infinite sets in reality.
The set of possible Turing Machines is infinite. Whether you consider that to satisfy your personal definition of “seen” or “in reality” isn’t really relevant.