It’s true that it’s only possible to find local optima, but that’s true with any algorithm. Unless P=NP, a perfect, mathematically optimal AI is probably impossible. But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn’t a huge problem.
As to why we can have decent machine learning and not AGI, I don’t know. Approximating SI isn’t sufficient, for one; you need to act on the models you find. But even so, I’m more trying to say it’s a promising research direction, and better than other stuff. I’m not saying it’s a solved problem, or that there are no further advances to be made.
It’s not even clear that a learning program must approximate Bayesian inference.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is. If you have enough data, maximum likelihood approaches Bayesian inference. And that’s how NNs are typically trained. But it’s nice to have other hypotheses and not just assume that the most likely hypothesis is 100% correct.
But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn’t a huge problem.
That’s not even true. In practice, it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can’t).
As to why we can have decent machine learning and not AGI, I don’t know.
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
Approximating SI isn’t sufficient, for one; you need to act on the models you find.
That’s trivial to do. It’s not the problem here.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is.
This might be true in some sense, but not in a meaningful one. PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
People on LW tend to be hard-core Bayesians who have never even heard of PAC learning, which is an entire branch of learning theory. I find it rather strange.
That’s not even true. In practice, it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can’t).
SGD seems to be sufficient to train neural networks on real-world datasets. I would honestly argue that it’s better than whatever algorithms humans use, in that it generally learns much faster and performs far better on many problems.
That’s not sufficient for general intelligence. But I think the missing piece is not some super powerful optimization algorithm. A better optimizer is not going to change the field much or give us AGI.
And I also don’t think the magic of the human brain is in really good optimization. Most scientists don’t even believe that the brain can do anything as advanced as backpropagation, which requires accurately estimating gradients and propagating them backwards through long time steps. Most biologically plausible algorithms I’ve seen are super restricted or primitive.
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one that exploits the structure of images, like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same as at another part).
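For a rough sense of how much that structure buys, here is a toy parameter count (the layer sizes are made up purely for illustration) comparing a fully-connected layer with a convolutional one that reuses a single small filter across the whole image:

```python
# Rough parameter-count comparison for a 64x64 grayscale image.
# The layer sizes are made up purely for illustration.

h, w = 64, 64
n_pixels = h * w                             # 4096 inputs

# Fully-connected layer: every hidden unit gets its own weight for every pixel.
n_hidden = 1024
dense_params = n_pixels * n_hidden           # ~4.2 million weights

# Convolutional layer: one small filter is slid across the whole image,
# so the "edge detector" learned at one location is reused everywhere.
kernel_size, n_filters = 3, 32
conv_params = kernel_size * kernel_size * n_filters   # 288 weights (biases ignored)

print(dense_params, conv_params)
```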
That’s trivial to do. It’s not the problem here.
It’s really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.
Even if you create a bunch of really good models, simulating them forward thousands of steps for every possible series of actions is impossible.
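To make that cost concrete, here is a toy sketch of that kind of brute-force planning (a made-up one-dimensional model; real AIXI also mixes over all computable hypotheses, which this ignores). The point is the count: with A actions and a horizon of T steps there are A**T sequences to simulate forward.

```python
from itertools import product

def plan(model, actions, horizon, state):
    """Brute-force planning: simulate every action sequence and keep the best."""
    best_seq, best_reward = None, float("-inf")
    for seq in product(actions, repeat=horizon):     # len(actions) ** horizon sequences
        s, total = state, 0.0
        for a in seq:
            s, r = model(s, a)                       # one step of forward simulation
            total += r
        if total > best_reward:
            best_seq, best_reward = seq, total
    return best_seq

# Toy model: the state is a number, actions nudge it toward a target of 10.
toy_model = lambda s, a: (s + a, -abs(s + a - 10))

# 3 actions, horizon 8 -> 3**8 = 6561 rollouts; a horizon of 1000 is hopeless.
print(plan(toy_model, actions=[-1, 0, 1], horizon=8, state=0))
```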
I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one that exploits the structure of images, like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same as at another part).
If you can find the optimal NN, that basically lets you solve circuit minimization, an NP-hard task. This will allow you to find the best computationally-tractable hypothesis for any problem, which is similar to Solomonoff induction for practical purposes. It will certainly be a huge improvement over current NN approaches, and it may indeed lead to AGI. Unfortunately, it’s probably impossible.
It’s really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.
I was only trying to say that if you’re finding the best NN, then simulating it is easy. I agree that this is not the full AIXI. I guess I misunderstood you; I thought you were trying to say that the reason NNs don’t give us AGI is that they are hard to simulate.
PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
I looked into PAC learning a bit when Scott Aaronson talked about it on his blog, and came to the following conclusion: ‘Instead of saying “PAC-learning and Bayesianism are two different useful formalisms for reasoning about learning and prediction” I think we can keep just Bayesianism and reinterpret PAC-learning results as Bayesian-learning results which say that in some special circumstances, it doesn’t matter exactly what prior one uses. In those circumstances, Bayesianism will work regardless.’
Of course that was 7 years ago and I probably barely scratched the surface of the PAC learning literature even then. Are there any PAC learning results which can’t be reinterpreted this way?
PAC-learning has no concept of prior or even of likelihood, and it allows you to learn regardless. If by “Bayesianism” you mean “learning”, then sure, PAC-learning is a type of Bayesianism. But I don’t see why it’s useful to view it that way (Bayes’s rule is never used, for example).
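For reference, the standard finite-hypothesis-class PAC bound (realizable case) makes that concrete: the statement only mentions the number of hypotheses, the error tolerance, and the confidence, with no prior or likelihood anywhere. A small sketch with made-up numbers:

```python
import math

# Classic PAC bound for a finite hypothesis class H (realizable case):
# if a learner sees at least m >= (1/eps) * (ln|H| + ln(1/delta)) examples and
# outputs any hypothesis consistent with all of them, then with probability
# at least 1 - delta its true error is at most eps. No prior over H appears.

def pac_sample_size(hypothesis_count, eps, delta):
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / eps)

print(pac_sample_size(hypothesis_count=10**6, eps=0.05, delta=0.01))   # 369 examples
```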
It’s true that it’s only possible to find local optima, but that’s true with any algorithm.
Whaaaat? Exhaustive search is an algorithm; it will find you the global optimum anywhere. For many structures of the search space it’s not hard to find the global optimum with appropriate algorithms.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is. If you have enough data, maximum likelihood approaches Bayesian inference.
Huh?
In that sense stochastic gradient descent will also find the global optimum, since the randomness will eventually push it to every possible point. It will just take an eternity, but so will exhaustive search.
It’s also trivial to modify any local algorithm to be global, by occasionally moving around randomly. This is also effective in practice at finding better local optima.
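A minimal sketch of that idea, on a made-up one-dimensional objective: plain hill climbing plus random restarts. In the limit of enough restarts it samples every basin and finds the global optimum; the catch, as above, is how long that takes.

```python
import math
import random

def hill_climb(f, x, steps=1000, step_size=0.1):
    """Plain local search: accept a random nearby point only if it improves f."""
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if f(candidate) > f(x):
            x = candidate
    return x

def random_restarts(f, n_restarts=50, low=-10.0, high=10.0):
    """The 'occasionally move around randomly' part: restart from random points."""
    results = [hill_climb(f, random.uniform(low, high)) for _ in range(n_restarts)]
    return max(results, key=f)

# A bumpy objective with many local optima; the global maximum is at x = 0.
bumpy = lambda x: math.cos(3 * x) - 0.1 * x * x
print(random_restarts(bumpy))
```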
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is. If you have enough data, maximum likelihood approaches Bayesian inference.
I used maximum likelihood as an example: you take the single most probable hypothesis (the parameters of a statistical model) instead of weighing many hypotheses the Bayesian way. If you have enough data, the most probable hypothesis should converge to the correct one.
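A toy illustration with made-up coin-flip data: maximum likelihood keeps only the single most probable parameter value, the Bayesian posterior keeps a whole distribution, and as the data grows the posterior concentrates around the maximum-likelihood estimate.

```python
# Coin-flip example (numbers made up): estimate the probability of heads.

def summarize(heads, tails):
    mle = heads / (heads + tails)            # single most likely hypothesis
    # Bayesian treatment with a uniform Beta(1, 1) prior:
    # the posterior is Beta(heads + 1, tails + 1).
    a, b = heads + 1, tails + 1
    post_mean = a / (a + b)
    post_var = a * b / ((a + b) ** 2 * (a + b + 1))   # uncertainty the MLE discards
    return mle, post_mean, post_var

print(summarize(7, 3))       # small sample: MLE 0.70, posterior mean ~0.67, wide posterior
print(summarize(700, 300))   # 100x the data: the estimates agree, posterior much narrower
```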
There is a view that everything that works must be an approximation of the ideal Bayesian method. This is argued by Yudkowsky in Beautiful Probability and Searching for Bayes-Structure.
You can reformulate many problems in the Bayesian framework. This does not mean that everything is an approximation of Bayesianism—just like the ability to translate a novel into French does not mean that each novel is an approximation of a French roman.
It’s deeper than that. Bayesian probability theory is a mathematical law. Any method that works must be computing an approximation of it, just like Newtonian mechanics is a very close approximation of relativity, but they are not equivalent.
Bayesian probability theory is a mathematical law.
That is not true. The Bayes equation is mathematically correct. A theory is much wider—for example, Bayesians interpret probability as a degree of belief—is that also a mathematical law? You need a prior to start—what does the “mathematical law” say about priors?
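A toy illustration of the prior point (made-up coin-flip counts): Bayes’ rule itself is silent about which prior to start from, and with little data the choice visibly changes the answer, while with a lot of data it washes out.

```python
# Toy illustration (made-up data): Bayes' rule says nothing about which
# prior to use, and with little data the choice of prior matters a lot.

def posterior_mean(heads, tails, prior_a, prior_b):
    # Beta(prior_a, prior_b) prior + binomial likelihood -> Beta posterior.
    return (heads + prior_a) / (heads + tails + prior_a + prior_b)

data = (3, 1)   # 3 heads, 1 tail
print(posterior_mean(*data, 1, 1))     # uniform prior: ~0.67
print(posterior_mean(*data, 10, 10))   # strong fair-coin prior: ~0.54

more_data = (300, 100)                 # with enough data the prior washes out
print(posterior_mean(*more_data, 1, 1), posterior_mean(*more_data, 10, 10))
```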
Tell me, did Eliezer even address PAC learning in his writing? If not, I would say that he’s being over-confident and ignorant in stating that Bayesian probability is all there is and everything else is a mere approximation.
PAC-learning is definitely something we don’t talk about enough around here, but I don’t see what the conflict is with it being an approximation of Bayesian updating.
Here’s how I see it: You’re updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis, and then this idea that the best model in your space can still be approximately correct is expressible both on Bayesian and on frequentist grounds (the approximate update over models being equivalent to an approximate update over predictions when you expect the universe to be modelable, and also the best model having a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution).
But I’m definitely a n00b at this stuff, so if you have other ideas (and reading recommendations) I’d be happy to hear them.
Here’s how I see it: You’re updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis, and then this idea that the best model in your space can still be approximately correct is expressible both on Bayesian and on frequentist grounds (the approximate update over models being equivalent to an approximate update over predictions when you expect the universe to be modelable, and also the best model having a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution).
The “update” doesn’t use Bayes’s rule; there’s no prior; there’s no concept of belief. Why should we still consider it Bayesian? I mean, if you consider any learning to be an approximation of Bayesian updating, then sure, PAC-learning qualifies. But that begs the question, doesn’t it?
You can’t do an exhaustive search on an infinite set.
I haven’t seen any infinite sets in reality.
The set of possible Turing Machines is infinite. Whether you consider that to satisfy your personal definition of “seen” or “in reality” isn’t really relevant.