But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn’t a huge problem.
That’s not even true. In practice, it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can’t).
As to why we can have decent machine learning and not AGI, I don’t know.
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
For one, approximating SI isn’t sufficient; you also need to act on the models you find.
That’s trivial to do. It’s not the problem here.
Everything approximates Bayesian inference; it’s just a matter of how ideal the approximation is.
This might be true in some sense, but not in a meaningful one. PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
People on LW tend to be hard-core Bayesians who have never even heard of PAC learning, which is an entire branch of learning theory. I find it rather strange.
That’s not even true. In practice, it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can’t).
SGD seems to be sufficient to train neural networks on real world datasets. I would honestly argue that it’s better than whatever algorithms humans use, in that it generally learns much faster and far better on many problems.
That’s not sufficient for general intelligence. But I think the missing piece is not some super powerful optimization algorithm. A better optimizer is not going to change the field much or give us AGI.
And I also don’t think the magic of the human brain is in really good optimization. Most scientists don’t even believe that the brain can do anything as advanced as backpropagation, which requires accurately estimating gradients and propagating them backwards through long time steps. Most biologically plausible algorithms I’ve seen are super restricted or primitive.
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one which exploits the structure of images. Like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same at another part.)
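To put rough numbers on the weight-sharing point, here is a minimal sketch that counts parameters for one fully connected layer versus one convolutional layer. The image size (224x224x3), the 64 units/filters, and the 3x3 kernel are assumptions chosen for illustration, and the two layers do not produce outputs of the same shape, so this is only a comparison of how many free weights the optimizer has to fit, not a claim about accuracy.

```python
def dense_params(height, width, channels, hidden_units):
    """Weights plus biases for one fully connected layer over a flattened image."""
    inputs = height * width * channels
    return inputs * hidden_units + hidden_units


def conv_params(kernel_size, in_channels, out_channels):
    """Weights plus biases for one conv layer.

    The count does not depend on image size, because the same kernel
    (e.g. an edge detector) is reused at every position.
    """
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


# Illustrative sizes only (assumed, not from the thread).
dense = dense_params(224, 224, 3, 64)
conv = conv_params(3, 3, 64)
print(f"dense layer: {dense} parameters")
print(f"conv layer:  {conv} parameters")
```

With these assumed sizes the dense layer has several thousand times more free parameters than the conv layer, which is the sense in which the architecture, and not just the optimizer, constrains what gets learned.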
That’s trivial to do. It’s not the problem here.
It’s really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.
Even if you create a bunch of really good models, simulating them forward thousands of steps for every possible series of actions is impossible.
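A back-of-the-envelope sketch of why this blows up: it counts open-loop action sequences for a brute-force planner. This is not AIXI itself (which also branches over observations, making things worse); the action count, horizon, and number of models are made-up values for illustration.

```python
def num_action_sequences(num_actions, horizon):
    """Number of distinct action sequences of a given length."""
    return num_actions ** horizon


def rollouts_needed(num_actions, horizon, num_models):
    """Simulations needed if every model is rolled forward on every sequence."""
    return num_models * num_action_sequences(num_actions, horizon)


# Assumed toy numbers: 4 possible actions, 10 candidate models.
short = rollouts_needed(num_actions=4, horizon=10, num_models=10)
print(f"horizon 10: {short} rollouts")

# At a horizon of 1000 steps the count is too large to print usefully,
# so report how many decimal digits it has instead.
long_count = num_action_sequences(num_actions=4, horizon=1000)
print(f"horizon 1000: a number with {len(str(long_count))} digits")
```

Even with only four actions, the count grows exponentially in the horizon, so any practical planner has to prune, sample, or learn a policy instead of enumerating.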
I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one which exploits the structure of images. Like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same at another part.)
If you can find the optimal NN, that basically lets you solve circuit minimization, an NP-hard task. This will allow you to find the best computationally-tractable hypothesis for any problem, which is similar to Solomonoff induction for practical purposes. It will certainly be a huge improvement over current NN approaches, and it may indeed lead to AGI. Unfortunately, it’s probably impossible.
It’s really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.
I was only trying to say that if you’re finding the best NN, then simulating it is easy. I agree that this is not the full AIXI. I guess I misunderstood you; I thought you were trying to say that the reason NNs don’t give us AGI is that they are hard to simulate.
PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
I looked into PAC learning a bit when Scott Aaronson talked about it on his blog, and came to the following conclusion. ‘Instead of saying “PAC-learning and Bayesianism are two different useful formalisms for reasoning about learning and prediction” I think we can keep just Bayesianism and reinterpret PAC-learning results as Bayesian-learning results which say that in some special circumstances, it doesn’t matter exactly what prior one uses. In those circumstances, Bayesianism will work regardless.’
Of course that was 7 years ago and I probably barely scratched the surface of the PAC learning literature even then. Are there any PAC learning results which can’t be reinterpreted this way?
PAC-learning has no concept of prior or even of likelihood, and it allows you to learn regardless. If by “Bayesianism” you mean “learning”, then sure, PAC-learning is a type of Bayesianism. But I don’t see why it’s useful to view it that way (Bayes’s rule is never used, for example).
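For concreteness, here is a sketch of the textbook PAC bound for a finite hypothesis class in the realizable case: any learner that outputs a hypothesis consistent with m >= (ln|H| + ln(1/delta)) / epsilon samples has error at most epsilon with probability at least 1 - delta, for every data distribution. Note that the only inputs are the size of the class, epsilon, and delta; no prior or likelihood appears. The specific numbers below are assumptions for illustration.

```python
import math


def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Samples sufficient for a consistent learner over a finite class.

    Realizable case: m >= (ln|H| + ln(1/delta)) / epsilon.
    """
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


# Assumed example: 2**20 hypotheses, 10% error, 95% confidence.
m = pac_sample_bound(hypothesis_count=2 ** 20, epsilon=0.1, delta=0.05)
print(f"samples sufficient: {m}")
```

The bound depends on the hypothesis class only through its size, which is why it is natural to read it either as "no prior needed" or, in the reinterpretation quoted above, as "the prior doesn't matter here".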