But there is no way to solve that problem exactly without doing a whole lot of work.[3] For a couple hundred cities, we’re talking about more work than you could fit into the lifespan of the universe with computers millions of times stronger than the best supercomputers in existence.
It’s true that an exact solution might be intractable, but approximate solutions are often good enough. According to Wikipedia:
Modern methods can find solutions for extremely large problems (millions of cities) within a reasonable time which are with a high probability just 2–3% away from the optimal solution.
Perhaps there is a yet-undiscovered heuristic algorithm for approximating Solomonoff induction relatively efficiently.
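For concreteness, here is a minimal sketch of the kind of cheap heuristic the Wikipedia passage above is pointing at: greedy nearest-neighbour construction followed by 2-opt local improvement. The random instance and function names below are purely illustrative; production solvers (Concorde’s heuristics, LKH, etc.) are far more sophisticated.

```python
import math
import random

def tour_length(cities, tour):
    """Total length of the closed tour visiting cities in the given order."""
    return sum(math.dist(cities[tour[i]], cities[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def nearest_neighbor_tour(cities):
    """Greedy construction: always hop to the closest unvisited city."""
    unvisited = set(range(1, len(cities)))
    tour = [0]
    while unvisited:
        nxt = min(unvisited, key=lambda j: math.dist(cities[tour[-1]], cities[j]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def two_opt(cities, tour):
    """Local improvement: reverse a segment whenever doing so shortens the tour."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(cities, candidate) < tour_length(cities, tour):
                    tour, improved = candidate, True
    return tour

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(30)]
tour = two_opt(cities, nearest_neighbor_tour(cities))
print(round(tour_length(cities, tour), 3))
```

Neither step is exact, and 2-opt only reaches a local optimum, but this is the general shape of the “good enough, fast” tradeoff the quote describes.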
Yes, I wasn’t sure if it was wise to use TSP as an example for that reason. Originally I wrote it using the Hamiltonian Path problem, but thought a non-technical reader would be more able to quickly understand TSP. Maybe that was a mistake. It also seems I may have underestimated how technical my audience would be.
But your point about heuristics is right. That’s basically what I think an AGI based on LLMs would do to figure out the world. However, I doubt there would be one heuristic which could do Solomonoff induction in all scenarios, or even most. Which means you’d have to select the right one, which means you’d need a selection criterion, which takes us back to my original points.
Perhaps there is a yet-undiscovered heuristic algorithm for approximating Solomonoff induction relatively efficiently.
There are: approximate inference on neural networks, such as variants of SGD. Neural networks are a natural universal circuit language, so you can’t get more than a constant improvement by moving to another universal representation. And in the class of all learning algorithms which approximately converge to full Bayesian inference (i.e. Solomonoff induction), SGD-style Langevin dynamics are also unique and difficult to beat in practice; the differences between that and full Bayesian inference reduce to higher-order corrections which rapidly fall off in utility/op.
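To make the “SGD-style Langevin dynamics” claim concrete, here is a minimal sketch of stochastic gradient Langevin dynamics on a toy problem (inferring a Gaussian mean): an ordinary minibatch gradient step plus Gaussian noise scaled to the step size, so the iterates approximately sample the Bayesian posterior rather than just converging to its mode. The toy data, prior, and step size are illustrative assumptions, not anything specified in this thread.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: infer the mean theta of a unit-variance Gaussian,
# with prior theta ~ N(0, 10), from 1000 observations.
data = rng.normal(2.0, 1.0, size=1000)

def grad_log_posterior(theta, batch):
    """Stochastic gradient of the log posterior, rescaled to the full data set."""
    grad_prior = -theta / 10.0
    grad_lik = (len(data) / len(batch)) * np.sum(batch - theta)
    return grad_prior + grad_lik

eps = 1e-4                                  # step size, held fixed for simplicity
theta, samples = 0.0, []
for t in range(5000):
    batch = rng.choice(data, size=32)
    noise = rng.normal(0.0, np.sqrt(eps))   # Langevin noise ~ N(0, eps)
    theta += 0.5 * eps * grad_log_posterior(theta, batch) + noise
    samples.append(theta)

# The sample mean/spread roughly track the posterior mean and uncertainty
# (the fixed step size and minibatch noise inflate the spread somewhat).
print(np.mean(samples[1000:]), np.std(samples[1000:]))
```

Dropping the injected noise recovers plain SGD; the point of the claim above is that the remaining gap to exact posterior inference lives in corrections like these, not in some different family of algorithm.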
Do you have a link to a more in-depth defense of this claim?

I mean, it’s like 4 or 5 claims? So I’m not sure which ones you want more in-depth on, but:
That neural networks are universal is obvious: as arithmetic/analog circuits they fully generalize (reduce to) binary circuits, which are circuit-complete.
A. Full Bayesian inference and Solomonoff induction are equivalent; fairly obvious.
B. Approximate convergence is near guaranteed if the model is sufficiently overcomplete and trained long enough with correct techniques (normalization, regularization, etc.), since in the worst case you can recover exhaustive exploration à la Solomonoff. But SGD on a NN is exponentially faster than exhaustive program search, as it can explore not a single solution at a time but a number of solutions (sparse subcircuits embedded in the overcomplete model) that is exponential in NN depth (see lottery tickets, dropout, and sum-product networks).
C. “The differences between that and full Bayesian inference reduce to higher-order corrections which rapidly fall off in utility/op.” This is known perhaps experimentally, in the sense that the research community has now conducted large-scale, extensive (and even often automated) exploration of much of the entire space of higher-order corrections to SGD, and has come up with almost nothing much better than stupidly simple, inaccurate, but low-cost 2nd-order approximations like Adam. (The research community has produced an endless stream of higher-order optimizers that improve the theoretical convergence rate, and near zero that improve wall-time convergence speed.) I do think there is still some room for improvement here, but not anything remotely like “a new category of algorithm”.
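For reference, this is roughly all the “stupidly simple” 2nd-order approximation amounts to: Adam keeps running averages of the gradient and its square and uses the square as a cheap diagonal preconditioner. A minimal sketch on a toy quadratic, with illustrative hyperparameters:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: running first/second moment estimates plus bias correction.

    v is just a running mean of squared gradients, i.e. a crude diagonal proxy
    for curvature, which is what keeps the per-step cost so low.
    """
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (per-parameter scale)
    m_hat = m / (1 - b1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = ||theta||^2.
theta = np.array([3.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 10001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # both coordinates end up near 0
```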
But part of my claim is simply that modern DL techniques encompass nearly all of the optimization that is relevant; the field has simply eaten everything, such that any supposedly new research track not already considered would be just a nomenclature distinction at this point.
Neural networks being universal approximators doesn’t mean they do as well at distributing uncertainty as Solomonoff, right (I’m not entirely sure about this)? Also, are practical neural nets actually close to being universal?
in the worst case you can recover exhaustive exploration à la Solomonoff
Do you mean that this is possible in principle, or that this is a limit of SGD training?
known perhaps experimentally, in the sense that the research community has now conducted large-scale, extensive (and even often automated) exploration of much of the entire space of higher-order corrections to SGD
I read your original claim as “SGD is known to approximate full Bayesian inference, and the gap between SGD and full inference is known to be small”. Experimental evidence that SGD performs competitively does not substantiate that claim, in my view.
Also, are practical neural nets actually close to being universal?
Trivially so, in that they can obviously encode a binary circuit equivalent to a CPU, and also in practice, in the sense that transformers descend from related research (neural Turing machines, memory networks, etc.) and are universal.
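As a toy illustration of the “NNs can encode binary circuits” step: a single hidden ReLU unit plus a linear readout computes NAND exactly on 0/1 inputs, and NAND is functionally complete, so compositions of such units realize any finite binary circuit (a CPU’s combinational logic included). The specific weights below are just one convenient hand-picked choice.

```python
import itertools

def relu(x):
    return max(0.0, x)

def nand_net(a, b):
    """Tiny hand-weighted network: one hidden ReLU unit plus a linear readout.

    The hidden unit relu(a + b - 1) computes AND on {0, 1} inputs;
    the readout 1 - AND(a, b) is NAND.
    """
    hidden = relu(1.0 * a + 1.0 * b - 1.0)
    return 1.0 - hidden

for a, b in itertools.product([0, 1], repeat=2):
    print(a, b, int(nand_net(a, b)))  # prints 1, 1, 1, 0: the NAND truth table
```

This only speaks to representation, of course; whether SGD finds such weights is the separate question about search.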
Do you mean that this is possible in principle, or that this is a limit of SGD training?
I mean in the worst case, where you have some function that is actually really hard to learn: as long as you have enough data (or can generate it), big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search.
“SGD is known to approximate full Bayesian inference, and the gap between SGD and full inference is known to be small”
Depends on what you mean by “gap”—whether you are measuring inference per unit data or inference per unit compute.
There are clearly scenarios where you can get faster convergence by better using/approximating the higher-order terms, but that obviously is not remotely sufficient to beat SGD, as any such extra complexity must also pay for itself against the cost of compute.
Of course if you are data starved, then that obviously changes things.
they can obviously encode a binary circuit equivalent to a CPU
A CPU by itself is not universal. Are you saying memory augmented neural networks are practically close to universality?
as long as you have enough data (or can generate it), big overcomplete NNs with SGD can obviously perform a strict improvement over exhaustive search
Sorry, I’m being slow here:
Solomonoff does exhaustive search for any amount of data; is part of your claim that as data → infinity, NN + SGD → Solomonoff?
How do we actually do this improved exhaustive search? Do we know that SGD gets us to a global minimum in the end?
Any useful CPU is, by my definition, Turing universal.
Solomonoff does exhaustive search for any amount of data; is part of your claim that as data → infinity, NN + SGD → Solomonoff?
You can think of Solomonoff as iterating over all programs/circuits by size, evaluating each on all the data, etc.

A sufficiently wide NN + SGD can search the full circuit space up to a depth D across the data set in an efficient way, reusing all subcomputations across sparse subcircuit solutions (lottery tickets).
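Here is a minimal sketch of that “iterate over programs by size, keep what fits, mix” picture. The program class below is a deliberately toy, non-universal stand-in (lookup tables over the last k bits, with a crude description-length prior), so it only shows the shape of the computation, not actual Solomonoff induction:

```python
from itertools import product

def predict_next_bit(history, max_order=3):
    """Toy Solomonoff-style mixture over a non-universal program class.

    Real Solomonoff induction enumerates all programs for a universal machine,
    weighted 2^-length. Here a "program" is a lookup table mapping the last
    k bits to the next bit (k = 0..max_order), with description length taken
    to be 2^k + k bits. Assumes at least one table fits the history.
    """
    weight_one = weight_total = 0.0
    for k in range(max_order + 1):
        for table in product([0, 1], repeat=2 ** k):   # one output bit per k-bit context
            prior = 2.0 ** -(2 ** k + k)
            def run(i):                                # the program's prediction for bit i
                ctx = history[i - k:i]
                return table[sum(b << j for j, b in enumerate(ctx))]
            # Keep the program only if it reproduces every observed bit.
            if all(run(i) == history[i] for i in range(k, len(history))):
                weight_total += prior
                weight_one += prior * run(len(history))
    return weight_one / weight_total

print(predict_next_bit([0, 1, 0, 1, 0, 1, 0, 1]))  # 0.0: every surviving program predicts 0
```

The contrast being claimed above is that a wide network trained by SGD does not pay for each candidate subcircuit separately the way this loop does.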
Thanks for explaining the way to do exhaustive search—a big network can exhaustively search smaller network configurations. I believe that.
However, a CPU is not Turing complete (what is “Turing universal”?); a CPU with an infinite read/write tape is Turing complete. This matters, because Solomonoff induction is a mixture of Turing machines. There are simple functions transformers can’t learn, such as “print the binary representation of the input + 1”; they run out of room. Solomonoff induction is not limited in this way.
Practical transformers are also usually (always?) used with exchangeable sequences, while Solomonoff inductors operate on general sequences. I can imagine ways around this (use an RNN and many epochs with a single sequence), so maybe it’s not a fundamental limit, but it’s still a big difference between neural nets in practice and Solomonoff inductors.
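To spell out the “print the binary representation of the input + 1” example above: incrementing can make the output one digit longer than the input, so an input that already fills a fixed context window leaves no room for the answer, whereas a machine with an unbounded tape has no such ceiling. A trivial illustration:

```python
def increment_binary(s):
    """Binary representation of int(s, 2) + 1; may be one digit longer than the input."""
    return bin(int(s, 2) + 1)[2:]

for s in ["0", "101", "111", "1" * 8]:
    print(s, "->", increment_binary(s))
# A run of n ones maps to a one followed by n zeros (length n + 1), so any model
# with a hard cap on sequence length must fail on sufficiently long all-ones inputs.
```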