Also, are practical neural nets actually close to being universal?
Trivially so, in that they can obviously encode a binary circuit equivalent to a CPU, and also in practice, in the sense that transformers descend from related research (neural Turing machines, memory networks, etc.) and are universal.
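To make the circuit-encoding point concrete, here is a minimal sketch (my illustration, not from the thread): a single threshold unit with weights (-1, -1) and bias 1.5 computes NAND on {0, 1} inputs, and NAND is functionally complete, so any fixed boolean circuit, and hence a CPU's logic, can be wired up from such units.

```python
# Minimal sketch: a threshold "neuron" computes NAND, and NAND is functionally
# complete, so any fixed boolean circuit can be encoded in a feed-forward net.

def nand_neuron(a, b):
    # weights -1, -1 and bias 1.5, followed by a hard threshold
    return 1 if (-1.0 * a - 1.0 * b + 1.5) > 0 else 0

def xor_from_nands(a, b):
    # the standard 4-NAND construction of XOR, i.e. a two-layer circuit of neurons
    n1 = nand_neuron(a, b)
    n2 = nand_neuron(a, n1)
    n3 = nand_neuron(b, n1)
    return nand_neuron(n2, n3)

for a in (0, 1):
    for b in (0, 1):
        assert xor_from_nands(a, b) == (a ^ b)
print("XOR built entirely from NAND neurons is correct on all inputs")
```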
Do you mean that this is possible in principle, or that this is a limit of SGD training?
I mean in the worst case, where you have some function that is actually really hard to learn: as long as you have enough data (or can generate it), big overcomplete NNs with SGD can obviously provide a strict improvement over exhaustive search.
“SGD is known to approximate full Bayesian inference, and the gap between SGD and full inference is known to be small”
Depends on what you mean by “gap”—whether you are measuring inference per unit data or inference per unit compute.
There are clearly scenarios where you can get faster convergence by better using or approximating the higher-order terms, but that is obviously not remotely sufficient to beat SGD, as any such extra complexity must also pay for itself against the cost of compute.
Of course if you are data starved, then that obviously changes things.
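As a toy version of that data-vs-compute trade-off (my own sketch with a made-up quadratic objective, not anything from the thread): on a quadratic loss a single Newton step lands exactly on the minimizer, but it requires solving with the full Hessian, while each gradient step is far cheaper and you simply take many of them. Whether the higher-order machinery pays for itself depends on whether the saved steps outweigh the extra per-step compute.

```python
import numpy as np

# Toy illustration of "higher-order methods must pay for themselves":
# on a quadratic loss, one Newton step is exact but needs an O(d^3) solve,
# while each gradient step here costs only a matrix-vector product.

rng = np.random.default_rng(0)
d = 200
A = rng.normal(size=(d, d))
A = A @ A.T + np.eye(d)            # symmetric positive-definite "Hessian"
b = rng.normal(size=d)
w_star = np.linalg.solve(A, b)     # minimizer of 0.5 w'Aw - b'w

def grad(w):
    return A @ w - b

# One Newton step from the origin: exact, but requires a d x d linear solve.
w_newton = -np.linalg.solve(A, grad(np.zeros(d)))
print("Newton error after 1 step:       ", np.linalg.norm(w_newton - w_star))

# Plain gradient descent: cheap per step, needs many steps when ill-conditioned.
w = np.zeros(d)
lr = 1.0 / np.linalg.eigvalsh(A).max()
for _ in range(500):
    w -= lr * grad(w)
print("Gradient descent after 500 steps:", np.linalg.norm(w - w_star))
```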
they can obviously encode a binary circuit equivalent to a CPU
A CPU by itself is not universal. Are you saying memory-augmented neural networks are practically close to universality?
as long as you have enough data (or can generate it), big overcomplete NNs with SGD can obviously provide a strict improvement over exhaustive search
Sorry, I’m being slow here:
Solomonoff does exhaustive search for any amount of data; is part of your claim that as data → infinity, NN + SGD → Solomonoff?
How do we actually do this improved exhaustive search? Do we know that SGD gets us to a global minimum in the end?
Any useful CPU is, by my definition, Turing universal.
Solomonoff does exhaustive search for any amount of data; is part of your claim that as data → infinity, NN + SGD → Solomonoff?
You can think of Solomonoff as iterating over all programs/circuits by size, evaluating each on all the data, etc.
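A caricature of that enumeration view (my toy, with a deliberately tiny, non-universal program class of "repeat this bit pattern forever"), just to show the structure: enumerate hypotheses by description length, weight them by 2^(-length), discard those inconsistent with the data, and predict by weighted vote.

```python
from itertools import product

# Toy caricature of Solomonoff-style induction. Real Solomonoff induction
# enumerates all programs for a universal machine (and is uncomputable);
# here the "programs" are just bit patterns repeated forever.

def repeats(pattern, n):
    """First n bits of `pattern` repeated forever."""
    return [pattern[i % len(pattern)] for i in range(n)]

def predict_next_bit(observed, max_len=8):
    weight_one = weight_total = 0.0
    for length in range(1, max_len + 1):
        for pattern in product([0, 1], repeat=length):
            if repeats(pattern, len(observed)) != list(observed):
                continue                      # program disagrees with the data
            w = 2.0 ** (-length)              # shorter program => higher prior
            weight_total += w
            weight_one += w * pattern[len(observed) % length]
    return weight_one / weight_total          # posterior P(next bit = 1)

print(predict_next_bit([1, 0, 1, 0, 1, 0]))   # close to 1: repeating "10" dominates
```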
A sufficiently wide NN + SGD can search the full circuit space up to a depth D across the data set in an efficient way, reusing all subcomputations across sparse subcircuit solutions (lottery tickets).
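A rough sketch of the overcomplete-net intuition (my own construction, using fixed random ReLU features rather than a trained first layer): fit a linear readout over a very wide feature layer, then keep only the features with the largest readout weights and refit. The pruned subnetwork typically still fits the toy target well, which is the lottery-ticket flavor of the claim that a big net contains many small subcircuit solutions.

```python
import numpy as np

# Rough sketch of the "wide net contains sparse sub-solutions" intuition:
# a wide layer of fixed random ReLU features with a trained linear readout,
# then pruned to its largest-weight features and refit.

rng = np.random.default_rng(0)
n, width, keep = 512, 2000, 50

x = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(2 * x[:, 0]) + 0.1 * rng.normal(size=n)   # toy regression target

W = rng.normal(size=(1, width))                      # random, untrained first layer
b = rng.uniform(-3, 3, size=width)
features = np.maximum(x @ W + b, 0.0)                # ReLU features, shape (n, width)

# Train only the readout (least squares stands in for SGD here).
readout, *_ = np.linalg.lstsq(features, y, rcond=None)
full_mse = np.mean((features @ readout - y) ** 2)

# "Winning ticket": keep the features with the largest readout weights and refit.
ticket = np.argsort(-np.abs(readout))[:keep]
sparse_readout, *_ = np.linalg.lstsq(features[:, ticket], y, rcond=None)
sparse_mse = np.mean((features[:, ticket] @ sparse_readout - y) ** 2)

print(f"full width-{width} readout MSE:  {full_mse:.4f}")
print(f"pruned width-{keep} readout MSE: {sparse_mse:.4f}")
```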
Thanks for explaining the way to do exhaustive search—a big network can exhaustively search smaller network configurations. I believe that.
However, a CPU is not Turing complete (what is “Turing universal”?); a CPU with an infinite read/write tape is Turing complete. This matters, because Solomonoff induction is a mixture of Turing machines. There are simple functions transformers can’t learn, such as “print the binary representation of the input + 1”; they run out of room. Solomonoff induction is not limited in this way.
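To make the “runs out of room” point concrete (my illustration): binary increment occasionally needs one more digit than its input, so a model with a hard cap on sequence length can only represent it up to that cap, while a machine with an unbounded tape handles every input length.

```python
# Binary increment the way a tape machine would do it: walk from the least
# significant bit, flipping trailing 1s to 0s until a 0 is found.

def increment_binary(bits: str) -> str:
    out = list(bits)
    i = len(out) - 1
    while i >= 0 and out[i] == '1':
        out[i] = '0'
        i -= 1
    if i >= 0:
        out[i] = '1'
        return ''.join(out)
    return '1' + ''.join(out)      # every bit carried: output is one digit longer

print(increment_binary('1011'))    # '1100'  -- same length as the input
print(increment_binary('1111'))    # '10000' -- one digit longer than the input
```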
Practical transformers are also usually (always?) used with exchangeable sequences, while Solomonoff inductors operate on general sequences. I can imagine ways around this (use an RNN and many epochs with a single sequence), so maybe it’s not a fundamental limit, but it’s still a big difference between neural nets in practice and Solomonoff inductors.