Thanks for sharing this; I believe this kind of work is very promising. In particular, it pushes us to be concrete.
I think some of the characteristics you note are necessary side effects of fast parallel computation. Take this canonical constant-depth adder from pg 6 of “Introduction to Circuit Complexity”:
Imagine the same circuit embedded in a neural network. Individual neurons are “merely” simple Boolean operations on the input—no algorithm in sight! More complicated problems often have even messier circuits, with redundant computation or false branches that must be suppressed.
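For concreteness, here is a minimal Python sketch in the spirit of that adder (my own illustration, not the textbook's exact circuit). Each carry bit is written as an OR over "a carry is generated at position j and propagated all the way to position i", so every output bit is a constant-depth formula over the inputs built from unbounded fan-in AND/OR gates; the Python loops below merely evaluate those formulas one gate at a time.

```python
def add_constant_depth(a_bits, b_bits):
    """Add two equal-length bit lists, least significant bit first."""
    n = len(a_bits)
    generate = [a & b for a, b in zip(a_bits, b_bits)]   # a carry is created at this position
    propagate = [a ^ b for a, b in zip(a_bits, b_bits)]  # a carry passes through this position

    # carry[i] = OR over j < i of (generate[j] AND propagate[j+1] AND ... AND propagate[i-1]):
    # one unbounded fan-in AND per j, all feeding a single unbounded fan-in OR.
    carry = [0] * (n + 1)
    for i in range(1, n + 1):
        carry[i] = int(any(
            generate[j] and all(propagate[k] for k in range(j + 1, i))
            for j in range(i)
        ))

    # sum bit i = a_i XOR b_i XOR carry_i; the last element is the carry out
    return [propagate[i] ^ carry[i] for i in range(n)] + [carry[n]]

# 6 + 3 = 9, with bits written least significant first
assert add_constant_depth([0, 1, 1], [1, 1, 0]) == [1, 0, 0, 1]
```

Every output bit depends on the inputs through a shallow, wide formula; nothing in the circuit itself looks like the grade-school "carry the one" procedure.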
Or imagine a typical program for checking legal moves: a few nested for loops. If you actually compile and run such a program, you will see the processor perform a long series of local checks. Without the source code, whatever "algorithmness" remains lies in the order in which those checks are performed. Now imagine porting that same program to a parallel architecture, like a transformer or the brain. With the checks occurring in parallel, there is no obvious systematicity, and the whole thing could be mistaken for a mass of unrelated local computations, the algorithm totally obscured. Even more so if the medium is stochastic.
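To make the nested-loop picture concrete, here is a hypothetical checker of the usual shape (a sketch of my own, not any particular engine's code): for each candidate square it walks outward in each of the eight directions, performing one small local check per square visited.

```python
# 0 = empty, 1 = current player, -1 = opponent; board is an 8x8 list of lists.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]

def is_legal(board, row, col, player):
    """A move is legal iff the square is empty and it flanks at least one
    contiguous run of opponent discs in some direction."""
    if board[row][col] != 0:
        return False
    for dr, dc in DIRECTIONS:                 # try each direction in turn...
        r, c = row + dr, col + dc
        seen_opponent = False
        while 0 <= r < 8 and 0 <= c < 8:      # ...walking square by square
            if board[r][c] == -player:
                seen_opponent = True          # keep walking over opponent discs
            elif board[r][c] == player:
                if seen_opponent:
                    return True               # flanked run found
                break
            else:
                break                         # empty square: no flank this way
            r, c = r + dr, c + dc
    return False

def legal_moves(board, player):
    return [(r, c) for r in range(8) for c in range(8) if is_legal(board, r, c, player)]
```

Run on a CPU, this is a long stream of tiny comparisons; the structure lives almost entirely in the loop order.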
Actually, this relates to a problem in circuit complexity: the notion of an algorithm itself becomes more subtle. Since circuits are of fixed, finite size, they can only accept one size of input. So, in order to solve a task at every input size n (as Turing machines do), a different circuit is required for each n. But then what's to keep the circuits in the family from being wildly different from one another? How could such an infinitely varied family represent a crisp, succinct algorithm? Worse, this arbitrariness lends a sort of unchecked power, for instance the ability to solve the halting problem! (e.g. by hard-coding the answer for every input of each length n). To tame this, the notion of "uniform complexity" is introduced: a family of circuits is uniform iff the circuits themselves can be generated by a single (typically resource-bounded) Turing machine. So an "algorithm" becomes a finite description of a fast parallel circuit for every input size.
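To make "generated by a Turing machine" concrete, here is a toy illustration (a loose sketch only; the standard definitions also bound the generator's resources, e.g. to logspace or DLOGTIME): a single short program that, given n, emits the gate list of an n-bit adder like the one sketched above. The fixed program is the finite object; each circuit in the infinite family is just its output on a different n.

```python
def generate_adder_circuit(n):
    """Yield the gates of an n-bit adder as (output_wire, gate_type, input_wires) triples."""
    for i in range(n):
        yield (f"g{i}", "AND", [f"a{i}", f"b{i}"])     # generate bits
        yield (f"p{i}", "XOR", [f"a{i}", f"b{i}"])     # propagate bits
    yield ("c0", "CONST0", [])                         # no carry into position 0
    for i in range(1, n + 1):
        terms = []
        for j in range(i):
            wire = f"c{i}_from_{j}"
            # unbounded fan-in AND: carry generated at j and propagated up through i-1
            yield (wire, "AND", [f"g{j}"] + [f"p{k}" for k in range(j + 1, i)])
            terms.append(wire)
        yield (f"c{i}", "OR", terms)                   # unbounded fan-in OR
    for i in range(n):
        yield (f"s{i}", "XOR", [f"p{i}", f"c{i}"])     # sum bits

# The same few lines describe a different, ever-larger circuit for every n:
# for gate in generate_adder_circuit(4): print(gate)
```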
This seems like it may apply directly to the case of OthelloGPT. If the list of "heuristics" permits a short generating description, they may indeed represent a crisp algorithm, only implemented in parallel. It would be very interesting if gradient descent preferred such solutions, since the generating rule is only ever implicit. If gradient descent had such a preference, it could substitute directly for a more traditional notion of algorithm in many arguments (e.g. about generalization/reasoning/world modeling). Really, any idea of which sorts of circuits gradient descent prefers in this case could help illuminate how "deep learning works" at all.
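Purely as an illustration of what a "short generating description" could look like (no claim about OthelloGPT's actual internals): the 8 × 8 × 8 = 512 per-square, per-direction legality checks implicit in the nested-loop checker above can all be emitted by a few lines of code. Evaluated in isolation they look like unrelated local rules; the loop that produces them is the crisp shared description.

```python
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1),           (0, 1),
              (1, -1),  (1, 0),  (1, 1)]

def flanks(board, player, r, c, dr, dc):
    """True if a disc placed at (r, c) would flank a run of opponent discs along (dr, dc)."""
    rr, cc, seen_opponent = r + dr, c + dc, False
    while 0 <= rr < 8 and 0 <= cc < 8:
        if board[rr][cc] == -player:
            seen_opponent = True
        elif board[rr][cc] == player:
            return seen_opponent
        else:
            return False
        rr, cc = rr + dr, cc + dc
    return False

# One tiny "heuristic" per (square, direction): 512 local rules, one short generating loop.
HEURISTICS = {
    (r, c, dr, dc): (lambda board, player, r=r, c=c, dr=dr, dc=dc:
                     board[r][c] == 0 and flanks(board, player, r, c, dr, dc))
    for r in range(8) for c in range(8) for dr, dc in DIRECTIONS
}
```

Whether anything like this generating structure is what gradient descent actually finds is exactly the empirical question.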