Thanks for sharing this; I think this kind of work is very promising. In particular, it pushes us to be concrete.
I think some of the characteristics you note are necessary side effects of fast parallel computation. Take this canonical constant-depth adder from pg 6 of “Introduction to Circuit Complexity”:
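Standing in for the figure, here's a minimal Python sketch of the standard carry-lookahead idea behind such constant-depth adders (my own rendering, so the book's exact gate layout may differ): every carry bit is one flat OR-of-ANDs over the inputs, so with unbounded fan-in gates the whole thing has constant depth no matter how many bits you add.

```python
# A carry-lookahead adder written so that every output is a flat Boolean formula
# over the inputs (constant depth with unbounded fan-in gates), rather than a
# sequential ripple of carries.

def constant_depth_add(a_bits, b_bits):
    """Add two equal-length bit lists, least-significant bit first."""
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]   # position i "generates" a carry
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   # position i "propagates" a carry

    # Carry into position i: OR over j < i of (g[j] AND all p[k] for j < k < i).
    # In the circuit this is one big OR of big ANDs per carry -- depth 2 --
    # even though this Python loop happens to evaluate it sequentially.
    c = [0] * (n + 1)
    for i in range(1, n + 1):
        c[i] = int(any(g[j] and all(p[k] for k in range(j + 1, i))
                       for j in range(i)))

    sum_bits = [p[i] ^ c[i] for i in range(n)]
    return sum_bits + [c[n]]                      # final carry-out

# Sanity check: 6 + 3 = 9 (bits are least-significant first).
assert constant_depth_add([0, 1, 1, 0], [1, 1, 0, 0]) == [1, 0, 0, 1, 0]
```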
Imagine the same circuit embedded in a neural network. Individual neurons are “merely” simple Boolean operations on the input—no algorithm in sight! More complicated problems often have even messier circuits, with redundant computation or false branches that must be suppressed.
Or imagine a typical program for checking legal moves, say, some nested for loops. If you actually compile and run such a program, you will see the processor perform a long series of local checks. Without the source code, whatever “algorithm-ness” remains lives in the order those checks are performed. Now imagine porting that same program to a parallel architecture, like a transformer or the brain. With the checks occurring in parallel there’s no obvious systematicity, and the whole thing could be mistaken for a mass of unrelated local computations, the algorithm totally obscured. Even more so if the medium is stochastic.
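For concreteness, here's a hedged sketch of the kind of nested-loop checker I mean (Othello-style placement legality, my own toy code rather than anything from the post). Run sequentially it looks like a long series of local probes, yet each probe depends only on the board, not on any other probe having run first:

```python
# Othello-style placement legality via nested loops (toy code): a move is legal
# on an empty square if, in some direction, it flanks a contiguous run of
# opponent pieces capped by one of our own pieces.

EMPTY, ME, OPPONENT = 0, 1, 2
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def is_legal_move(board, row, col):
    """board is an 8x8 list of lists of EMPTY/ME/OPPONENT values."""
    if board[row][col] != EMPTY:
        return False
    for dr, dc in DIRECTIONS:                  # outer loop: the 8 ray directions
        r, c = row + dr, col + dc
        seen_opponent = False
        while 0 <= r < 8 and 0 <= c < 8:       # inner loop: walk along the ray
            if board[r][c] == OPPONENT:
                seen_opponent = True           # keep walking over opponent pieces
            elif board[r][c] == ME:
                if seen_opponent:              # flanked at least one opponent piece
                    return True
                break
            else:                              # empty square: this ray fails
                break
            r, c = r + dr, c + dc
    return False
```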
Actually this relates to a problem in circuit complexity—the notion of algorithm itself becomes more subtle. Since a circuit has a fixed, finite size, it can only accept inputs of one size. So, in order to solve a task at every input size (as a Turing machine does), a different circuit is required for each n. But then what’s to keep each circuit in the family from being wildly different from the others? How could such an infinitely varied family represent a crisp, succinct algorithm? Besides, this arbitrariness lends a sort of unchecked power, for instance the ability to solve the halting problem! (e.g. by hard-coding the answer for every input of each length n). To tame this, the notion of “uniformity” is introduced: a circuit family is uniform iff the circuits themselves can be generated by a single Turing machine. So an “algorithm” becomes a finite description of a fast parallel circuit for every size.
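As a toy illustration of uniformity (my own encoding of circuits as gate lists, not any standard format): the single short generator below describes an entire infinite family of circuits, one per input size, which is exactly what rules out the wildly different, hard-coded circuits above.

```python
# One finite program that emits a circuit for every input size n -- the sense in
# which an infinite circuit family can still be a single, crisp "algorithm".
# Circuits are encoded as plain gate lists (a made-up format for illustration).

def generate_parity_circuit(n):
    """Emit gates computing the XOR of inputs x0..x{n-1}.
    Each gate is (output_wire, op, input_wires). For brevity this is a linear
    chain; a genuinely fast parallel family would use a balanced log-depth tree,
    but the uniformity point -- one short program, every size -- is identical."""
    gates = []
    acc = "x0"
    for i in range(1, n):
        out = f"g{i}"
        gates.append((out, "XOR", (acc, f"x{i}")))
        acc = out
    return gates, acc   # (gate list, name of the output wire)

print(generate_parity_circuit(3))
# ([('g1', 'XOR', ('x0', 'x1')), ('g2', 'XOR', ('g1', 'x2'))], 'g2')
```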
This seems like it may apply directly to the case of OthelloGPT. If the list of “heuristics” permits a short generating description, they may indeed represent a crisp algorithm—only implemented in parallel. It would be very interesting if gradient descent preferred such solutions, since the generating rule is only ever implicit. If gradient descent had such a preference, it could substitute directly for a more traditional notion of algorithm in many arguments (e.g. about generalization/reasoning/world modeling). Really, any idea of which sorts of circuits gradient descent prefers in this case could help illuminate how “deep learning works” at all.
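To connect that to the legal-move checker sketched earlier, here's what a “short generating description” could look like (again my own toy code, not anything from the post or the OthelloGPT work): unrolling the nested loops into a flat list of independent local pattern checks, a pile that could easily read as unrelated heuristics if all you ever saw was the unrolled list.

```python
# Unrolling the nested-loop legality check into a flat list of local pattern
# checks, all produced by one short rule (toy code). Seen only as the unrolled
# list, the checks look like a heap of unrelated board heuristics; seen as the
# generator, they are one compact algorithm.

from itertools import product

DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
              (0, 1), (1, -1), (1, 0), (1, 1)]

def unrolled_heuristics():
    """Return triples (move_square, opponent_run, own_square): the move is legal
    if, for at least one triple, every square in opponent_run holds an opponent
    piece and own_square holds one of ours."""
    checks = []
    for (row, col), (dr, dc) in product(product(range(8), repeat=2), DIRECTIONS):
        for run in range(1, 7):                # length of the flanked opponent run
            opp_run = [(row + k * dr, col + k * dc) for k in range(1, run + 1)]
            own_square = (row + (run + 1) * dr, col + (run + 1) * dc)
            if all(0 <= r < 8 and 0 <= c < 8 for r, c in opp_run + [own_square]):
                checks.append(((row, col), tuple(opp_run), own_square))
    return checks

print(len(unrolled_heuristics()))   # about a thousand local checks from ~15 lines
```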
I think the best argument for a “fastest-er” timeline would be that several of your bottlenecks end up substituting heavily for one another or for some common factor. A researcher in NLP in 2015 might reasonably have guessed it would take decades to reach the level of ChatGPT—after all, that would require breakthroughs in parsing, entailment, word sense, semantics, world knowledge… In reality these capabilities were all blessings of scale.
o1 may or may not be the central breakthrough in this scenario, but I can paint a world where it is, and that world is very fast indeed. RL, like next-word prediction, is “mechanism agnostic”—it induces whatever capabilities are instrumental to maximizing reward. Already, “back-tracking” or “error-correcting” behavior has emerged, something LeCun and others previously cited as a fundamental blocker. In this world, RL applied to “Chains of Action” strongly induces agentic capabilities like adaptability, tool use, note caching, goal-directedness, and coherence. Gradient descent successfully routes around any weaknesses in LLMs (as I suspect it does for weaknesses in our LLM training pipelines today). By the time post-training has absorbed the same engineering effort we’ve dedicated to pretraining, it’s AGI. Well, given the scale of OpenAI’s investment, we should be able to rule on this scenario pretty quickly.
I’d say I’m skeptical of the specifics (RL hasn’t demonstrated this kind of rich success in the past) but more uncertain about the outline (how well could, e.g., fast adaptation trade off against multimodality, gullibility, or coherence?).