I found this story tough to follow on a technical level, despite being familiar with most of the ideas it cites (and having read many of the papers before).
Like, I’ve read and re-read the first few sections a number of times, and I still can’t come up with a mental model of HXU’s structure that fits all of the described facts. By “HXU’s structure” I mean things like:
The researcher is running an “evolutionary search in auto-ML” method. How many nested layers of inner/outer loop does this method (explicitly) contain?
Where in the nested structure are (1) the evolutionary search, and (2) the thing that outputs “binary blobs”?
Are the “binary blobs” being run like Meta RNNs, ie they run sequentially in multiple environments?
I assume the answer is yes, because this would explain what it is that (in the 1 Day section) remembers a “history of observation of lots of random environments & datasets.”
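For concreteness, here is the kind of structure I have in mind when I say "run like Meta RNNs" — a minimal toy sketch (all names and details here are my own guesses, not anything from the story): a single rollout carries its recurrent state across a *sequence* of environments without resetting it, which is exactly what would let it accumulate a cross-task "history of observations."

```python
# Hedged sketch of the "Meta-RNN" reading of the blobs. Purely
# illustrative; the story does not specify this interface.

def meta_rnn_rollout(step_fn, init_state, environments):
    """step_fn: (state, observation) -> (new_state, action).

    The recurrent state is deliberately NOT reset between environments,
    so a single rollout can remember observations spanning many tasks.
    """
    state = init_state
    total_reward = 0.0
    for env in environments:
        obs = env.reset()
        done = False
        while not done:
            state, action = step_fn(state, obs)
            obs, reward, done = env.step(action)
            total_reward += reward
    return total_reward  # fitness signal for whatever outer search exists
```

Under this reading, "fitness" is the cumulative reward of one such multi-environment rollout, and the outer search (whatever it is) optimizes over whatever parameterizes `step_fn`.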
What is the type signature of the thing-that-outputs-binary-blobs? What is its input? A task, a task mixture, something else?
Much of the story (eg the “history of observations” passage) makes it sound like we’re watching a single Meta-RNN-ish thing whose trajectories span multiple environment/tasks.
If this Meta-RNN-ish thing is “a blob,” what role is left for the thing-that-outputs-blobs?
That is: in that case, the thing-that-outputs-blobs just looks like fn()→blob. It’s simply a constant, we can eliminate it from the description, and we’re really just doing optimization over blobs. Presumably that’s not the case, so what is going on here?
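To make the reduction explicit: if the generator is nullary, the search "over generators" is extensionally the same thing as this — a toy (1+1) evolutionary search directly over bit-string blobs, mutating by flipping bits. This is my own minimal formalization of that reading; nothing in it comes from the story.

```python
# Toy of "just doing optimization over blobs": (1+1) ES over raw
# bit-strings. A mutant replaces the incumbent iff it scores no worse.

import random

def flip_one_bit(blob, rng):
    """Mutate a tuple of 0/1 bits by flipping one position."""
    i = rng.randrange(len(blob))
    return blob[:i] + (1 - blob[i],) + blob[i + 1:]

def es_over_blobs(fitness, length, steps, seed=0):
    rng = random.Random(seed)
    best = tuple(rng.randrange(2) for _ in range(length))
    for _ in range(steps):
        cand = flip_one_bit(best, rng)
        if fitness(cand) >= fitness(best):
            best = cand
    return best

# e.g. with fitness = sum (count of 1-bits), the search hill-climbs
# toward the all-ones blob.
```

Note there is no generator anywhere in this picture — which is my point: unless the thing-that-outputs-blobs takes a nontrivial input, it adds nothing to the description.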
What is it that’s made of “GPU primitives”?
If the blobs (bytecode?) are being viewed as raw binary sequences and we’re flipping their bits, that’s a lower level than GPU primitives.
If instead the thing-that-outputs-blobs is made of GPU primitives which something else is optimizing over, what is that “something else”?
Is the outermost training loop (the explicitly implemented one) using evolutionary search, or (explicit) gradient descent?
If gradient descent: then what part of the system is using evolutionary search?
If evolutionary search (ES): then how does the outermost loop have a critical batch size? Is the idea that ES exhibits a trend like eqn. 2.11 in the OA paper, w/r/t population size or something, even though it’s not estimating noisy gradients? Is this true? (It could be true, and doesn’t matter for the story... but since it doesn’t matter for the story, I don’t know why we’d bother to assume it.)
Also, if evolutionary search (ES): how is this an extrapolation of 2022 ML trends? Current ML is all about finding ways to make things differentiable, and then do GD, which Works™. (And which can be targeted specially by hardware development. And which is assumed by all the ML scaling laws. Etc.) Why are people in 20XX using the “stupidest” optimization process out there, instead?
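On the critical-batch-size question above: the one concrete version of ES where a batch-size analogy would be natural is the OpenAI-style NES update, where the "gradient" is an average over a population of random perturbations, so population size plays roughly the role batch size plays in SGD. Here is a toy sketch of that estimator (my own illustration; the story does not say it uses this):

```python
# NES-style gradient estimate: g_hat = (1/(n*sigma)) * sum_k f(theta +
# sigma*eps_k) * eps_k. Averaging over a larger population reduces the
# variance of this estimate, analogously to a larger SGD batch.

import random

def nes_gradient(f, theta, sigma=0.1, population=50, seed=0):
    rng = random.Random(seed)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(population):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = f([t + sigma * e for t, e in zip(theta, eps)])
        for i in range(dim):
            grad[i] += score * eps[i]
    return [g / (population * sigma) for g in grad]
```

But note this variant *is* estimating a noisy gradient, which is not what the story seems to describe; for population-based search over discrete blobs with no gradient estimate, I don’t know what a critical batch size would even mean.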
In all of this, which parts are “doing work” to motivate events in the story?
Is there anything in “1 Day” onward that wouldn’t happen in a mere ginormous GPT / MuZero / whatever, but instead requires this exotic hybrid method?
(If the answer is “yes,” then that sounds like an interesting implicit claim about what currently popular methods can’t do...)
Since I can’t answer these questions in a way that makes sense, I also don’t know how to read the various lines that describe “HXU” doing something, or attribute mental states to “HXU.”
For instance, the thing in “1 Day” that has a world model—is this a single rollout of the Meta-RNN-ish thing, which developed its world model as it chewed its way along a task sequence? In which case, the world model(s) are being continually discarded (!) at the end of every such rollout and then built anew from scratch in the next one? Are we doing the search problem of finding-a-world-model inside of a second search problem?
Where the outer search is (maybe?) happening through ES, which is stupid and needs gajillions of inner rollouts to get anywhere, even on trivial problems?
If the smart-thing-that-copies-itself called “HXU” is a single such rollout, and the 20XX computers can afford gajillions of such rollouts, then what are the slightly less meta 20XX models like, and why haven’t they already eaten the world?
(Less important, but still jumped out at me: in “1 Day,” why is HXU doing “grokking” [i.e. overfitting before the phase transition], as opposed to some other kind of discontinuous capability gain that doesn’t involve overfitting? Like, sure, I suppose it could be grokking here, but this is another one of those paper references that doesn’t seem to be “doing work” to motivate story events.)
I dunno, maybe I’m reading the whole thing more closely or literally than it’s intended? But I imagine you intend the ML references to be taken somewhat more “closely” than the namedrops in your average SF novel, given the prefatory material:
“grounded in contemporary ML scaling, self-supervised learning, reinforcement learning, and meta-learning research literature”
And I’m not alleging that it is “just namedropping like your average SF novel.” I’m taking the references seriously. But, when I try to view the references as load-bearing pieces in a structure, I can’t make out what that structure is supposed to be.
Relatedly, the story does the gish-gallop thing where many of the links do not actually support the claim they are called on to support. For example, in “learning implicit tree search à la MuZero”, the link to MuZero does not support the claim that MuZero learns implicit tree search. (Originally the link directed to the MuZero paper, which definitely does not do implicit tree search, since it has explicit tree search hard-coded in; now the link goes to gwern’s page on MuZero, which is a collection of many papers, and it is unclear which of them is about learning to do implicit tree search. Note that, as far as I know, every Go program that can beat humans has tree search explicitly built in, so implicit tree search is not really a thing.)
I don’t agree with your read of the MuZero paper.

The training routine of MuZero (and AlphaZero, etc.) uses explicit tree search as a source of better policies than the one the model currently spits out, and the model is adapted to output these better policies.
The model is trying to predict the output of the explicit tree search. There’s room to argue over whether or not it “learns implicit tree search” (ie learns to actually “run a search” internally in some sense), but certainly the possibility is not precluded by the presence of the explicit search; the only reason the explicit search is there at all is to give the model a signal about what it should aspire to do without explicit search.
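Schematically, that training signal looks like the following (my own sketch of the AlphaZero/MuZero-style policy target, not their actual code): the network’s policy is regressed, via cross-entropy, onto the improved policy that the explicit tree search produces, i.e. the normalized visit counts.

```python
# Hedged sketch of policy distillation from explicit search. The target
# distribution comes from MCTS visit counts; minimizing the cross-
# entropy pushes the net to imitate the search's output in one pass.

import math

def policy_distillation_loss(net_logits, mcts_visit_counts):
    """Cross-entropy between the search-derived policy and the net's policy."""
    # Target: visit-count frequencies from the explicit tree search.
    total = sum(mcts_visit_counts)
    target = [v / total for v in mcts_visit_counts]
    # Network policy: softmax over its logits (max-subtracted for stability).
    m = max(net_logits)
    exps = [math.exp(l - m) for l in net_logits]
    z = sum(exps)
    log_probs = [math.log(e / z) for e in exps]
    return -sum(t * lp for t, lp in zip(target, log_probs))
```

The explicit search is present only to generate this target; whether the net ends up internally "running a search" to match it is exactly the open question.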
It’s also true that, when the trained models are run in practice, they are usually run with explicit search on top, and this improves their performance. This does not mean they haven’t learned implicit search—only that a single forward pass of the model cannot do as well as a search guided by many forward passes of the same model, which is not a surprising outcome for any model (even models which do some kind of search inside each forward pass).
You’re at most making the claim that MuZero attempts to learn tree search. Does the MuZero paper provide any evidence that MuZero in fact does implicit tree search? I think not, which means it’s still misleading to link to that paper while claiming it shows neural nets can learn implicit tree search. (I don’t particularly doubt they can learn it a bit, but I do contest the implication that MuZero does so to any substantial degree, or that a non-negligible part of its strength comes from learning implicit tree search.)
Edit: I should clarify what would change my mind here. If someone could show that MuZero (or any scaled-up variant of it) can beat humans at Go with the neural-net model alone (without the explicit tree search on top), I would change my mind. To my knowledge, no paper currently claims this, but let me know if I am wrong. Since my understanding is that the neural nets alone cannot beat humans, my interpretation is that the neural-net part provides something like roughly human-level “intuition” about what the right move should be, but without any actual search; humans can still outperform this intuition machine by doing explicit search, but once you add the tree search on top, the machines crush humans due to their speed.