I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
Steven Byrnes
I’ll reiterate what I wrote before: “No matter how many tokens are appended to the end of the CoT, you still have the issue that, each time you do a new forward pass, the LLM looks at its context window (textbooks + CoT scratchpad) ‘with fresh eyes’, and what it sees is a bunch of unintelligible gobbledygook that it has only the duration of one forward pass to make sense of.”
Probably the linear algebra textbooks in the context window already say that “you can think of a matrix as a linear transformation… <more explanation>”, right?
And this points to a key idea: The CoT-so-far in the context window is not a fundamentally different kind of thing from the textbooks in the context window. It’s just more tokens.
So we can consider the “textbooks + CoT-so-far” as a kind of “extended textbook”. And the LLM has one forward pass to read that “extended textbook” and then output a useful token. And that token will probably not be useful if the LLM does not understand (the relevant part of) linear algebra.
Granted, some textbooks are better than other textbooks. But I don’t think there exists any linear algebra textbook (or “extended textbook”) that gets around the “understanding linear algebra requires more serial steps than there are in a forward pass” problem (i.e., you can’t understand eigenvectors without first understanding matrix multiplication etc.). So CoT doesn’t help. A CoT-in-progress is just a different possible context window. And my claim is that there is no possible context window that can explain eigenvectors within a single forward pass to an LLM that has never seen any linear algebra.
(Again, this is a very different situation from a human writing down notes.)
You’re right, thanks, I have now edited that paragraph to also talk about how Thought Assessors might fit in.
a big reason why human brain algorithms are so efficient is that evolution did a lot of precomputation beforehand, so a large part of the learning is already done, pre-computed in the genome
I disagree for reasons discussed here.
My understanding (you can correct me) is that information can never travel from later layers to earlier layers, e.g. information cannot travel from token location 12 at layer 7 to token location 72 at layer 4. Right? So that means:
layer 1 can use information in the weights + raw token values
layer 2 can use information in the weights + stuff that was figured out in (earlier & current token positions of) layer 1
layer 3 can use information in the weights + stuff that was figured out in (earlier & current token positions of) layers 1 & 2
Etc. Right?
This is the sense in which I was saying that the linear algebra textbook is gobbledygook. Layer 1 starts from scratch, then layer 2 has to build on only layer 1, etc.
It’s true that different token-positions in layer 1 can be figuring out multiple things in parallel. But I claim that some things really need to be understood serially. I don’t expect any part of the architecture to be able to make meaningful progress towards understanding eigenvectors, if it doesn’t ALREADY know something about matrices, and matrix multiplication, etc., from previous layers.
So I claim the number of layers imposes a bottleneck on serial steps, and that this is a meaningful bottleneck on parsing interrelated concepts that are not in the weights, such as linear algebra in this thought-experiment.
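To make the constraint concrete, here's a toy sketch in pure Python. The "attention" here is just a masked average (an assumed simplification; the masking pattern is the point, not the arithmetic): each layer's output at a position can only depend on that layer's inputs at the same or earlier positions, and each layer can only read the previous layer's outputs.

```python
# Toy illustration of causal information flow in a transformer-like stack.
# "Attention" is replaced by a masked average for simplicity.

def causal_layer(states):
    """Each position's new state depends only on positions <= its own."""
    out = []
    for t in range(len(states)):
        visible = states[: t + 1]          # causal mask: no access to later positions
        out.append(sum(visible) / len(visible))
    return out

def forward(tokens, n_layers):
    states = list(tokens)
    for _ in range(n_layers):
        states = causal_layer(states)      # layer L reads only layer L-1's outputs
    return states

a = forward([1.0, 2.0, 3.0, 4.0], n_layers=3)
b = forward([1.0, 2.0, 3.0, 9.0], n_layers=3)  # perturb only the last token

# Earlier positions are unaffected: information never flows backward,
# and the serial depth of computation is capped by n_layers.
assert a[:3] == b[:3]
```

The loop over `n_layers` is the bottleneck being claimed: however many positions compute in parallel, the number of strictly serial "build on what the previous layer figured out" steps is fixed by the layer count.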
How does that relate to what you wrote?
Hmm, good point, I guess I was a bit sloppy in jumping around between a couple different things that I believe, instead of keeping the argument more tight and precise.
One thing I believe is: “LLMs are predominantly powered by imitation learning”. I didn’t argue for that in the post, but my argument would be basically this comment + one more paper along the same lines + “Most Algorithmic Progress is Data Progress” (+ further discussion in The nature of LLM algorithmic progress §1.4). I don’t feel super-duper strongly and am not defining what “predominantly” means here in any case.
Another thing I believe is: “You can’t imitation-learn how to continual-learn”. This is independent of how to think about LLM post-training. I regularly come across people who disagree, on a conceptual level, with this claim, so it seemed worth sharing. Indeed, I now know that there’s a whole little subfield of “algorithm distillation” and “in-context RL”, and my claim (having now read three such papers, see other comments on this page) is that this whole subfield is a dumpster fire where the big idea doesn’t really work but people keep publishing misleading importance-hacked papers anyway.
Another thing I believe is: “You can’t meta-learn how to continual-learn”, which is a more general claim because it includes RLVR. This stronger claim is actually what follows from the boldface sentence in the post: “The only practical way to know what happens after millions of steps of some scaled-up continual learning algorithm is to actually do millions of steps of that same scaled-up continual learning algorithm, with actual weights getting actually changed in specifically-designed ways via PyTorch code. And then that’s the scaled-up learning algorithm you’re running. Which means you’re not doing [some different learning algorithm].”
Another thing I believe is: “If you put lots of interrelated complex concepts, none of which appear anywhere in the pretraining data, into the context window, then LLMs would crash and burn; rather, the only way for an LLM to fluently use a set of concepts is for all (or at least almost all) of those concepts to be in the weights, not the context window, because they were already used properly a bunch of times in the training data.” I allude to that in the post and elaborate on it in this other comment. This implies that context windows and scratchpads cannot substitute for weight-updates, and that a “country of geniuses in a datacenter” (who would presumably be inventing entirely new fields of science etc.) cannot consist of LLMs with very long context windows in a sealed box for the equivalent of 100 years with no human intervention.
Another thing I believe is: there’s no way to close the loop such that a closed system of LLMs can come up with new useful concepts and get those concepts into their own weights, e.g. open-ended self-distillation setups won’t work on LLMs. But that’s definitely off-topic for this OP. Self-distillation setups would be a bona fide continual learning algorithm, by the standards of this OP, e.g. there’s PyTorch code for continual weight updates. Whether that setup would actually work in practice, and how far it would go, are a different question.
So the main points of this OP are basically 2, 3, and 4, which are all pretty related. Plus the stuff about how to think about continual learning in general.
You’re treating the Michael Druggan quote at the very end as obviously terrible, whereas I see it as obviously sensible. I’m confused. Maybe I’m missing some context? Are you reading in a subtext that Druggan wants the superintelligences to exist, instead of conditioning on the superintelligences existing and talking about what that world would or should be like?
If we assume that the Singularity has happened, and that radical superintelligence exists, and (for the sake of argument) that humans still exist too, … then your position is that humans should still be making consequential decisions about post-Singularity economic policy, legal frameworks, etc.? Really?
Hmm, thinking about it more, I can imagine good-seeming futures in which e.g. there’s a Singleton ASI which enforces hard boundaries (especially against creating other ASIs), but allows lots of human agency within those boundaries (cf. Long Reflection, or Archipelago, or Nanny AI, or Fun Theory Utopia, etc.). But I wouldn’t exactly call that “humans remain in control”. Or at least, it’s not a central example of that. What other options are there, assuming ASI exists at all?
In any multipolar ASI scenario, the economy and world would presumably be changing at ASI speed, and having excruciatingly slow humans “in control” seems unworkable.
For example the IAEA has heavily curtailed research into how to build nuclear weapons more cheaply and efficiently, which seems like it applies pretty straightforwardly to algorithmic progress.
IIUC, it’s legal everywhere on Earth to do basic research that might eventually lead to a new, much more inexpensive and hard-to-monitor method to enrich uranium to weapons grade.
I’m thinking mainly of laser isotope enrichment, which was first explored in the 1970s. No super-inexpensive method has turned up, thankfully. (The best-known approach seems to be in the same ballpark as gas centrifuges in terms of cost, specialty parts etc., or if anything somewhat worse. Definitely not radically simpler and cheaper.) But I think there’s a big space of possible techniques, and meanwhile people in academia keep inventing new types of lasers and new optical excitation and separation paradigms. I don’t think there’s any general impossibility proof that kg-scale uranium enrichment in a random basement with only widely-available parts can’t ever get invented someday by this line of research.
(If it did, it probably wouldn’t be the death of nonproliferation because you can still try to monitor and control the un-enriched uranium. But it would still make nonproliferation substantially harder. By the way, once you have a lot of weapons grade uranium, making a nuclear bomb is trivial. The fancy implosion design is only needed for plutonium bombs not uranium.)
AFAICT, if someone is explicitly developing a system for “kg-scale uranium enrichment via laser isotope separation”, then the authorities will definitely go talk to them. But at every step prior to that last stage, where you’re doing “basic R&D”, building new types of lasers, etc., my impression is that people can freely do whatever they want, and publish it, and nobody will ask questions. I mean, it’s possible that there’s someone in some secret agency who is on the ball, five steps ahead of everyone in academia and industry, and they know where problems might arise on the future tech tree and are ready to quietly twist arms if necessary. But I dunno man, that seems pretty over-optimistic, especially when the research can happen in any country.
My former PhD advisor wrote a book in the 1980s with a whole chapter on laser isotope separation techniques, and directions for future research. The chapter treats it as completely unproblematic! Not even one word about why this might be bad. I remember feeling super weirded out by that when I read it (15 years ago), but I figured, maybe I’m the crazy one? So I never asked him about it.
(Low confidence on all this.)
Whoops, the wikipedia article was deleted a few months ago.
I meant “kinda the same idea” in the sense that, at the end of the day, a similar problem is being solved by the communicative signal. I agree that there’s a sign-flip.
Anyway, I’ll reword, thanks.
Sure, if you have an RNN (e.g. SSM) with a (say) billion-dimensional hidden state, then in principle the hidden state could imitate the billion weights of some other entirely different learning algorithm, and the RNN propagation steps could imitate the weight-update steps (e.g. gradient descent or TD learning or whatever) of that other learning algorithm, along with the querying-the-model steps, the replay-learning steps, and/or whatever else is involved.
But I have a rather strong belief that this would never happen, in real life, in any practical, AGI-relevant sense. Even if such an RNN update step exists in principle, I think it would not be learnable in practice, nor runnable without many orders of magnitude of performance overhead. I won’t get into details here, but this old discussion of mine is vaguely related.
some evidence of this: https://arxiv.org/abs/2506.13892
I’m sorry, but the more I read about “algorithm distillation”, the more I want to treat that term as a giant red flag that the paper is probably garbage. I cited this example in the post (which is I think overly diplomatic), and for a second one see my discussion thread with glazgogabgolab on this page.
Basically, nobody in that subfield seems to be carefully distinguishing “learning object-level things from the teacher” versus “learning how to learn from the teacher”. The second is exciting, the first is boring.
As far as I can tell, “in-context reinforcement learning” has never been demonstrated to exist at all, at least in the sense that matters. I.e., real RL algorithms can figure out how to do complicated new things that they’ve never seen demonstrated, whereas the so-called “ICRL” models seem to only be capable of doing things very similar to what they’ve seen the teacher do in their context window.
…And this paper does not change my mind on that. For example, in figure 1, none of the four learning curves shows the student doing better than it saw the teacher do within its context window.
Even outside of that graph, I really think that if the ICRL agent was using some innovative clever strategy that the teacher never used, the way actual RL algorithms do every day, then the authors would have noticed that, and been very excited by it, and centered their whole paper around it, all the way up to the title. The fact that they don’t mention anything like that is I think a strong sign that it didn’t happen.
I think it’s important that the AI doesn’t need to do all of its continual learning in the activations of a single forward pass: It can autoregressively generate tokens too. This way it can leave notes to itself, like “next time I run into a bug like this, I should look in this place first.”
I don’t think that’s adequate for (what I was calling) “real” continual learning. There’s a trivial sense in which an LLM can do anything via a context window because it can e.g. emulate a Turing Machine without understanding what the Turing Machine is doing. But that’s not realistic (nor alignment-relevant). Realistically, I claim LLM “understanding” has to be in the weights, not the context window.
Here’s a thought experiment I often bring up: imagine training an LLM purely on data before linear algebra existed (or equivalently, train a new LLM from scratch while carefully filtering out anything related to or downstream of linear algebra from the training data). Then put a linear algebra textbook (or many textbooks) in the context window.
My question is: can the LLM answer tricky questions that are not directly in those textbooks, to build on those linear algebra ideas and make further progress?
My strong prediction is: No.
Why do I think that? The issue is: linear algebra is a giant pile of interrelated concepts: matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, etc. Any one sentence in the textbook makes no sense to someone who doesn’t already know some linear algebra, because it’s probably describing some connection between one nonsensical concept and another nonsensical concept.
E.g. here’s a sentence from a linear algebra textbook: “As a reminder, for any matrix M, and a matrix M′ equal to M after a row operation, multiplying by an elementary matrix E gave M′ = EM.” Try looking at that sentence through the eyes of someone who has never heard the words “matrix”, “row operation”, etc. It’s totally unintelligible gobbledygook, right?
The LLM needs to somehow make sense of this gobbledygook within the duration of a single forward pass, well enough to write down the first token on its scratchpad.
Now we do the second forward pass to add the second token to the CoT. But the weights haven’t changed! So the textbook is still gobbledygook! And the LLM still has only the duration of one forward pass to make sense of it.
No matter how many tokens are appended to the end of the CoT, you still have the issue that, each time you do a new forward pass, the LLM looks at its context window (textbooks + CoT scratchpad) “with fresh eyes”, and what it sees is a bunch of unintelligible gobbledygook that it has only the duration of one forward pass to make sense of.
Even if it somehow manages to print out some tokens that constitute progress on the linear algebra problem, those very tokens that it just printed out will also be gobbledygook, when it looks at them “with fresh eyes” on the next forward pass.
By contrast, if you give a human the same problem, i.e. she doesn’t know linear algebra but she has these textbooks and a scratchpad, she would be able to make progress on the problem, as long as you give her enough time (probably weeks or months), but she would make progress in a very different way from LLM CoT inference: she would be learning as she goes, changing the “weights” in her brain. After a few weeks, she could look at a sentence in the textbook, and it would no longer be unintelligible gobbledygook, but rather describing something about concepts that she is beginning to understand, and she can thus refine her understanding more and more.

And likewise, if she writes down notes on her scratchpad, she will be able to understand those notes afterwards, because she has been learning (changing the weights) the whole time. The learning (changing weights) is the essential part, the scratchpad is incidental and optional. A scratchpad without “real” continual learning (changing weights) would be useless to her.

Indeed, if she could time-travel to her past self, who didn’t yet know anything about linear algebra, and gift her own scratchpad to her past self, it wouldn’t help much. Her past self would still need to spend weeks learning all these new concepts. Indeed, time-traveled-notes-to-self is kinda what a textbook is—but owning a library full of unread math textbooks does not make someone a mathematician :-)
OK, so that’s my hypothesis: the linear-algebra-holdout LLM experiment would definitely fail. Nobody has done that experiment, but I claim that my guess is consistent with observations of actual LLMs:
For one thing, we might notice that companies care an awful lot about pretraining data (1,2), spending billions of dollars a year on it, which dovetails with my theory that LLMs are generally great at using concepts that already exist in the pretraining data, but bad at inventing and using new concepts that aren’t. It’s just that there’s so much pretraining data that you can do quite a lot without ever exiting the concept space that exists in the pretraining data.
For another thing, at least some brilliant people doing bleeding-edge stuff report that, when you’re doing something sufficiently innovative, LLMs get confused and fall back to concepts in the pretraining data. Relatedly, mathematicians seem to agree that LLMs, for all their impressive achievements, have not been coming up with new useful conceptual frames. See discussion here.
For another thing, I think it’s widely agreed that LLMs are best at self-contained tasks, and at things that have been done lots of times before, and that the more you get into weird idiosyncratic proprietary codebases, with lots of interrelated complexities that are not anywhere on the internet, the more likely they are to fail. This likewise seems to fit my theory that LLMs get “real understanding” ~only from the pretraining process, and that they crash and burn when the context window has lots of interconnected layered complexity that differs from anything in the pretraining data.
My main takeaway from your post is that naively training LLMs to imitate the behaviors of continually-learning policies (e.g., humans) who don’t leave externalized traces of their continual learning process is unlikely to work. (And I believe this is your main point.)
No, I believe something stronger than that, because I don’t think “externalized traces of their continual learning process” is relevant. I think that in the linear algebra holdout thought experiment above, LLMs would fail equally hard if we digitize Arthur Cayley’s notes from when he was inventing the matrix in the 1800s and put it into the context window, along with Hermann Grassmann’s notes etc. That’s not relevant.
It refers a few paragraphs earlier, i.e.: “I originally had two justifications for putting this in. (1) Provine found that laughter was 30× less frequent when people were alone; (2) Evolutionarily, there’s no point in emitting communicative signals when there’s no one around to hear them.” (Sorry that’s unclear, I’ll reword.)
This post contains no plan for technical AGI alignment (or anything else). I have no such plan. See the last two paragraphs of the post.
I am trying to find such a plan (or prove that none exists), and in the course of doing so, occasionally I come across a nugget of deconfusion that I want to share :-) Hence this post.
As a general rule, I take interest in certain things that humans sometimes do or want, not because I’m interested in copying those things directly into AGIs, but rather because they are illustrative case studies for building my nuts-and-bolts understanding of aspects of motivation and learning etc. And then I can use that understanding to try to dream up some engineered system that might be useful in AGIs. The resulting engineered system might or might not resemble anything in humans or biology. By analogy, the Wright Brothers learned a lot from soaring birds, but their plane did not look like a bird.
I think they’re mainly trying to win approval of other actual humans.
I think what people “mainly” do is not of much interest to me right now. If a few people sometimes do X, then it follows that X is a possible thing that a brain can do, and then I can go try to figure out how the brain does that, and maybe learn something useful for technical alignment of brain-like AGI.
So along those lines: I think that there exist people who have a self-image as a person with such-and-such virtue, and take pride in that, and will (sometimes) make decisions driven by that self-image even when they have high confidence that nobody will ever find out, or worse, when they have high confidence that the people they care most about will despise them for it. They (sometimes) make that decision anyway.
I think this kind of self-image-related motivation has a deep connection to other people’s approval, and is causally downstream of their experience of such approval over a lifetime. But it is definitely NOT the same as consequentialist planning to maximize future approval / status.
is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm? Even with eg. 3-5 OOMs more compute than GPT-4.5?
I say yes. You left out an important part, here it is in italics: “is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm churning for millions of steps?”
Yes, because an awful lot can happen in millions of steps, including things that build on each other in a serial way.
I worry you could’ve made this same argument ten years ago for simulating human expert behavior over 8 hour time horizons — which involves some learning, eg navigating a new code base, checking code on novel unit tests. It’s shallow learning, sure. You don’t have to update your world model that much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it.
I disagree that it should be called “learning” at all. It would be “learning” for a human in real life, but if you imagine a person who has read 2 billion lines of code [that’s the amount of GitHub code in The Pile … actually today’s LLMs probably see way more code than that], which would correspond to reading code 24 hours a day for 100 years, then I believe that such a person could do the METR 8 hour tasks without “learning” anything new whatsoever. You don’t need to “learn” new things to mix-and-match things you already know in novel ways—see my example here of “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”. And see also: related discussion here.
why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?
Yup, that’s my main point in this post, I expect that sooner or later somebody will invent real-deal continual learning, and it will look like a bona fide learning algorithm written in PyTorch with gradient descent steps and/or TD learning steps and/or whatever else, as opposed to (so-called) “in-context learning” or RAG etc.
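For concreteness, here's a minimal toy sketch of what I mean by a bona fide learning algorithm: a single weight, made-up data, and an actual gradient step on each new experience. None of the particulars (the model, the learning rate, the data stream) are claims about what the eventual algorithm will look like; the defining feature is just that each experience actually changes the stored weights.

```python
# Toy "real continual learning": weights actually change on each experience.

lr = 0.1  # learning rate (arbitrary illustrative value)

def continual_update(w, x, y):
    """One gradient-descent step on squared error for the model y ≈ w * x."""
    pred = w * x
    grad = 2 * (pred - y) * x   # d/dw of (w*x - y)^2
    return w - lr * grad

w = 0.0  # the single "weight", initially knowing nothing
# A stream of experiences arrives, each one repeatedly hinting that y = 3x:
for x, y in [(1.0, 3.0)] * 50:
    w = continual_update(w, x, y)

# The concept "y = 3x" now lives in the weights, not in any context window:
assert abs(w - 3.0) < 1e-3
```

The contrast with in-context learning is that here, after the loop ends, the knowledge persists in `w` with no need to re-derive it from scratch on every future query.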
Thanks, I just deleted that whole part. I do believe there’s something-like-that which is true, but it would take some work to pin down, and it’s not very relevant to this post, so I figure, I should just delete it. :-)
In case anyone’s curious, here’s the edit I just made:
OLD VERSION:
Anyway, by assuming “brain-like AGI”, I get the right to make certain assumptions about the cognitive architecture, representations, learning algorithms, and so on.
Some of these “brain-like AGI ingredients” are universal parts of today’s popular ML algorithms (e.g. learning algorithms; distributed representations).
Others of these “brain-like AGI ingredients” are (individually) present in a subset of today’s popular ML algorithms but absent from others (e.g. reinforcement learning; predictive [a.k.a. self-supervised] learning; explicit planning).
Still others of these “brain-like AGI ingredients” seem mostly or totally absent from today’s most popular ML algorithms (e.g. ability to form “thoughts” [e.g. “I’m going to the store”] that blend together immediate actions, short-term predictions, long-term predictions, and flexible hierarchical plans, inside a generative world-model that supports causal and counterfactual and metacognitive reasoning).
So in this sense, “brain-like AGI” is a specific thing that might or might not happen, independently of “prosaic AGI”. Much more on “brain-like AGI”, or at least its safety-relevant aspects, in the subsequent posts.
NEW VERSION:
Anyway, by assuming “brain-like AGI”, I get the right to make certain assumptions about the cognitive architecture, representations, learning algorithms, and so on. Some of those assumptions would also apply to some existing AI algorithms. But if you take the whole package together—all the parts and how they interconnect—it constitutes a yet-to-be-invented AI architecture. So in this sense, “brain-like AGI” is a specific thing that might or might not happen, independently of “prosaic AGI”. Much more on “brain-like AGI”, or at least its safety-relevant aspects, in the subsequent posts.
No worries, seems fine.
FWIW, my current feeling is like 25% probability that narrowing eyes (in anger etc.) has a functional explanation related to vision (as opposed to changing how your face looks to other people, or defending your eyes from attack, or whatever), and 80% probability that widening eyes (in fear etc.) has a functional explanation related to vision. But I didn’t think about it too hard.
In both cases, regardless of whether it’s functional or not, I have very high confidence that it’s an innate reaction, not a product of within-lifetime learning.
I think the point I was trying to make in this post is both narrower and weirder than the general topics of humans supervising more competent AIs, and generation-verification gaps. For example, my self-image might be partly formed from admiration of the character traits of a cartoon character, or Jesus, etc., and I might feel pride in acting in ways that I imagine them approving of, and that might influence how I go about my day-to-day conduct as a string theory researcher. But Jesus is long gone, and the cartoon character doesn’t even exist at all, and certainly neither was able to evaluate string theory ideas. They’re not “supervising” me in that sense.
Actual humans supervising actual AGIs is something that Paul talked about in IDA stuff, and like I said in the OP, I reject that entire line of research as a dead end.
Separately, I agree that “humans are an existence proof that safe & beneficial brain-like AGI is possible in principle” needs a heavy dose of nuance and caveats (humans are working towards misaligned AGI right now, plus I’d generally expect tech progress to drive humanity off the rails even without AGI or other destructive tech, among other things). But I do think there is some “…existence proof…” argument that goes through. E.g. at least some humans are making the overall situation better not worse (or if not, then we’re screwed no matter what), and AGIs don’t have to match the human population distribution.
Huh, I find the disgust example pretty plausible.
I agree that “universality-through-functionality” (§4.2) is implausible as a theory explaining all universal facial expressions. At least some universal facial expressions do not have immediate functional explanations, seems to me. E.g. the angry open-mouth tooth-showing grimace / scowl was presumably functional in chimps, because they’re showing off their fangs as a credible signal that they’re dangerous. We don’t have any fangs to show off, but we still have that same expression.
But I’m also sympathetic to there being more than zero universal facial expressions that do have immediate functional explanations. Not sure if I’m disagreeing with you or not.
(I don’t currently have a strong opinion one way or the other about whether Barrett’s claims here are plausible.)
I tried narrowing my eyes. This does not help improve my vision.
Well there is a relation between the aperture of a camera and its depth-of-field. (Famous example: pinhole cameras can focus any depth despite having no lens at all.) (Another famous example: I think this is why people squint when they aren’t wearing their glasses.) If the story is real at all, it might be more apparent in a dark environment, since then your pupil will be dilated, and also more apparent when trying to view something that has both near and far parts such that you can’t focus on both simultaneously. Yes it’s possible that this is too subtle an effect to matter in practice, I’m just trying to steelman it. I can’t immediately think of a DIY demo to try.
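To put a rough number on that steelman: here's the standard thin-lens circle-of-confusion approximation, with made-up eye-ish parameters (the exact values are illustrative assumptions, not measurements). The out-of-focus blur scales linearly with the aperture diameter, which is the sense in which squinting, or a constricted pupil, deepens depth of field.

```python
# Thin-lens circle-of-confusion approximation: how big a blur spot does an
# out-of-focus subject make, as a function of aperture diameter?
# All numbers below are illustrative guesses, not physiological data.

def blur_circle(aperture_mm, focal_mm, focus_mm, subject_mm):
    """Blur-spot diameter (mm) for a subject off the focus plane."""
    return (aperture_mm
            * abs(subject_mm - focus_mm) / subject_mm
            * focal_mm / (focus_mm - focal_mm))

# Eye-ish toy numbers: ~17 mm focal length, focused at 500 mm, subject at 2 m.
wide   = blur_circle(aperture_mm=6.0, focal_mm=17.0, focus_mm=500.0, subject_mm=2000.0)
squint = blur_circle(aperture_mm=2.0, focal_mm=17.0, focus_mm=500.0, subject_mm=2000.0)

# Blur is linear in aperture: squinting down to 1/3 the pupil diameter
# cuts the out-of-focus blur to 1/3 (and a pinhole takes it to ~zero).
assert abs(squint / wide - 2.0 / 6.0) < 1e-9
```

Whether a factor-of-a-few blur reduction is perceptually significant in practice is exactly the open question above; this just shows the effect has the right sign and a simple mechanism.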
Widening my eyes does not seem to improve my peripheral vision.
Hmm, I think I disagree. I think there’s a part of my field-of-view (mainly the top part obviously) that’s black when my eyes are relaxed, but that I can see when my eyes are widened. As usual with peripheral vision, you kinda have to be paying attention to it, and it’s also easier to notice when there’s something moving. Here’s the procedure I tried just now: Hold your head straight, pick a fixation point in front of you (or better yet downward), and hold your hand with wiggling fingers as high up as it can go until you can’t see the wiggling, repeat with and without widening your eyes. Seemed like a nonzero effect to me (but not huge).
See also a thread here where I was also complaining about this.
Nah, my model allows ASI without massive compute at any point in the process, see “Foom & Doom 1: ‘Brain in a box in a basement’” (esp. §1.3), and maybe also “The nature of LLM algorithmic progress” §4.