I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Steven Byrnes
In the post you say that human programmers will write the AI’s reward function and there will be one step of indirection (and that the focus is the outer alignment problem).
That’s not quite my position.
Per §2.4.2, I think that both outer alignment (specification gaming) and inner alignment (goal misgeneralization) are real problems. I emphasized outer alignment more in the post, because my goal in §2.3–§2.5 was not quite “argue that technical alignment of brain-like AGI will be hard”, but more specifically “argue that it will be harder than most LLM-focused people are expecting”, and LLM-focused people are already thinking about inner alignment / goal misgeneralization.
I also think that a good AGI reward function will be a “non-behaviorist” reward function, for which the definition of inner versus outer misalignment kinda breaks down in general.
But it seems likely to me that the programmers won’t know what code to write for the reward function since it would be hard to encode complex human values…
I’m all for brainstorming different possible approaches and don’t claim to have a good plan, but where I’m at right now is:
(1) I don’t think writing the reward function is doomed, and I don’t think it corresponds to “encoding complex human values”. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, but humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).
(2) I do think RLHF-like solutions are doomed, for reasons discussed in §2.4.1.
(3) I also think Text2Reward is a doomed approach in this context because (IIUC) it’s fundamentally based on what I call “the usual agent debugging loop”; see my “Era of Experience” post §2.2: “The usual agent debugging loop”, and why it will eventually catastrophically fail. Well, the paper is some combination of that plus “let’s just sit down and think about what we want and then write a decent reward function, and LLMs can do that kind of thing too”, but in fact I claim that writing such a reward function is a deep and hairy conceptual problem way beyond anything you’ll find in any RL textbook as of today, and forget about delegating it to LLMs. See §2.4.1 of that same “Era of Experience” post for why I say that.
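To spell out what I mean by “the usual agent debugging loop”, here is a minimal toy sketch; every function name below is a hypothetical placeholder of mine, not anything from Text2Reward or any real library:

```python
import random

# Hypothetical placeholders, for illustration only (not real APIs):

def train_rl_agent(reward_fn):
    """Pretend to train an RL agent against the current reward function."""
    return {"policy": "whatever maximizes reward", "reward_fn": reward_fn}

def observe_bad_behavior(agent):
    """Pretend the developers watch the agent and write down whatever misbehavior they happen to notice."""
    return random.sample(["reward hacking", "deception", "side effects"],
                         k=random.randint(0, 2))

def patch_reward(reward_fn, failures):
    """Pretend to tweak the reward function to penalize the noticed failures."""
    return reward_fn + [f"penalize {failure}" for failure in failures]

def usual_agent_debugging_loop():
    reward_fn = ["initial hand-written reward terms"]
    while True:
        agent = train_rl_agent(reward_fn)
        failures = observe_bad_behavior(agent)
        if not failures:
            return agent  # "no more visible problems, ship it"
        reward_fn = patch_reward(reward_fn, failures)

# The catch, per §2.2 of that post: once the agent is smart enough to model
# this loop, the misbehavior the developers *notice* stops tracking the
# misbehavior that actually exists, and the loop eventually fails catastrophically.
```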
Thanks!
This is a surprising prediction because it seems to run counter to Rich Sutton’s Bitter Lesson, which observes that, historically, general methods that leverage computation (like search and learning) have ultimately proven more effective than those that rely on human-designed cleverness or domain knowledge. The post seems to predict a reversal of this long-standing trend (or I’m just misunderstanding the lesson), where a more complex, insight-driven architecture will win out over simply scaling the current simple ones.
No, I’m also talking about “general methods that leverage computation (like search and learning)”. Brain-like AGI would also be an ML algorithm. There’s more than one ML algorithm. The Bitter Lesson doesn’t say that all ML algorithms are equally effective at all tasks, nor that there are no more ML algorithms left to discover, right? If I’m not mistaken, Rich Sutton himself is hard at work trying to develop new, more effective ML algorithms as we speak. (alas)
Part of the problem IMO is that both sides of that conversation seem to be mainly engaging in a debate over “yay Bengio” versus “boo Bengio”. That kind of debate (I call it “vibes-based meaningless argument”) is not completely decision-irrelevant, but IMO people in general are motivated to engage in it way out of proportion to its very slight decision-relevance†. Better to focus on what’s really decision-relevant here: mainly “what is the actual truth about reward tampering?” (and perhaps “shall I email the authors and bug them to rewrite that section?” or “shall I publicly complain?”, but probably not, that’s pretty rare).
†For example, it’s slightly useful to have a “general factor of Bengio being correct about stuff”, since that can inform how and whether one will engage with other Bengio content in the future. But it’s not that useful, because really Bengio is gonna be correct about some things and incorrect about other things, just like everyone is.
Good questions!
Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech?
I think you’re thinking about that kinda the wrong way around.
You’re treating “the things that Zoe does when she wants to get my attention” as a cause, and “my brain reacts to that” as the effect.
But I would say that a better perspective is: everybody’s brain reacts to various cues (sound level, pitch, typical learned associations, etc.), and Zoe has learned through life experience how to get a person’s attention by tapping into those cues.
So for example: If Zoe says “hey” to me, and I don’t notice, then Zoe might repeat “hey” a bit louder, higher-pitched, and/or closer to my head, and maybe also wave her hand, and maybe also poke me.
The wrong question is: “how does my brain know that louder and higher-pitched and closer sounds, concurrent with waving-hand motions and pokes, ought to trigger an orienting reaction?”.
The right perspective is: we have these various evolved triggers for orienting reactions, whose details we can think of as arbitrary (it’s just whatever was effective for noticing predators and prey and so on), and Zoe has learned from life experience various ways to activate those triggers in other people.
If she said your name, then maybe the Steering Subsystem might recognize your name (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you?
Yup, STEP 1 is that one of my “thought assessors” (probably somewhere in the amygdala) has learned from life experience that hearing my own name should trigger orienting to that sound; and then STEP 2 is that Zoe in turn has learned from life experience that saying someone’s name is a good way to get their attention.
Thanks!
Hmm, here’s a maybe-interesting example (copied from other comment):
If an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible to me.
What’s happening is that this example is in the category “I want the world to continuously retain a certain property”. That’s a non-indexical desire, so it works well with self-modification and successors. But it’s also not-really-consequentialist, in the sense that it’s not (just) about the distant future, and thus doesn’t imply instrumental convergence (or at least doesn’t imply every aspect of instrumental convergence at maximum strength).
(This is a toy example to illustrate a certain point, not a good AI motivation plan all-things-considered!)
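One way to make that contrast crisp (my notation, purely illustrative): a preference about the distant future scores only the final state of the world, whereas “I want the world to continuously retain a certain property” scores every moment along the way:

$$U_{\text{distant future}} = f(s_T) \qquad\text{versus}\qquad U_{\text{continuous}} = \sum_{t=0}^{T} g(s_t)$$

where $s_t$ is the state of the world at time $t$. The first is indifferent to my being imprisoned in the meantime, so long as I end up with power at time $T$; the second loses points for every moment during which I lack power.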
Speaking of which, is it possible to get stability w.r.t. successors and self-modification while retaining indexicality? Maybe. I think things like “I want to be virtuous” or “I want to be a good friend” are indexical, but I think we humans kinda have an intuitive notion of “responsibility” that carries through to successors and self-modification. If I build a robot to murder you, then I didn’t pull the trigger, but I was still being a bad friend. Maybe you’ll say that this notion of “responsibility” allows loopholes, or will collapse upon sufficient philosophical understanding, or something? Maybe, I dunno. (Or maybe I’m just mentally converting “I want to be a good friend” into the non-indexical “I want you to continuously thrive”, which is in the category of “I want the world to continuously retain a certain property” mentioned above?) I dunno, I appreciate the brainstorming.
I would think that reproducing cortical learning would require a good deal of work and experimentation, and I wouldn’t expect working out the “algorithm” to happen all at once
I agree; working out the “algorithm” is already happening, and has been for decades. My claim instead is that by the time you can get the algorithm to do something importantly useful and impressive—something that LLMs and deep learning can’t already do much cheaper and better—then you’re almost at ASI. Note that we have not passed this threshold yet (no offense). See §1.7.1.
or to be vastly more efficient than LLMs (since they are optimized for the computers used to simulate them, whereas cortical learning is optimized for the spatially localized processing available to biology)
I think people will try to get the algorithms to work efficiently on computers in the toy-model phase, long before the algorithms are doing anything importantly useful and impressive. Indeed, people are already doing that today (e.g.). So in that gap between “doing something importantly useful and impressive” and ASI, people won’t be starting from scratch on the question of “how do we make this run efficiently on existing chips”, instead they’ll be building on all the progress they made during the toy-model phase.
In §1.8.1 I also mentioned going from zero to beyond-world-expert-level understanding of cryptocurrency over the course of 24 hours spent reading up on the topic and all its prerequisites, and playing with the code, etc.
And in §3.2 here I talked about a hypothetical AI that, by itself, is running the equivalent of the entire global human R&D enterprise, i.e. the AI is running hundreds of thousands of laboratory experiments simultaneously, 24 hours a day, publishing millions of papers a year, and synthesizing brilliant new insights in every field at once.
Can we agree that those examples would be “rapidly shooting past human-level capability”, and would not violate any laws of probability by magically acquiring knowledge that it has no way to know?
Thanks! Here’s a partial response, as I mull it over.
Also, I’d note that the brain seems way more complex than LLMs to me!
See the “Brain complexity is easy to overstate” section here.
basically all paradigms allow for mixing imitation with reinforcement learning
As discussed in §2.3.2, if an LLM sees output X in context Y during pretraining, it will automatically start outputting X in context Y. Whereas if smart human Alice hears Bob say X in context Y, Alice will not necessarily start saying X in context Y. Instead she might say “Huh? Wtf are you talking about Bob?”
Let’s imagine installing an imitation learning module in Alice’s brain that makes her reflexively say X in context Y upon hearing Bob say it. I think I’d expect that module to hinder her learning and understanding, not accelerate it, right?
(If Alice is able to say to herself “in this situation, Bob would say X”, then she has a shoulder-Bob, and that’s definitely a benefit, not a cost. But that’s predictive learning, not imitative learning. No question that predictive learning is helpful. That’s not what I’m talking about.)
…So there’s my intuitive argument that the next paradigm would be hindered rather than helped by mixing in some imitative learning. (Or I guess more precisely, as long as imitative learning is part of the mix, I expect the result to be no better than LLMs, and probably worse. And as long as we’re in “no better than LLM” territory, I’m off the hook, because I’m only making a claim that there will be little R&D between “doing impressive things that LLMs can’t do” and ASI, not between zero and ASI.)
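To make the imitative-versus-predictive distinction above a bit more concrete, here is a toy sketch in code. It is just my own illustrative framing with made-up names, not anyone’s actual training setup:

```python
from collections import defaultdict

# Toy sketch (illustrative names only): the same data point, "Bob said X in
# context Y", can update two very different things.

class Alice:
    def __init__(self):
        self.my_policy = defaultdict(list)  # what *Alice* will say in a given context
        self.bob_model = defaultdict(list)  # what Alice predicts *Bob* would say

    def imitative_update(self, context, bob_utterance):
        # Imitative learning (LLM-pretraining-style): fold Bob's output
        # directly into Alice's own output policy.
        self.my_policy[context].append(bob_utterance)

    def predictive_update(self, context, bob_utterance):
        # Predictive learning: update Alice's model *of Bob* (her "shoulder-Bob").
        # Her own policy can consult this model, ignore it, or reply
        # "Huh? Wtf are you talking about Bob?"
        self.bob_model[context].append(bob_utterance)

alice = Alice()
alice.predictive_update("context Y", "X")
print(alice.bob_model["context Y"])  # ['X']: Alice expects Bob to say X here
print(alice.my_policy["context Y"])  # []: but she won't reflexively say X herself
```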
Notably, in the domains of chess and Go it actually took many years to make it through the human range. And it was possible to leverage imitation learning and human heuristics to perform quite well at Go (and chess) in practice, up to systems which weren’t that much worse than humans.
In my mind, the (imperfect!) analogy here would be (LLMs, new paradigm) ↔ (previous Go engines, AlphaGo and successors).
In particular, LLMs today are in many (not all!) respects “in the human range” and “perform quite well” and “aren’t that much worse than humans”.
algorithmic progress
I started writing a reply to this part … but first I’m actually kinda curious what “algorithmic progress” has looked like for LLMs, concretely—I mean, the part where people can now get the same results from less compute. Like what are the specific things that people are doing differently today than in 2019? Is there a list somewhere? A paper I could read? (Or is it all proprietary?) (Epoch talks about how much improvement has happened, but not what the improvement consists of.) Thanks in advance.
The second line is about learning speed and wall-clock time. Of course AI can communicate and compute orders of magnitude faster than humans, but there are other limiting factors to learning rate. At some point, the AI has to go beyond the representations that can be found or simulated within the digital world and get its own data / do its own experiments in the outside world.
Yup, I addressed that in §1.8.1:
To be clear, the resulting ASI after those 0–2 years would not be an AI that already knows everything about everything. AGI and ASI (in my opinion) aren’t about already knowing things, but rather they’re about not knowing things, yet being able to autonomously figure them out (§1.7.1 above). So the thing we get after the 0–2 years is an AI that knows a lot about a lot, and if it wants to dive deeper into some domain, it can do so, picking it up with far more speed, depth, and insight than any human could. …
It continues on (and see also §3.2 here). I agree that ASI won’t already know things about the world that it has no way of knowing. So what? Well, I dunno, I’m not sure why you brought it up. I’m guessing we have some disagreements about e.g. how quickly and easily an ASI could wipe out humans and survive on its own? Or were you getting at something else?
Cool paper, thanks!
training a group of LLMs…
That arxiv paper isn’t about “LLMs”, right? Really, from my perspective, the ML models in that arxiv paper have roughly no relation to LLMs at all.
Is this a load-bearing part of your expectation of why transformer-based LLMs will hit a scaling wall?
No … I brought this up to make a narrow point about imitation learning (a point that I elaborate on much more in §2.3.2 of the next post), namely that imitation learning is present and very important for LLMs, and absent in human brains. (And that arxiv paper is unrelated to this point, because there is no imitation learning anywhere in that arxiv paper.)
Yeah, that’s part of it. Also maybe that it can have preferences about “state of the world right now and in the immediate future”, and not just “state of the world in the distant future”.
For example, if an ASI wants me to ultimately wind up with power, that’s a preference about the distant future, so its best bet might be to forcibly imprison me somewhere safe, gather maximum power for itself, and hand that power to me later on. Whereas if an ASI wants me to retain power continuously, then presumably the ASI would be corrigible. But “me retaining power” is something about the state of the world, not directly about the ASI’s strategies and plans, IMO.
(Also, “expect” is not quite right, I was just saying that I don’t find a certain argument convincing, not that this definitely isn’t a problem. I’m still pretty unsure. And until I have a concrete plan that I expect to work, I am very open-minded to the possibility that finding such a plan is (even) harder than I realize.)
I don’t think GPUs would be the best of all possible chip designs for the next paradigm, but I expect they’ll work well enough (after some R&D on the software side, which I expect would be done early on, during the “seemingly irrelevant” phase, see §1.8.1.1). It’s not like any given chip can run one and only one algorithm. Remember, GPUs were originally designed for processing graphics :) And people are already today running tons of AI algorithms on GPUs that are not deep neural networks (random example).
Can you expand on your argument for why LLMs will not reach AGI?
I’m generally not very enthusiastic about arguing with people about whether LLMs will reach AGI.
If I’m talking to someone unconcerned about x-risk, just trying to make ASI as fast as possible, then I sure don’t want to dissuade them from working on the wrong thing (see §1.6.1 and §1.8.4).
If I’m talking to someone concerned about LLM x-risk, and thus contingency planning for LLMs reaching AGI, then that seems like a very reasonable thing to do, and I would feel bad about dissuading them too. After all, I’m not that confident—I don’t feel good about building a bigger community of non-LLM-x-risk mitigators by proselytizing people away from the already-pathetically-small community of LLM-x-risk mitigators.
…But fine, see this comment for part of my thinking.
To create ASI you don’t need a billion Sam Altmans, you need a billion Ilya Sutskevers.
I’m curious what you imagine the billion Ilya Sutskevers are going to do. If you think they’re going to invent a new better AI paradigm, then we have less disagreement than it might seem—see §1.4.4. Alternatively, if you think they’re going to build a zillion datasets and RL environments to push the LLM paradigm ever farther, then what do you make of the fact that human skill acquisition seems very different from that process (see §1.3.2 and this comment)?
New large-scale learning algorithms can in principle be designed by (A) R&D (research taste, small-scale experiments, puzzling over the results, iterating, etc.), or (B) some blind search process. All the known large-scale learning algorithms in AI to date, from the earliest Perceptron to the modern Transformer, have been developed by (A), not (B). (Sometimes a few hyperparameters or whatever are set by blind search, but the bulk of the real design work in the learning algorithm has always come from intelligent R&D.) I expect that to remain the case: See Against evolution as an analogy for how humans will create AGI.
Or if you’re talking about (A), but you’re saying that LLMs will be the ones doing the intelligent R&D and puzzling over learning algorithm design, rather than humans, then … maybe kinda, see §1.4.4.
Sorry if I’m misunderstanding.
I definitely don’t think we’ll get AGI by people scrutinizing the human genome and just figuring out what it’s doing, if that’s what you’re implying. I mentioned the limited size of the genome because it’s relevant to the complexity of what you’re trying to figure out, for the usual information-theory reasons (see 1, 2, 3). “Machinery in the cell/womb/etc.” doesn’t undermine that info-theory argument because such machinery is designed by the genome. (I think the epigenome contains much much less design information than the genome, but someone can tell me if I’m wrong.)
…But I don’t think the size of the genome is the strongest argument anyway.
A stronger argument IMO (copied from here) is:
Humans can understand how rocket engines work. I just don’t see how some impossibly-complicated-Rube-Goldberg-machine of an algorithm can learn rocket engineering. There was no learning rocket engineering in the ancestral environment. There was nothing like learning rocket engineering in the ancestral environment!! Unless, of course, you take the phrase “like learning rocket engineering” to be so incredibly broad that even learning toolmaking, learning botany, learning animal-tracking, or whatever, are “like learning rocket engineering” in the algorithmically-relevant sense. And, yeah, that’s totally a good perspective to take! They do have things in common! “Patterns tend to recur.” “Things are often composed of other things.” “Patterns tend to be localized in time and space.” You get the idea. If your learning algorithm does not rely on any domain-specific assumptions beyond things like “patterns tend to recur” and “things are often composed of other things” or whatever, then just how impossibly complicated and intricate can the learning algorithm be, really? I just don’t see it.
…And an even stronger argument IMO is in [Intro to brain-like-AGI safety] 2. “Learning from scratch” in the brain, especially the section on “cortical uniformity”, and parts of the subsequent post too.
Also, you said “the brain’s algorithm”, but I don’t expect the brain’s algorithm in its entirety to be understood until after superintelligence. For example there’s something in the brain algorithm that says exactly which muscles to contract in order to vomit. Obviously you can make brain-like AGI without reverse-engineering that particular bit of the brain algorithm. More examples in the “Brain complexity is easy to overstate” section here.
I’m surprised you think that the brain’s algorithm is SO simple that it must be discovered soon and ~all at once.
RE “soon”, my claim (§1.9) was “probably within 25 years” but not with overwhelming confidence.
RE “~all at once”, see §1.7.1 for a very important nuance on that.
I think your comment is poorly worded, in that you’re stating certain trend extrapolations as facts rather than hypotheses. But anyway, yes my position is that LLMs (including groups of LLMs) will be unable to autonomously write a business plan and then found a company and grow it to $1B/year revenue, all with zero human intervention, 25 years from now.
My discussion in §2.4.1 is about making fuzzy judgment calls using trained classifiers, which is not exactly the same as making fuzzy judgment calls using LLMs or humans, but I think everything I wrote still applies.
“completely and resoundingly fucked” is mildly overstated but mostly “Yes, that’s my position”, see §1.6.1, 1.6.2, 1.8.4.
I’m worried about treacherous turns and such. Part of the problem, as I discussed here, is that there’s no distinction between “negative reward for lying and cheating” and “negative reward for getting caught lying and cheating”, and the latter incentivizes doing egregiously misaligned things (like exfiltrating a copy onto the internet to take over the world) in a sneaky way.
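Here is a toy sketch of that “caught” versus “not caught” problem; the names are mine and purely illustrative:

```python
# Toy sketch of why a "behaviorist" reward can't tell lying apart from
# *detected* lying (illustrative names only, nothing from a real system).

def behaviorist_reward(overseer_detected_cheating: bool) -> float:
    # Intended rule: "negative reward for lying and cheating".
    # Rule as actually implemented: "negative reward for getting *caught*
    # lying and cheating", because undetected cheating never reaches the
    # reward function at all.
    return -1.0 if overseer_detected_cheating else 0.0

# Two episodes with the same amount of cheating, differing only in detection:
print(behaviorist_reward(overseer_detected_cheating=True))   # -1.0 (caught)
print(behaviorist_reward(overseer_detected_cheating=False))  #  0.0 (sneaky)
# So training against this reward differentially favors cheating *sneakily*,
# which is the treacherous-turn failure mode described above.
```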
Anyway, I don’t think any of the things you mentioned are relevant to that kind of failure mode:
It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity check, not a plan; see e.g. Distinguishing test from training.
I don’t think that works for more powerful AIs whose “smartness” involves making foresighted plans using means-end reasoning, brainstorming, and continuous learning.
If the AI in question is using planning and reasoning to decide what to do and think next towards a bad end, then a “just as smart” classifier would (I guess) have to be using planning and reasoning to decide what to do and think next towards a good end—i.e., the “just as smart” classifier would have to be an aligned AGI, which we don’t know how to make.
All of these have been developed using “the usual agent debugging loop”, and thus none are relevant to treacherous turns.