I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Steven Byrnes
Thanks!
a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values
I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more ‘reward’ as defined by its immediate reward signal”.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.
To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.
(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)
Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.
Though I’m not sure why you still don’t think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout’s argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn’t want to change its values for the sake of goal-content integrity:
If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.
Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.
And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.
Humans addicted to drugs often exhibit weird meta-preferences like ‘I want to stop wanting the drug’, or ‘I want to find an even better kind of drug’.
“I want to stop wanting the drug” is downstream of the fact that people have lots of innate drives giving rise to lots of preferences, and the appeal of the drug itself is just one of these many competing preferences and drives.
However, I specified in §1 that if you’re going for Reward Button Alignment, you should zero out the other drives and preferences first. So that would fix the problem. That part is a bit analogous to clinical depression: all the things that you used to like—your favorite music, sitting at the cool kids’ table at school, satisfying curiosity, everything—just lose their appeal. So now if the reward button is strongly motivating, it won’t have any competition.
“I want to find an even better kind of drug” might be a goal generalization / misgeneralization thing. The analogy for the AGI would be to feel motivated by things somehow similar to the reward button being pressed. Let’s say, it wants other red buttons to be pressed. But then those buttons are pressed, and there’s no reward, and the AGI says “oh, that’s disappointing, I guess that wasn’t the real thing that I like”. Pretty quickly it will figure out that only the real reward button is the thing that matters.
Ah, but what if the AGI builds its own reward button and properly wires it up to its own reward channel? Well sure, that could happen (although it also might not). We could defend against it by cybersecurity protections, such that the AGI doesn’t have access to its own source code and reward channel. That doesn’t last to superintelligence, but we already knew that this plan doesn’t last to superintelligence, to say the least.
I am not at all confident that a smart thing exposed to the button would later generalise to coherent, super-smart thing that wants the button to be pressed.
I think you’re in a train-then-deploy mentality, whereas I’m talking about RL agents with continuous learning (e.g. how the brain works). So if the AGI has some funny idea about what is Good, generalized from idiosyncrasies of previous button presses, then it might try to make those things happen again, but it will find that the results are unsatisfying, unless of course the button was actually pressed. It’s going to eventually learn that nothing feels satisfying unless the button is actually pressed. And I really don’t think it would take very many repetitions for the AGI to figure that out.
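To make “I really don’t think it would take very many repetitions” concrete, here’s a toy sketch (my own illustration with made-up numbers, not a claim about how a real AGI’s internals would look) of continuous learning doing its thing: every value estimate keeps getting nudged toward the reward that actually arrives, so the misgeneralized appeal of “other red buttons” dies off within a handful of disappointments while the real button’s value stays put.

```python
# Toy continuous-learning illustration (made-up numbers): each value estimate is repeatedly
# nudged toward the reward that actually arrives after that kind of event.

values = {"real reward button pressed": 0.8, "other red button pressed": 0.8}
actual_reward = {"real reward button pressed": 1.0, "other red button pressed": 0.0}
learning_rate = 0.5

for repetition in range(5):
    for event in values:
        # The value-function edit: move the prediction toward what actually happened.
        values[event] += learning_rate * (actual_reward[event] - values[event])
    print(repetition, {k: round(v, 3) for k, v in values.items()})

# After ~5 repetitions, "other red button pressed" has lost nearly all its appeal (~0.03),
# while "real reward button pressed" has settled near the reward it reliably delivers (~0.99).
```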
It might want the button to be pressed in some ways but not others, it might try to self-modify (including hacking into its reward channel) as above, it might surreptitiously spin off modified copies of itself to gather resources and power around the world and ultimately help with button-pressing, all sorts of stuff could happen. But I’m assuming that these kinds of things are stopped by physical security and cybersecurity etc.
Recall that I did describe it as “a terrible plan” that should be thrown into an incinerator. We’re just arguing here about the precise nature of how and when it will fail.
Nice talk, and looking forward to the rest of the series!
They’re exposed to our discourse about consciousness.
For what it’s worth, I would strongly bet that if you purge all discussion of consciousness from an LLM training run, the LLMs won’t spontaneously start talking about consciousness, or anything of the sort.
(I am saying this specifically about LLMs; I would expect discussion-of-consciousness to emerge 100% from scratch in different AI algorithms.)
AIs maybe specifically crafted/trained to seem human-like and/or conscious
Related to this, I really love this post from 2 years ago: Microsoft and OpenAI, stop telling chatbots to roleplay as AI. The two-sentence summary is: You can train an LLM to roleplay as a sassy talking pink unicorn, or whatever else. But what companies overwhelmingly choose to do is train LLMs to roleplay as LLMs.
Gradual replacement maybe proves too much re: recordings and look-up table … ambiguity/triviality about what computation a system implements, weird for consciousness to depend on counterfactual behavior …
I think Scott Aaronson has a good response to that, and I elaborate on the “recordings and look-up table” and “counterfactual” aspects in this comment.
I’m not sure exactly what you’re getting at, you might need to elaborate.
If dogs could understand English, and I said “I’ll give you a treat if you walk on your hind legs”, then the dog would try to walk on its hind legs. Alas, dogs do not understand English, so we need to resort to more annoying techniques like shaping. But people do understand English, and so will AGIs, so we don’t need shaping, we can just do it the easy way.
Thanks! Part of it is that @TurnTrout was probably mostly thinking about model-free policy optimization RL (e.g. PPO), whereas I’m mostly thinking about actor-critic model-based RL agents (especially how I think the human brain works).
Another part of it is that
TurnTrout is arguing against “the AGI will definitely want the reward button to be pressed; this is universal and unavoidable”,
whereas I’m arguing for “if you want your AGI to want the reward button to be pressed, that’s something that you can make happen, by carefully following the instructions in §1”.
I think both those arguments are correct, and indeed I also gave an example (block-quote in §8) of how you might set things up such that the AGI wouldn’t want the reward button to be pressed, if that’s what you wanted instead.
I reject “intrinsic versus extrinsic motivation” as a meaningful or helpful distinction, but that’s a whole separate rant (e.g. here or here).
If you replaced the word “extrinsic” with “instrumental”, then now we have the distinction between “intrinsic versus instrumental motivation”, and I like that much better. For example, if I’m walking upstairs to get a sweater, I don’t particularly enjoy the act of walking upstairs for its own sake, I just want the sweater. Walking upstairs is instrumental, and it explicitly feels instrumental to me. (This kind of explicit self-aware knowledge that some action is instrumental is a thing in at least some kinds of actor-critic model-based RL, but not in model-free RL like PPO, I think.) I think that’s kinda what you’re getting at in your comment. If so, yes, the idea of Reward Button Alignment is to deliberately set up an instrumental motivation to follow instructions, whereas that TurnTrout post (or my §8 block quote) would be aiming at an intrinsic motivation to follow instructions (or to do such-and-such task).
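Here’s a toy sketch of that contrast (my own framing with hypothetical names, not anyone’s actual architecture): the model-based planner explicitly represents that “walk upstairs” only matters because of the state it leads to, while the model-free lookup has nowhere to store that fact.

```python
# Toy contrast (hypothetical names, not a real architecture). A model-based planner scores an
# action by rolling it through a world-model and evaluating the resulting state, so the fact
# that "walk upstairs" only matters because it leads to "holding a sweater" is explicitly
# represented. A model-free policy just caches "this action is good in this state", with no
# record of why.

def model_based_choice(state, actions, world_model, value_of_state):
    scored = {a: value_of_state(world_model(state, a)) for a in actions}
    best = max(scored, key=scored.get)
    # The instrumental structure is explicit: chosen *because of* the predicted outcome.
    return best, world_model(state, best)

def model_free_choice(state, cached_action_values):
    # PPO-style caricature: highest cached value wins; the downstream reason is not stored.
    return max(cached_action_values[state], key=cached_action_values[state].get)

# Tiny usage example:
world_model = lambda s, a: {"downstairs": {"walk upstairs": "upstairs with sweater"}}.get(s, {}).get(a, s)
value_of_state = lambda s: 1.0 if s == "upstairs with sweater" else 0.0
print(model_based_choice("downstairs", ["walk upstairs", "stay put"], world_model, value_of_state))
# -> ('walk upstairs', 'upstairs with sweater')
```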
I agree that setting things up such that an AGI feels an intrinsic motivation to follow instructions (or to do such-and-such task) would be good, and certainly way better than Reward Button Alignment, other things equal, although I think actually pulling that off is harder than you (or probably TurnTrout) seem to think—see my long discussion at Self-dialogue: Do behaviorist rewards make scheming AGIs?
Reward button alignment
I think I like the thing I wrote here:
To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.
By “world” I mean “reality” more broadly, possibly including the multiverse or whatever the agent cares about. So for example:
But I think he doesn’t think there’s anything wrong with e.g. wanting there to be diamonds in the stellar age and paperclips afterward, it just requires a (possibly disproportionally) more complex utility function.
This is still “preference purely over future states” by my definition. It’s important that timestamps during the course-of-action are not playing a big role in the decision, but it’s not important that there is one and only one future timestamp that matters. I still have consequentialist preferences (preferences purely over future states) even if I care about what the universe is like in both 3000AD and 4000AD.
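To put the distinction in rough symbols (my notation, just for illustration): a “preference purely over future states” scores a course of action by some function

$$U = f(s_{3000\text{ AD}},\ s_{4000\text{ AD}},\ \dots)$$

of the world-state at one or more future times, whereas “other kinds of preferences” can also depend on the trajectory in between, including the actions themselves:

$$U = g(s_0, a_0, s_1, a_1, \dots, s_T).$$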
If you were making any other point in that section, I didn’t understand it.
The part where you wrote “not trajectories as in “include preferences about the actions you take” kind of sense, but only about how the universe unfolds” sounds to me like you’re invoking non-indexical preferences? (= preferences that make no reference to this-agent-in-particular.) If so, I don’t see any relation between non-indexical preferences and “preferences purely about future states”—I think all four quadrants are possible, and that things like instrumental convergence depend only on “preferences purely about future states”, independent of indexicality.
It seems to me quite plausible that this is how MIRI got started with corrigibility, and it doesn’t seem too different from what they wrote about on the shutdown button.
I think it’s fundamentally different, but I already made that argument in a previous comment. Guess we have to just agree to disagree.
I don’t think your objection that you would need to formalize pattern-matching to fuzzy time-extended concepts is reasonable. To the extent that the concepts humans use are incoherent, that is very worrying (e.g. if the helpfulness accessor is incoherent it will in the limit probably get money pumped somehow leaving the long-term outcomes be mainly based on the outcome-goal accessor). To the extent that the “the humans will remain in control” concept is coherent, the concepts are also just math…
I don’t think fuzzy time-extended concepts are necessarily “incoherent”, although I’m not sure I know what you mean by that anyway. I do think it’s “just math” (isn’t everything?), but like I said before, I don’t know how to formalize it, and neither does anyone else, and if I did know then I wouldn’t publish it because of infohazards.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I reject this way of talking, in this context. We shouldn’t use the passive voice, “preference that could be…optimized”. There is a particular agent which has the preferences and which is doing the optimization, and it’s the properties of this agent that we’re talking about. It will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t, and it will do so via methods that it wants to employ, and not via methods that it doesn’t want to employ, etc.
Other than that, I think you were reading this post as a positive case that I have a plan that will work, rather than as a rebuttal to an argument that this line of research is fundamentally doomed.
For example, if someone says they have a no-go argument that one cannot prove the Riemann hypothesis by (blah) type of high-level strategy, then it’s fine to rebut this argument, without knowing how to execute the high-level strategy, and while remaining open-minded that there is a different better no-go argument for the same conclusion.
I feel like the post says that a bunch of times, but I’m open to making edits if there’s any particular text that you think gives the wrong impression.
You think people don’t read books if they confidently disagree with the title? (Not rhetorical; I read books I confidently disagree with but I’m not an average book reader.)
What about people who aren’t coming in with a strong opinion either way? Isn’t that most potential readers, and the main target audience?
E.g. “The Myth of the Rational Voter” book title implies a strong claim that voters are not rational. If I had walked by that book on a bookshelf 15 years ago (before I knew anything about the topic or author), I imagine that I would have been intrigued and maybe bought it, not because I already confidently believed that voters are not rational but because, I dunno, it might have seemed interesting and fun to read, on a topic I didn’t already know much about, so maybe I’d learn something.
Thanks!
What’s the difference between “utilities over outcomes/states” and “utilities over worlds/universes”?
(I didn’t use the word “outcome” in my OP, and I normally take “state” to be shorthand for “state of the world/universe”.)
I didn’t intend for the word “consequentialist” to imply CDT, if that’s what you’re thinking. I intended “consequentialist” to be describing what an agent’s final preferences are (i.e., they’re preferences about the state of the world/universe in the future, and more centrally the distant future), whereas I think decision theory is about how to make decisions given one’s final preferences.
I think your proposal is way too abstract. If you think it’s actually coherent you should write it down in math.
I wrote: “For example, pause for a second and think about the human concept of “going to the football game”. It’s a big bundle of associations containing immediate actions, and future actions, and semantic context, and expectations of what will happen while we’re doing it, and expectations of what will result after we finish doing it, etc. We humans are perfectly capable of pattern-matching to these kinds of time-extended concepts, and I happen to expect that future AGIs will be as well.”
I think this is true, important, and quite foreign compared to the formalisms that I’ve seen from MIRI or Stuart Armstrong.
Can I offer a mathematical theory in which the mental concept of “going to the football game” comfortably sits? No. And I wouldn’t publish it even if I could, because of infohazards. But it’s obviously possible because human brains can do it.
I think “I am being helpful and non-manipulative”, or “the humans remain in control” could be a mental concept that a future AI might have, and pattern-match to, just as “going to the football game” is for me. And if so, we could potentially set things up such that the AI finds things-that-pattern-match-to-that-concept to be intrinsically motivating. Again, it’s a research direction, not a concrete plan. But I talk about it more at Plan for mediocre alignment of brain-like [model-based RL] AGI. For what it’s worth, I think I’m somewhat more skeptical of this research direction now than when I wrote that 2 years ago, more on which in a (hopefully) forthcoming post.
My personal guess would even be that MIRI tried pretty exactly that what you’re suggesting in this post…
MIRI was trying to formalize things, and nobody knows how to formalize what I’m talking about, involving pattern-matching to fuzzy time-extended learned concepts, per above.
Anyway, I’m reluctant to get in an endless argument about what other people (who are not part of this conversation) believe. But FWIW, The Problem of Fully Updated Deference is IMO a nice illustration of how Eliezer has tended to assume that ASIs will have preferences purely about the state of the world in the future. And also everything else he said and wrote especially in the 2010s, e.g. the one I cited in the post. He doesn’t always say it out loud, but if he’s not making that assumption, almost everything he says in that post is trivially false. Right? You can also read this 2018 comment where (among other things) he argues more generally that “corrigibility is anti-natural”; my read of it is that he has that kind of preference structure at the back of his mind, although he doesn’t state that explicitly and there are other things going on too (e.g. as usual I think Eliezer is over-anchored on the evolution analogy.)
Here’s a nice video of a teleoperated robot vacuuming, making coffee, cleaning a table, emptying the dishwasher, washing & drying & folding & hanging laundry, making a bed, etc. They also have a video where it cooks a meal. Pretty impressive! It’s surprising how much you can do without sensitive fingers!
Their website lists a bill of materials for the teleoperated robot of ≈$30K, or ≈$20K if hypothetically there were an AGI teleoperating it (because you wouldn’t need the teleoperation UI parts, or on-board laptop). It’s a one-off made by students.
Re SMTM: negative feedback on negative feedback
Say during keeper training the keeper was rewarded for thinking in productive ways, so the value function may have learned to supply valence for thinking in productive ways.
The way I currently think of it, it doesn’t matter which goal the keeper then attacks, because the value function still assigns high valence for thinking in those fun productive ways.
You seem to be in a train-then-deploy mindset, rather than a continuous-learning mindset, I think. In my view, the value function never stops being edited to hew closely to primary rewards. The minute the value function claims that a primary reward is coming, and then no primary reward actually arrives, the value function will be edited to not make that prediction again.
For example, imagine a person who has always liked listening to jazz, but right now she’s clinically depressed, so she turns on some jazz, but finds that it doesn’t feel rewarding or enjoyable. Not only will she turn the music right back off, but she has also learned that it’s pointless to even turn it on, at least when she’s in this mood. That would be a value function update.
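As a bare-bones way to write down that kind of edit (my shorthand, not a claim about the exact brain implementation):

$$V(\text{turn on jazz}) \leftarrow V(\text{turn on jazz}) + \alpha\,\big(r_{\text{actual}} - V(\text{turn on jazz})\big)$$

During the depressive episode $r_{\text{actual}} \approx 0$, so a few disappointments are enough to drag $V$ down toward zero, and she stops bothering to turn the music on.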
Now, it’s possible that the Keeper 101 course was taught by a teacher who the trainee looked up to. Then the teacher said “X is good”, where X could be a metacognitive strategy, a goal, a virtue, or whatever. The trainee may well continue believing that X is good after graduation. But that’s just because there’s a primary reward related to social instincts, and imagining yourself as being impressive to people you admire. I agree that this kind of primary reward can support lots of different object-level motivations—cultural norms are somewhat arbitrary.
How do you imagine the value function might learn to assign negative valence to “pdoom=high”?
Could be the social copying thing I mentioned above, or else the person is thinking of one of the connotations and implications of pdoom that hooks into some other primary reward, like maybe they imagine the robot apocalypse will be physically painful, and pain is bad (primary reward), or doom will mean no more friendship and satisfying-curiosity, but friendship and satisfying-curiosity are good (primary reward), etc. Or more than one of the above, and/or different for different people.
I think your setup is weird for the context where people talk about utility functions, circular preferences, and money-pumping. “Having a utility function” is trivial unless the input to the utility function is something like “the state of the world at a certain time in the future”. So in that context, I think we should be imagining something like this:
And the stereotypical circular preference would be between “I want the world to be in State {A,B,C} at a particular future time T”.
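For concreteness, here’s the standard money-pump story in toy form (made-up numbers), with the circular preference being over which state the world ends up in at time T:

```python
# Toy money-pump (standard textbook story, made-up numbers): the agent has circular
# preferences over future world-states, A > B > C > A, and will pay a small fee for each
# "upgrade" toward a preferred state, so it can be walked around the cycle indefinitely.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}   # (x, y) means "state x preferred to state y"

def pays_to_switch(current, offered):
    return (offered, current) in prefers         # pays iff it strictly prefers the offered state

state, money, fee = "A", 10.0, 1.0
for offered in ["C", "B", "A", "C", "B", "A"]:   # walk it around the cycle twice
    if pays_to_switch(state, offered):
        state, money = offered, money - fee
print(state, money)   # -> A 4.0 : back where it started, 6.0 poorer
```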
I think you’re mixing up an MDP RL scenario with a consequentialist planning scenario? MDP RL agents can make decisions based on steering towards particular future states, but they don’t have to, and they often don’t, especially if the discount rate is high.
“AI agents that care about the state of the world in the future, and take actions accordingly, with great skill” are a very important category of AI to talk about because (1) such agents are super dangerous because of instrumental convergence, (2) people will probably make such agents anyway, because people care about the state of the world in the future, and also because such AIs are very powerful and impressive etc.
Yeah I don’t understand where Eliezer is coming from in that tweet (and he’s written similar things elsewhere).
I don’t know as much about algorithms as you; my argument centers instead on the complexity of the world.
Like, here a question: is “tires are usually black”, and stuff like that, encoded directly in the source code?
If yes, then (1) that’s unrealistic, and (2) even if it happened, the source code would wind up horrifically complicated and inscrutable, because it’s a complicated world.
If no, then the source code must be defining a learning algorithm of some sort, which in turn will figure out for itself that tires are usually black. Might this learning algorithm be simple and legible? Yes! But that was also true for GPT-3 too, which Eliezer has always put in the inscrutable category.
So what is he talking about? He must have some alternative in mind, but I’m unsure what.
Thanks, that’s very helpful!
If we divide the inventing-ASI task into (A) “thinking about and writing algorithms” versus (B) “testing algorithms”, in the world of today there’s a clean division of labor where the humans do (A) and the computers do (B). But in your imagined October 2027 world, there’s fungibility between how much compute is being used on (A) versus (B). I guess I should interpret your “330K superhuman AI researcher copies thinking at 57x human speed” as what would happen if the compute hypothetically all went towards (A), none towards (B)? And really there’s gonna be some division of compute between (A) and (B), such that the amount of (A) is less than I claimed? …Or how are you thinking about that?
I’m curious about your definition for importantly useful AI actually. Under some interpretations I feel like current AI should cross that bar.
Right, but I’m positing a discontinuity between current AI and the next paradigm, and I was talking about the gap between when AI-of-that-next-paradigm is importantly useful versus when it’s ASI. For example, AI-of-that-next-paradigm might arguably already exist today but where it’s missing key pieces such that it barely works on toy models in obscure arxiv papers. Or here’s a more concrete example: Take the “RL agent” line of AI research (AlphaZero, MuZero, stuff like that), which is quite different from LLMs (e.g. “training environment” rather than “training data”, and there’s nothing quite like self-supervised pretraining (see here)). This line of research has led to great results on board games and videogames, but it’s more-or-less economically useless, and certainly useless for alignment research, societal resilience, capabilities research, etc. If it turns out that this line of research is actually much closer to how future ASI will work at a nuts-and-bolts level than LLMs are (for the sake of argument), then we have not yet crossed the “AI-of-that-next-paradigm is importantly useful” threshold in my sense.
If it helps, here’s a draft paragraph from that (hopefully) forthcoming post:
Another possible counter-argument from a prosaic-AGI person would be: “Maybe this future paradigm exists, but LLM agents will find it, not humans, so this is really part of that ‘AIs-doing-AI-R&D’ story like I’ve been saying”. I have two responses. First, I disagree with that prediction. Granted, probably LLMs will be a helpful research tool involved in finding the new paradigm, but there have always been helpful research tools, from PyTorch to arXiv to IDEs, and I don’t expect LLMs to be fundamentally different from those other helpful research tools. Second, even if it’s true that LLMs will discover the new paradigm by themselves (or almost by themselves), I’m just not sure I even care. I see the pre-paradigm-shift AI world as a lesser problem, one that LLM-focused AI alignment researchers (i.e. the vast majority of them) are already focusing on. Good luck to them. And I want to talk about what happens in the strange new world that we enter after that paradigm shift.
Next:
If so, that implies extremely fast takeoff, correct? Like on the order of days from AI that can do important things to full-blown superintelligence?
Well, even if you have an ML training plan that will yield ASI, you still need to run it, which isn’t instantaneous. I dunno, it’s something I’m still puzzling over.
…But yeah, many of my views are pretty retro, like a time capsule from like AI alignment discourse of 2009. ¯\_(ツ)_/¯
Have you seen the classic parody article “On the Impossibility of Supersized Machines”?
I think it’s possible to convey something pedagogically useful via those kinds of graphs. They can be misinterpreted, but this is true of many diagrams in life. I do dislike the “you are here” dot that Tim Urban added recently.
Are you suggesting that e.g. “R&D Person-Years 463205–463283 go towards ensuring that the AI has mastery of metallurgy, and R&D Person-Years 463283–463307 go towards ensuring that the AI has mastery of injection-molding machinery, and …”?
If no, then I don’t understand what “the world is complicated” has to do with “it takes a million person-years of R&D to build ASI”. Can you explain?
…Or if yes, that kind of picture seems to contradict the facts that:
This seems quite disanalogous to how LLMs are designed today (i.e., LLMs can already answer any textbook question about injection-molding machinery, but no human doing LLM R&D has ever worked specifically on LLM knowledge of injection-molding machinery),
This seems quite disanalogous to how the human brain was designed (i.e., humans are human-level at injection-molding machinery knowledge and operation, but Evolution designed human brains for the African Savannah, which lacked any injection-molding machinery).
Fun fact: AI-2027 estimates that getting to ASI might take the equivalent of a 100-person team of top human AI research talent working for tens of thousands of years.
I’m curious why ASI would take so much work. What exactly is the R&D labor supposed to be doing each day, that adds up to so much effort? I’m curious how people are thinking about that, if they buy into this kind of picture. Thanks :)
(Calculation details: For example, in October 2027 of the AI-2027 modal scenario, they have “330K superhuman AI researcher copies thinking at 57x human speed”, which is 1.6 million person-years of research in that month alone. And that’s mostly going towards inventing ASI, I think. Did I get that right?)
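(Spelling out that arithmetic, and taking the 330K-copies figure as holding for the whole month: $330{,}000 \times 57 \times \tfrac{1}{12}\text{ yr} \approx 1.6$ million person-years, i.e. that single month alone is already the equivalent of a 100-person team working for roughly 16,000 years.)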
(My own opinion, stated without justification, is that LLMs are not a paradigm that can scale to ASI, but after some future AI paradigm shift, there will be very very little R&D separating “this type of AI can do anything importantly useful at all” and “full-blown superintelligence”. Like maybe dozens or hundreds of person-years, or whatever, as opposed to millions. More on this in a (hopefully) forthcoming post.)
It’s pretty arbitrary, I tried to explain this point via a short fictional story here.
Gaussian units have only M, L, T base units, with nothing extra for electromagnetism.
There are practical tradeoffs involved in how many units you use—basically adding units gives you more error-checking at the expense of more annoyance. See the case of radians that I discuss here.
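As a concrete illustration of that tradeoff, here’s a minimal made-up dimension tracker (not any real units library) where angle is treated as a fourth base dimension alongside M, L, T: you gain error-checking (sin() can refuse a bare number or a non-angle), and you pay in bookkeeping (arc length r·θ now carries a stray radian that you have to strip off by convention).

```python
# Minimal made-up dimension tracker (not a real library): exponents over (mass, length, time, angle).
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple   # exponents for (M, L, T, angle)

    def __add__(self, other):
        if self.dims != other.dims:
            raise TypeError(f"dimension mismatch: {self.dims} vs {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        if isinstance(other, Quantity):
            return Quantity(self.value * other.value,
                            tuple(a + b for a, b in zip(self.dims, other.dims)))
        return Quantity(self.value * other, self.dims)
    __rmul__ = __mul__

meter  = Quantity(1.0, (0, 1, 0, 0))
radian = Quantity(1.0, (0, 0, 0, 1))
degree = Quantity(math.pi / 180, (0, 0, 0, 1))   # stored internally as radians

def sin(angle):
    # The error-checking benefit: refuses bare floats and anything that isn't an angle.
    if not isinstance(angle, Quantity) or angle.dims != (0, 0, 0, 1):
        raise TypeError("sin() wants an angle with explicit units")
    return math.sin(angle.value)

print(sin(90 * degree), sin((math.pi / 2) * radian))   # both 1.0; sin(1.5708) would raise
arc = (2 * meter) * (90 * degree)   # the annoyance: dims are now (0, 1, 0, 1), meter*radian,
print(arc)                          # and you need an explicit convention to drop the radian.
```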