Thoughts on “AI is easy to control” by Pope & Belrose
Quintin Pope & Nora Belrose have a new “AI Optimists” website, along with a new essay “AI is easy to control”, arguing that the risk of human extinction due to future AI (“AI x-risk”[1]) is a mere 1% (“a tail risk worth considering, but not the dominant source of risk in the world”). (I’m much more pessimistic.) It makes lots of interesting arguments, and I’m happy that the authors are engaging in substantive and productive discourse, unlike the ad hominem vibes-based drivel which is growing increasingly common on both sides of the AI x-risk issue in recent months.
This is not a comprehensive rebuttal or anything, but rather picking up on a few threads that seem important for where we disagree, or where I have something I want to say.
(Note: Nora has a reply here.)
Summary / table-of-contents:
Note: I think Sections 1 & 4 are the main reasons that I’m much more pessimistic about AI x-risk than Pope & Belrose, whereas Sections 2 & 3 are more nitpicky.
Section 1 argues that even if controllable AI has an “easy” technical solution, there are still good reasons to be concerned about AI takeover, because of things like competition and coordination issues, and in fact I would still be overall pessimistic about our prospects.
Section 2 talks about the terms “black box” versus “white box”.
Section 3 talks about what if anything we learn from “human alignment”, including some background on how I think about human innate drives.
Section 4 argues that pretty much the whole essay would need to be thrown out if future AI is trained in a substantially different way from current LLMs. If this strikes you as a bizarre unthinkable hypothetical, yes I am here to tell you that other types of AI do actually exist, and I specifically discuss the example of “brain-like AGI” (a version of actor-critic model-based RL), spelling out a bunch of areas where the essay makes claims that wouldn’t apply to that type of AI, and more generally how it would differ from LLMs in safety-relevant ways.
1. Even if controllable AI has an “easy” technical solution, I’d still be pessimistic about AI takeover
Most of Pope & Belrose’s essay is on the narrow question of whether the AI control problem has an easy technical solution. That’s great! I’m strongly in favor of arguing about narrow questions. And after this section I’ll be talking about that narrow question as well. But the authors do also bring up the broader question of whether AI takeover is likely to happen, all things considered. These are not the same question; for example, there could be an easy technical solution, but people don’t use it.
So, for this section only, I will assume for the sake of argument that there is in fact an easy technical solution to the AI control and/or alignment problem. Unfortunately, in this world, I would still think future catastrophic takeover by out-of-control AI is not only plausible but likely.
Suppose someone makes an AI that really really wants something in the world to happen, in the same way a person might really really want to get out of debt, or Elon Musk really really wants for there to be a Mars colony—including via means-end reasoning, out-of-the-box solutions, inventing new tools to solve problems, and so on. Then, if that “want” is stronger than the AI’s desires and habits for obedience and norm-following (if any), and if the AI is sufficiently capable, then the natural result would be an AI that irreversibly escapes human control—see instrumental convergence.
But before we get to that, why might we suppose that someone might make an AI that really really wants something in the world to happen? Well, lots of reasons:
People have been trying to do that since the dawn of AI.
Humans often really really want something in the world to happen (e.g., for there to be more efficient solar cells, for my country to win the war, to make lots of money, to do a certain very impressive thing that will win fame and investors and NeurIPS papers, etc.), and one presumes that some of those humans will reason “Well, the best way to make X happen is to build an AI that really really wants X to happen”. You and I might declare that these people are being stupid, but boy, people do stupid things every day.
As AI advances, more and more people are likely to have an intuition that it’s unethical to exclusively make AIs that have no rights and are constitutionally subservient with no aspirations of their own. This is already starting to happen. I’ll put aside the question of whether or not that intuition is justified.
Some people think that irreversible AGI catastrophe cannot possibly happen regardless of the AGI’s motivations and capabilities, because of [insert stupid reason that doesn’t stand up to scrutiny], or will be prevented by [poorly-thought-through “guardrail” that won’t actually work]. One hopes that the number of such people will go down with time, but I don’t expect it to go to zero.
Some people want to make AGI as capable and independent as possible, even if it means that humanity will go extinct, because “AI is the next step of evolution” or whatever. Mercifully few people think that! But they do exist.
Sometimes people do things just to see what would happen (cf. chaos-GPT).
So now I wind up with a strong default assumption that the future world will have both AIs under close human supervision and out-of-control consequentialist AIs ruthlessly seeking power. So, what should we expect to happen at the end of the day? It depends on offense-defense balance, and regulation, and a host of other issues. This is a complicated topic with lots of uncertainties and considerations on both sides. As it happens, I lean pessimistic that humanity will survive; see my post What does it take to defend the world against out-of-control AGIs? for details. Again, I think there’s a lot of uncertainty, and scope for reasonable people to disagree—but I don’t think one can think carefully and fairly about this topic and wind up with a probability as low as 1% that there will ever be a catastrophic AI takeover.
2. Black-and-white (box) thinking
The authors repeat over and over that AIs are “white boxes” unlike human brains which are “black boxes”. I was arguing with Nora about this a couple months ago, and Charles Foster also chimed in with a helpful perspective on twitter, arguing convincingly that the terms “white box” and “black box” are used differently in different fields. My takeaway is: I’m sick of arguing about this and I really wish everybody would just taboo the words “black box” and “white box”—i.e., say whatever you want to say without using those particular words.
So, here are two things that I hope everybody can agree with:
(A) Hopefully everyone on all sides can agree that if my LLM reliably exhibits a certain behavior—e.g. it outputs “apple” after a certain prompt—and you ask me “Why did it output ‘apple’, rather than ‘banana’?”, then it might take me decades of work to give you a satisfying intuitive answer. In this respect, LLMs are different from many other engineered artifacts[2] such as bridges and airplanes. For example, if an airplane reliably exhibits a certain behavior (let’s say, it tends to pitch left in unusually low air pressure), and you ask me “why does it exhibit that behavior?” then it’s a safe bet that the airplane designers could figure out a satisfying intuitive answer pretty quickly (maybe immediately, maybe not, but certainly not decades). Likewise, if a non-ML computer program, like the Linux kernel, reliably exhibits a certain behavior, then it’s a safe bet that there’s a satisfying intuitive answer to “why does the program do that”, and that the people who have been writing and working with the source code could generate that answer pretty quickly, often in minutes. (There are buggy behaviors that take many person-years to understand, but they make good stories partly because they are so rare.)
(B) Hopefully everyone on all sides can agree that if you train an LLM, then you can view any or all of the billions of weights and activations, and you can also perform gradient descent on the weights. In this respect, LLMs are different from biological intelligence, because biological neurons are far harder to measure and manipulate experimentally. Even mice have orders of magnitude too many neurons to measure and manipulate the activity of all of them in real time, and even if you could, you certainly couldn’t perform gradient descent on an entire mouse brain.
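To make (B) concrete, here is a minimal sketch (assuming PyTorch and the Hugging Face `transformers` library, with the small “gpt2” model as a stand-in): every weight and every activation can be read out, and we can backpropagate through all of them—none of which has any analogue for a biological brain.

```python
# Minimal sketch of point (B): all weights and activations are inspectable,
# and we can compute gradients with respect to every weight.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The opposite of up is", return_tensors="pt")
out = model(**inputs, labels=inputs["input_ids"], output_hidden_states=True)

print(sum(p.numel() for p in model.parameters()))   # every weight is visible...
print(out.hidden_states[6][0, -1, :5])              # ...and so is every activation

out.loss.backward()                                  # and we can take gradients
print(model.transformer.h[6].mlp.c_fc.weight.grad.norm())
```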
Again, I hope we can agree on those two things (and similar), even if we disagree about what those facts imply about AI x-risk. For the record, I don’t think either of the above bullet points by itself should be sufficient to make someone feel optimistic or pessimistic about AI x-risk. But they can be an ingredient in a larger argument. So can we all stop arguing about whether LLMs are “black box” or “white box”, and move on to the meaningful stuff, please?
3. What lessons do we learn from “human alignment” (such as it is)?
The authors write:
If we could observe and modify everything that’s going on in a human brain, we’d be able to use optimization algorithms to calculate the precise modifications to the synaptic weights which would cause a desired change in behavior. Since we can’t do this, we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults. We provide role models for children to imitate, along with rewards and punishments that are tailored to their innate, evolved drives. In essence, we are poking and prodding at the human brain’s learning algorithms from the outside, instead of directly engineering those learning algorithms.
It’s striking how well these black box alignment methods work: most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social. But human alignment is also highly imperfect. Lots of people are selfish and anti-social when they can get away with it, and cultural norms do change over time, for better or worse. Black box alignment is unreliable because there is no guarantee that an intervention intended to change behavior in a certain direction will in fact change behavior in that direction. Children often do the exact opposite of what their parents tell them to do, just to be rebellious.
I think there are two quite different stories here which are confusingly tangled together.
Story 1: “Humans have innate, evolved drives that lead to them wanting to be prosocial, fit into their culture, imitate role models, etc., at least to some extent.”
Story 2: “Human children are gradually sculpted into kind and productive adults by parents and society providing rewards and punishments, and controlling their life experience in other ways.”
I basically lean towards Story 1 for reasons in my post Heritability, Behaviorism, and Within-Lifetime RL.
There are some caveats—e.g. parents can obviously “sculpt” arbitrary children into unkind and unproductive adults by malnourishing them, or by isolating them from all human contact, or by exposing them to lead dust, etc. But generally, the sentence “we are forced to resort to crude and error-prone tools for shaping young humans into kind and productive adults” sounds almost as absurd to my ears as if the authors had written “we are forced to resort to crude and error-prone tools for shaping young humans into adults that have four-chambered hearts”. The credit goes to evolution, not “us”.
So what does that imply for AI x-risk? I don’t know, this is a few steps removed. But it brings us to the subject of “human innate drives”, a subject close to my (four-chambered) heart. I think the AGI-safety-relevant part of human innate drives—the part related to compassion and so on—is the equivalent of probably hundreds of lines of pseudocode, and nobody knows what they are. I think it would be nice if we did, and that happens to be a major research interest of mine. If memory serves, Quintin has kindly wished me luck in figuring this out. But the article here seems to strongly imply that it hardly matters, as we can easily get AI alignment and control regardless.
Instead, the authors make a big deal out of the fact that human innate drives are relatively simple (I think they mean “simple compared to a modern big trained ML model”, which I would agree with). I’m confused why that matters. Who cares if there’s a simple solution, when we don’t know what it is?
I think maybe the idea is that we’re approximating human innate drives via the RLHF reward function, so the fact that human innate drives are simple should give us confidence that the RLHF reward function (with its comparatively abundant amount of free parameters and training data) will accurately capture human innate drives? If so, I strongly disagree with the premise: The RLHF reward function is not approximating human innate drives. Instead it’s approximating the preferences of human adults, which are not totally unrelated to human innate drives, but sure aren’t the same thing. For example, here’s what an innate drive might vaguely look like for laughter—it’s this weird thing involving certain hormone levels and various poorly-studied innate signaling pathways in the hypothalamus (if I’m right). Compare that to a human adult’s sense of humor. The RLHF reward function is approximately capturing the latter (among many other things), but it has little relation to the former.
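For concreteness, here is roughly what the RLHF reward function’s training signal looks like—a minimal sketch of the standard Bradley-Terry pairwise-comparison loss (the function and variable names are mine). The point is just that its training data is “which of these two outputs did an adult rater prefer”, i.e. it is fit to adult preferences, not to whatever innate hypothalamic circuitry those preferences ultimately grew out of.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Standard pairwise RLHF reward-model loss (Bradley-Terry style).

    r_chosen / r_rejected: the scalar rewards the model assigns to the completion
    the adult human rater preferred vs. the one they rejected, for the same prompt.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```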
Again, so what? Is this an argument for AI doom? No. I’m making a narrow argument against some points raised in this post. If you want to argue that the RLHF reward function does a good job of capturing the preferences of human adults, then by all means, make that argument directly. I might even agree. But let’s leave human innate drives out of it.
4. Can we all agree in advance to disavow this whole “AI is easy to control” essay if future powerful AI is trained in a meaningfully different way from current LLMs?
My understanding is that the authors expect the most powerful future AI training approaches to be basically similar to what’s used in today’s Large Language Models—autoregressive prediction of human-created text and/or other data, followed by RLHF fine-tuning or similar.
As it happens, I disagree. But if the authors are right, then … I don’t know. “AI x-risk in the scenario that future transformative AI is trained in a similar way as current LLMs” is not really my area of expertise or interest. I don’t expect that scenario to actualize, so I have difficulty thinking clearly about it—like if someone says to me “Imagine a square circle, and now answer the following questions about it…”. Anyway, if we’re specifically talking about future AI whose training is basically the same as modern LLMs, then a lot of the optimistic takes in the essay would seem pretty plausible to me. But I also often read more pessimistic narratives, and those takes sound pretty plausible to me too!! I don’t really know how I feel. I’ll step aside and leave that debate to others.
So anyway, if the authors think that future transformative AI will be trained much like modern LLMs, then that’s a fine thing for them to believe—even if I happen to disagree. Lots of reasonable people believe that. And I think these authors in particular believe it for interesting and well-considered reasons, not just “ooh, chatGPT is cool!”. I don’t want to argue about that—we’ll find out one way or the other, sooner or later.
But it means that the post is full of claims and assumptions that are valid for current LLMs (or for future AI which is trained in basically the same way as current LLMs) but not for other kinds of AI. And I think this is not made sufficiently clear. In fact, it’s not even mentioned.
Why is this a problem? Because there are people right now trying to build transformative AI using architectures and training approaches that are quite different from LLMs, in safety-relevant ways. And they are reading this essay, and they are treating it as further confirmation that what they’re doing is fine and (practically) risk-free. But they shouldn’t! This essay just plain doesn’t apply to what those people are doing!! (For a real-life example of such a person, see here & here.)
So I propose that the authors should state clearly and repeatedly that, if the most powerful future AI is trained in a meaningfully different way from current LLMs, then they disavow their essay (and, I expect, much of the rest of their website). If the authors are super-confident that that will never happen, because LLM-like approaches are the future, then such a statement would be unimportant—they’re really not conceding anything, from their own perspective. But it would be really important from my perspective!
4.1 Examples where the essay is making points that don’t apply to “brain-like AGI” (≈ actor-critic model-based RL)
I’ll leave aside the many obvious examples throughout the essay where the authors use properties of current LLMs as direct evidence about the properties of future powerful AI. Here are some slightly-less-obvious examples:
Since AIs are white boxes, we have full control over their “sensory environment” (whether that consists of text, images, or other modalities).
As a human, I can be sitting in bed, staring into space, and I can think a specific abstruse thought about string theory, and now I’ve figured out something important. If a future AI can do that kind of thing, as I expect, then it’s not so clear that “controlling the AI’s sensory environment” is really all that much control.
If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.
A human can harbor a secret desire for years, never acting on it, and their brain won’t necessarily overwrite that desire, even as they think millions of thoughts in the meantime. So evidently, the argument above is inapplicable to human brains. An interesting question is, where does it go wrong? My current guess is that the main problem is that the “desires” of actor-critic RL agents are (by necessity) mainly edited by TD learning, which I think of as generally a much cruder tool than gradient descent.
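To illustrate what I mean by “cruder”, here is a minimal sketch of a tabular TD(0) critic update (generic RL pseudocode of my own, not anyone’s specific brain-like-AGI proposal): the valuations that constitute the agent’s “desires” are only nudged by a scalar error signal at whichever states actually get visited, rather than being reshaped end-to-end by gradient descent against a global objective.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One TD(0) update to the critic's value table V (state -> estimated value)."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error  # only the visited state's valuation gets edited
    return td_error
```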
We can run large numbers of experiments to find the most effective interventions, and we can also run it in a variety of simulated environments and test whether it behaves as expected both with and without the cognitive intervention. Each time the AI’s “memories” can be reset, making the experiments perfectly reproducible and preventing the AI from adapting to our actions, very much unlike experiments in psychology and social science.
That sounds nice, but brain-like AGI (like most RL agents) does online learning. So if you run a bunch of experiments, then as soon as the AGI does anything whatsoever (e.g. reads the morning newspaper), your experiments are all invalid (or at least, open to question), because now your AGI is different than it was before (different ANN weights, not just different environment / different prompt). Humans are like that too, but LLMs are not.
When it comes to AIs, we are the innate reward system.
I have no idea how I’m supposed to interpret this sentence for brain-like AGI, such that it makes any sense at all. Actually, I’m not quite sure what it means even for LLMs!
4.2 No, “brain-like AGI” is not trained similarly to LLMs
This seems really obvious to me, but evidently it’s controversial, so let’s walk through some example differences. None of these are trying to prove some point about AI alignment and control being easy or hard; instead I am making the narrow point that the safety/danger of future LLMs is a different technical question than the safety/danger of hypothetical future brain-like AGI.
Brains can imitate, but do so in a fundamentally different way from LLM pretraining. Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.) That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”) Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.
Relatedly, brains have a distinction between expectations and desires, cleanly baked into the algorithms. I think this is obvious common sense, leaving aside galaxy-brain Free-Energy-Principle takes which try to deny it. By contrast, there isn’t any distinction between “the LLM expects the next token to be ‘a’” and “the LLM wants the next token to be ‘a’”. (Or if there is, it’s complicated and emergent and controversial, rather than directly and straightforwardly reflected in the source code, as I claim it would be in brain-like AGI.) So this is another disanalogy, and one with obvious relevance to technical arguments about safety.
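As a toy illustration of that architectural point (my own framing, with made-up module names): in an actor-critic model-based agent, “what I expect to happen” and “how much I want it to happen” live in separate components by construction, whereas a pretrained LLM exposes a single next-token distribution with no such built-in split.

```python
import torch.nn as nn

class ActorCriticAgent(nn.Module):
    """Expectations and desires are separate modules, visible in the source code."""
    def __init__(self, d: int):
        super().__init__()
        self.world_model = nn.Linear(d, d)  # expectation: predicted next state
        self.critic = nn.Linear(d, 1)       # desire: how good a state seems (value)
        self.actor = nn.Linear(d, d)        # action proposals

class ToyLLM(nn.Module):
    """One output head: P(next token). Any wants-vs-expects split is emergent at best."""
    def __init__(self, d: int, vocab: int):
        super().__init__()
        self.trunk = nn.Linear(d, d)
        self.lm_head = nn.Linear(d, vocab)
```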
In brains, online learning (editing weights, not just context window) is part of problem-solving. If I ask a smart human a hard science question, their brain may chug along from time t=0 to t=10 minutes, as they stare into space, and then out comes an answer. After that 10 minutes, their brain is permanently different than it was before (i.e., different weights)—they’ve figured things out about science that they didn’t previously know. Not only that, but the online-learning (weight editing) that they did during time 0<t<5 minutes is absolutely critical for the further processing that happens during time 5<t<10 minutes. This is not how today’s LLMs work—LLMs don’t edit weights in the course of “thinking”. I think this is safety-relevant for a number of reasons, including whether we can expect future AI to get rapidly more capable in an open-ended way without new human-provided training data (related discussion).
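Here is a toy sketch of that contrast (assumptions and names mine): the brain-like agent takes gradient steps on its own weights in the middle of “thinking”, so later reasoning builds on weight edits made by earlier reasoning, whereas the LLM-style solver only ever accumulates activations/context while its weights stay frozen.

```python
import torch
import torch.nn as nn

def brainlike_solve(net: nn.Module, x, target, steps=10, lr=1e-2):
    """Online learning during problem-solving: weights change mid-episode."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        loss = (net(x) - target).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()  # step t's edits feed step t+1
    return net(x)

def llm_style_solve(net: nn.Module, x, steps=10):
    """Frozen weights: all 'thinking' is forward passes over evolving activations."""
    with torch.no_grad():
        h = x
        for _ in range(steps):
            h = net(h)  # activations evolve, weights never do
        return h
```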
I want to reiterate that I’m delighted that people are contingency-planning for the possibility that future transformative AI will be LLM-like. We should definitely be doing that. But we should be very clear that that’s what we’re doing.
[1] “x-risk” is not quite synonymous with “extinction risk”, but they’re close enough in the context of the stuff I’m talking about here.
[2] Note that I said “many other engineered artifacts”, not “all”. The other examples I can think of tend to be in biology. For example, if I selectively breed a cultivar of cabbage that has lower irrigation needs, and I notice that its stalks are all a weird color, I may have no idea why, and it may take a decades-long research project to figure it out. Or as another example, there are many pharmaceutical drugs that are effective, but where nobody knows why they are effective, even after extraordinary efforts to figure it out.
(Didn’t consult Quintin on this; I speak for myself)
I flatly deny that our arguments depend on AGI being anything like an LLM. I think the arguments go through in a very wide range of scenarios, basically as long as we’re using some kind of white-box optimization to align them, rather than e.g. carrot-and-stick incentives or prompt engineering. Even if we only relied on prompt engineering, I think we’d be in a better spot than with humans (because we can run many controlled experiments).
I’m pretty confused by this claim. Why should we expect the human reward system to overwrite all secret desires? Also how do we know it’s not doing that? Your desires are just causal effects of a bunch of stuff including your reward circuitry.
This is just generally a pretty weak argument. You don’t seem to be contesting the fact that we have full sensory control for AI and we don’t have full sensory control for humans. It’s just a claim that this doesn’t matter. Maybe this ends up being a brute clash of intuitions, but it seems obvious to me that full sensory control matters a lot, even if the AI is doing a lot of long running cognition without supervision.
With AI we can choose to cut its reasoning short whenever we want, force it to explain itself in human language, roll it back to a previous state, etc. We just have a lot more control over this ongoing reasoning process for AIs and it’s baffling to me that you seem to think this mostly doesn’t matter.
You can just include online learning in your experimentation loop. See what happens when you let the AI online learn for a bit in different environments. I don’t think online learning changes the equation very much. It’s known to be less stable than offline RL, but that instability hurts capabilities as well as alignment, so we’d need a specific argument that it will hurt alignment significantly more than capabilities, in ways that we wouldn’t be able to notice during training and evaluation.
It just means we are directly updating the AI’s neural circuitry with white box optimizers. This will be true across a very wide range of scenarios, including (IIUC) your brain-like AGI scenario.
I don’t see why any of the differences you listed are relevant for safety.
I basically deny this, especially if you’re stipulating that it’s a “clean” distinction. Obviously folk psychology has a fuzzy distinction between beliefs and desires in it, but it’s also well-known both in common sense and among neuroscientists etc. that beliefs and desires get mixed up all the time and there’s not a particularly sharp divide.
What about Section 1?
Our 1% doom number excludes misuse-flavored failure modes, so I considered it out of scope for my response. I think the fact that good humans have been able to keep rogue bad humans more-or-less under control for millennia is strong evidence that good AIs will be able to keep rogue AIs under control, and I think the evidence is pretty mixed on whether the so-called offense-defense balance will be skewed toward offense or defense— I weakly expect defense will be preferred, mainly through centralization-of-power effects.
It’s not strong evidence; it’s a big mess, and it seems really difficult to have any kind of confidence in such a fast-changing world. It feels to me that it’s going to be a roughly 50⁄50 bet. Saying the probability is 1% requires much more work that I’m not seeing, even if I appreciate the effort you are putting in.
On the offense-defense balance, there is no clear winner in the comment sections here, nor here. We’ve already seen a takeover between two different, roughly equal human civilizations (see the story of the conquistadors) under certain circumstances. And AGI is at least more dangerous than nuclear weapons, and we came pretty close to nuclear war several times. Covid seems to have come from gain-of-function research, etc.
On fast vs. slow takeoff, it seems to me that fast takeoff breaks a lot of your assumptions, and I would assign much more than a 1% probability to fast takeoff. Even if you embrace the compute-centric framework (which I find conservative), you still get wild numbers, like a double-digit probability of a takeoff lasting less than a year. If so, we won’t have the time to implement defense strategies.
I don’t think it makes sense to “revert to a uniform prior” over {doom, not doom} here. Uniform priors are pretty stupid in general, because they’re dependent on how you split up the possibility space. So I prefer to stick fairly close to the probabilities I get from induction over human history, which tell me p(doom from unilateral action) << 50%
I strongly disagree that AGI is “more dangerous” than nukes; I think this equivocates over different meanings of the term “dangerous,” and in general is a pretty unhelpful comparison.
I find foom pretty ludicrous, and I don’t see a reason to privilege the hypothesis much.
From the linked report:
I just agree with this (if “significantly” means like 5x or something), but I wouldn’t call it “foom” in the relevant sense. It just seems orthogonal to the whole foom discussion.
I’m not using a uniform prior; the 50⁄50 thing is just me expressing my views, all things considered.
I’m using a decomposition of the type:
Does it want to harm us? Yes, because of misuse, ChaosGPT, wars, psychopaths shooting up schools, etc.
Can it harm us? This is really hard to tell.
Okay. Let’s be more precise: “An AGI that has the power to launch nukes is at least more powerful than nukes.” Okay, and now, how would AGI acquire this power? That doesn’t seem that hard in the present world. You can bribe/threaten leaders, use drones to kill a leader during a public visit, and then help someone to gain power and become your puppet during the period of confusion à la conquistadors. The game of thrones is complex and brittle; this list of coups is rather long, and the default for a civilization/family reigning in some kingdom is to be overthrown.
I don’t like the word “doom”. I prefer to use the expression ‘irreversibly messed up future’, inspired by Christiano’s framing (and because of anthropic arguments, it’s meaningless to look at past doom events to compute this probability).
I’m really not sure what the reference class should be here. Yes, you are still alive and human civilization is still here, but:
Napoleon and Hitler are examples of unilateral actions that led to international wars.
If you go from unilateral action to multilateral actions, and you allow stuff like collusion, things become easier. And collusion is not that wild; we already see this in Cicero: the AI, playing as France, conspired with Germany to trick England.
As the saying goes: “AI is a wonderful tool for the betterment of humanity; AGI is a potential successor species.” So maybe the reference class is more something like chimps, neanderthals or horses. Another reference class could be something like Slave rebellion.
We don’t need the strict MIRI-like RSI foom to get in trouble. I’m saying that if AI technology does not have the time to percolate through the economy, we won’t have the time to upgrade our infrastructure and add much more defense than what we have today, which seems to be the default.
I disagree; anthropics is pretty normal (https://www.lesswrong.com/posts/uAqs5Q3aGEen3nKeX/anthropics-is-pretty-normal)
Why? Like, what law of nature says that this trend should continue?
Game theory
Yes, but available strategies can change for AI vs humans—why assume they will be the same?
Induction from history depends on its interpretation—we have more information than 1111111111 over {bad, not-so-bad}. It just feels like, at this point, the crux between optimists and doomers is not about whether white-box access or the trained mind-space is better, but about how much it all updates you from what prior.
What follows is a note I wrote responding to the AI Optimists essay, explaining where I agree and disagree. I was thinking about posting this somewhere, so I figure I’ll leave it in the comments here. (So to be clear, it’s responding to the AI Optimists essay, not responding to Steven’s post.)
Places I think AI Optimists and I agree:
We have a number of advantages for aligning NNs that we don’t have for humans: white box access, better control over training environments and inputs, better control over the reward signal, and better ability to do research about which alignment techniques are most effective.
Evolution is a misleading analogy for many aspects of the alignment problem; in particular, gradient-based optimization seems likely to have importantly different training dynamics from evolution, like making it harder to gradient hack your training process into retaining cognition which isn’t directly useful for producing high-reward outputs during training.
Humans end up with learned drives, e.g. empathy and revenge, which are not hard-coded into our reward systems. AI systems also have not-strictly-optimal-for-their-training-signal learned drives like this.
It shouldn’t be difficult for AI systems to faithfully imitate human value judgements and uncertainty about those value judgements.
Places I think we disagree, but I’m not certain. The authors of the Optimists article promise a forthcoming document which addresses pessimistic arguments, and these bullet points are something like “points I would like to see addressed in this document.”
I’m not sure we’re worrying about the same regimes.
The regime I’m most worried about is:
AI systems which are much smarter than the smartest humans
These AI systems are aligned in a controlled lab environment, but then deployed into the world at-large. Many of their interactions are difficult to monitor (and are also interactions with other AI systems).
Possibly: these AI systems are highly multi-modal, including sensors which look like “camera readouts of real-world data”
It’s unclear to me whether the authors are discussing alignment in a regime like the one above, or a regime like “LLMs which are not much smarter than the smartest humans.” (I too am very optimistic about remaining safe in this latter regime.)
When they write things like “AIs are white boxes, we have full control over their ‘sensory environment’,” it seems like they’re imagining the latter regime.
They’re not very clear about what intelligence regime they’re discussing, but I’m guessing they’re talking about the ~human-level intelligence regime (e.g. because they don’t spill much ink discussing scalable oversight problems; see below).
I worry that the difference between “looks good to human evaluators” and “what human evaluators actually want” is important.
Concretely, I worry that training AI systems to produce outputs which look good to human evaluators will lead to AI systems which learn to systematically deceive their overseers, e.g. by introducing subtle errors which trick overseers into giving a too-high score, or by tampering with the sensors that overseers use to evaluate model outputs.
Note that arguments about the ease of learning human values and NN inductive biases don’t address this point — if our reward signal systematically prefers goals like “look good to evaluators” over goals like “actually be good,” then good priors won’t save us.
(Unless we do early stopping, in which case I want to hear a stronger case for why our models’ alignment will be sufficiently robust (robust enough that we’re happy to stop fine-tuning) before our models have learned to systematically deceive their overseers.)
I worry about sufficiently situationally aware AI systems learning to fixate on reward mechanisms (e.g. “was the thumbs-up button pressed” instead of “was the human happy”).
To sketch this concern out concretely, suppose an AI system is aware that it’s being fine-tuned and learned during pretraining that human overseers have a “thumbs-up” button which determines whether the model is rewarded. Suppose that so far during fine-tuning “thumbs-up button was pressed” and “human was happy” were always perfectly correlated. Will the model learn to form values around the thumbs-up button being pressed or around humans being happy? I think it’s not obvious.
Unlike before, NN inductive biases are relevant here. But it’s not clear to me that “humans are happy” will be favored over “thumbs-up button is pressed” — both seem similarly simple to an AI with a rich enough world model.
I don’t think the comparison with humans here is especially a cause for optimism: lots of humans get addicted to things, which feels to me like “forming drives around directly intervening on reward circuitry.”
For both of the above concerns, I worry that they might emerge suddenly with scale.
As argued here, “trick the overseer” will only be selected for in fine-tuning once the (pretrained) model is smart enough to do it well.
You can only form values around the thumbs-up button once you know it exists.
It seems to me that, on the authors’ view, an important input to “human alignment” is the environment that we’re trained in (rather than details of our brain’s reward circuitry, which is probably very simple). It doesn’t seem to me that environmental factors that make humans aligned (with each other) should generalize to make AI systems aligned (with humans).
In particular, I would guess that one important part of our environment is that humans need to interact with lots of similarly-capable humans, so that we form values around cooperation with humans. I also expect AI systems to interact with lots of AI systems (though not necessarily in training), which (if this analogy holds at all) would make AI systems care about each other, not about humans.
I neither have high enough confidence in our understanding of NN inductive biases, nor in the way Quintin/Nora make arguments based on said understanding, to consider these arguments as strong evidence that models won’t “play the training game” while they know they’re being trained/evaluated only to, in deployment, pursue goals they hid from their overseers.
I don’t really want to get into this, because it’s thorny and not my main source of P(doom).
A specific critique about the article:
The authors write “Some people point to the effectiveness of jailbreaks as an argument that AIs are difficult to control. We don’t think this argument makes sense at all, because jailbreaks are themselves an AI control method.” I don’t really understand this point.
The developer wanted their model to be sufficiently aligned that it would, e.g. never say racist stuff no matter what input it saw. In contrast, it takes only a little bit of adversarial pressure to produce inputs which will make the model say racist stuff. This indicates that the developer failed at alignment. (I agree that it means that the attacker succeeded at alignment.)
Part of the story here seems to be that AI systems have held-over drives from pretraining (e.g. drives like “produce continuations that look like plausible web text”). Eliminating these undesired drives is part of alignment.
The point is that it requires a human to execute the jailbreak, the AI is not the jailbreaker, and the examples show that humans can still retain control of the model.
The AI is not jailbreaking itself, here.
This link explains it better than I can:
https://www.aisnakeoil.com/p/model-alignment-protects-against
Just wanted to mention that, though this is not currently the case, there are two instances I can think of where the AI can be a jailbreaker:
Jailbreaking the reward model to get a high score. (Toy-ish example here.)
Autonomous AI agents embedded within society jailbreak other models to achieve a goal/sub-goal.
Yep, I’d really like people to distinguish between misuse and misalignment a lot more than people do currently, because they require quite different solutions.
The AI Optimists don’t make this argument AFAICT, but I think optimism about effectively utilizing “human level” models should transfer to a considerable amount of optimism about smarter-than-human models, due to the potential for using these “human level” systems to develop considerably better safety technology (e.g. alignment research). AIs might have structural advantages (speed, cost, and standardization) which make it possible to heavily accelerate R&D[1] even at around qualitatively “human level” capabilities. (That said, my overall view is that even if we had the exact human capability profile while also having ML structural advantages, these systems would themselves pose substantial (e.g. 15%) catastrophic misalignment x-risk on the “default” trajectory, because we’ll want to run extremely large numbers of these systems at high speeds.)
The idea of using human level models like this has a bunch of important caveats which mean you shouldn’t end up being extremely optimistic overall IMO[2]:
Is massive effective acceleration enough? We need safety technology to keep up with capabilities, and capabilities might also be accelerated. There is the potential for arbitrarily scalable approaches to safety which should make us somewhat more optimistic. But, it might end up being the case that to avoid catastrophe from AIs which are one step smarter than humans we need the equivalent of having the 300 best safety researchers work for 500 years, and we won’t have enough acceleration and delay to manage this. (In practice I’m somewhat optimistic here so long as we can get a 1-3 year delay at a critical point.)
Will “human level” systems be sufficiently controlled to get enough useful work? Even if systems could hypothetically be very useful, it might be hard to quickly get them actually doing useful work (particularly in fuzzy domains like alignment etc.). This objection holds even if we aren’t worried about catastrophic misalignment risk.
At least R&D which isn’t very limited by physical processes.
I think <1% doom seems too optimistic without more of a story for how we’re going to handle super human models.
Plans that rely on aligned AGIs working on alignment faster than humans would need to ensure that no AGIs work on anything else in the meantime. The reason humans have no time to develop alignment of superintelligence is that other humans develop misaligned superintelligence faster. Similarly by default very fast AGIs working on alignment end up having to compete with very fast AGIs working on other things that lead to misaligned superintelligence. Preventing aligned AGIs from building misaligned superintelligence is not clearly more manageable than preventing humans from building AGIs.
This isn’t true. It could be that making an arbitrarily scalable solution to alignment takes X cognitive resources and in practice building an uncontrollably powerful AI takes Y cognitive resources with X < Y.
(Also, this plan doesn’t require necessarily aligning “human level” AIs, just being able to get work out of them with sufficiently high productivity and low danger.)
I’m being a bit simplistic. The point is that it needs to stop being a losing or a close race, and all runners getting faster doesn’t obviously help with that problem. I guess there is some refactor vs. rewrite feel to the distinction between the project of stopping humans from building AGIs right now, and the project of getting first AGIs to work on alignment and global security in a post-AGI world faster than other AGIs overshadow such work. The former has near/concrete difficulties, the latter has nebulous difficulties that don’t as readily jump to attention. The whole problem is messiness and lack of coordination, so starting from scratch with AGIs seems more promising than reforming human society. But without strong coordination on development and deployment of first AGIs, the situation with activities of AGIs is going to be just as messy and uncoordinated, only unfolding much faster, and that’s not even counting the risk of getting a superintelligence right away.
I’m on the optimists discord and I do make the above argument explicitly in this presentation (e.g. slide 4): Reasons for optimism about superalignment (though, fwiw, Idk if I’d go all the way down to 1% p(doom), but I have probably updated something like 10% to <5%, and most of my uncertainty now comes more from the governance / misuse side).
On your points ‘Is massive effective acceleration enough?’ and ‘Will “human level” systems be sufficiently controlled to get enough useful work?’, I think conditioned on aligned-enough ~human-level automated alignment RAs, the answers to the above are very likely yes, because it should be possible to get a very large amount of work out of those systems even in a very brief amount of time—e.g. a couple of months (feasible with e.g. a coordinated pause, or even with a sufficient lead). See e.g. slides 9, 10 of the above presentation (and I’ll note that this argument isn’t new, it’s been made in variously similar forms by e.g. Ajeya Cotra, Lukas Finnveden, Jacob Steinhardt).
I’m generally reasonably optimistic about using human level-ish systems to do a ton of useful work while simultaneously avoiding most risk from these systems. But, I think this requires substantial effort and won’t clearly go well by default.
The “AI is easy to control” piece does talk about scaling to superhuman AI:
If we assume that each generation can ensure a relatively strong notion of alignment between it and the next generation, then I think this argument goes through.
However, there are weaker notions of control which are insufficient for this sort of bootstrapping argument. Suppose each generation can ensure the following weaker notion of control: “we can set up a training, evaluation, and deployment protocol with sufficient safeguards (monitoring, auditing, etc.) such that we can avoid generation N+1 AIs being capable of causing catastrophic outcomes (like AI takeover) while using those AIs to speed up the labor of generation N by a large multiple”. This notion of control doesn’t (clearly) allow the bootstrapping argument to go through. In particular, suppose that all AIs smarter than humans are deceptively aligned and they defect on humanity at the point where they are doing tasks which would be extremely hard for a human to oversee. (This isn’t the only issue, but it is a sufficient counterexample.)
This weaker notion of control can be very useful in ensuring good outcomes via getting lots of useful work out of AIs, but we will likely need to build something more scalable eventually.
(See also my discussion of using human level ish AIs to automate safety research in the sibling.)
I agree with everything you wrote here and in the sibling comment: there are reasonable hopes for bootstrapping alignment as agents grow smarter; but without a concrete bootstrapping proposal with an accompanying argument, <1% P(doom) from failing to bootstrap alignment doesn’t seem right to me.
I’m guessing this is my biggest crux with the Quintin/Nora worldview, so I guess I’m bidding for—if Quintin/Nora have an argument for optimism about bootstrapping beyond “it feels like this should work because of iterative design”—for that argument to make it into the forthcoming document.
I think this is one particularly striking example of a ubiquitous problem in alignment discussions: they become confused when the type of AI we’re talking about isn’t made clear. People are thinking of different types of AI without explicitly stating this, so they reach different conclusions about alignment. To some extent this is inevitable if we want to avoid advancing capabilities by proposing useful designs for AGI. But we could do better by distinguishing between known broad categories, in particular, agentic vs. tool AI and RL-trained vs. predictive AI. These are not sharp categories, but distinguishing what part of the spectrum we’re primarily addressing would clarify discussions.
You’ve done an admirable job of doing that in this post, and doing so seems to make sense of your disagreements with Pope’s conclusions.
Pope appears to be talking primarily about LLMs, so the extent to which his logic applies to other forms of AI is unclear. As you note, that logic does not seem to apply to AI that is agentic (explicitly goal-directed), or to actor-critic RL agents.
That is not the only problem with that essay, but it’s a big one, since the essay comes to the conclusion that AI is safe, while analyzing only one type of AI.
I agree that human ethics is not the result solely of training, but has a critical component of innate drives to be pro-social. The existence of sociopaths whose upbringing was normal is pretty compelling evidence that the genetic component is causal.
While the genetic basis of prosocial behavior is probably simple in the sense that it is coded in a limited amount of DNA information and neural circuitry, it is likely quite complex in another sense: it is evolved to work properly in the context of a very particular type of environment, that of standard human experience. As such, I find it unlikely that those mechanisms would produce an aligned agent in a very different AI training regime, nor that that alignment would generalize to very different situations than humans commonly encounter.
As you note, even if we restricted ourselves to this type of AI, and alignment was easy, that would not reduce existential risks to near 1%. If powerful AI is accessible to many, someone is going to either make mistakes or deliberately use it destructively, probably rather quickly.
Seth, I think another way to reframe this is to think of an alignment tax.
Utility = (AI capability) * (alignment loss).
Previous doom arguments were that all alignment was impossible: you could not build a machine with near-human intelligence that was aligned. Aligned in this context means “acts to further the most probable interpretation of the user’s instructions”.
Nora et al. and you concede above that it is possible to build machines with roughly human intelligence that are aligned per the above definition. So now the relationship becomes:
(Utility of the most powerful ASI that current compute can find and run) * (available resources) ⇔ (utility of the most powerful tool AI) * (available resources).
In worlds where the less capable tool AIs (probably myopic “bureaucracies” of thousands of separate modules), multiplied by their resources, have more total utility, some humans win.
In worlds where the most powerful actors give unrestricted models massive resources, or unrestricted models provide an enormous utility gain, that’s doom.
If the “alignment tax” is huge, humans eventually always lose; political campaigning buys a little time, but it’s a terminal situation for humans. Humans win some of the worlds where the tax is small.
Agree/disagree? Does this fit your model?
I agree that alignment taxes are a crucial factor in the odds of getting an alignment plan implemented. That’s why I’m focused on finding and developing promising alignment plans with low taxes.
I should note that human alignment methods work only because no human in history could suddenly start designing nanotech in their head or treating other humans as buggy, manipulable machinery. I think there are plenty of humans around who would want to become mind-controlling dictators given the possibility, or who are generally nice but would give in to temptation.
This comment seems to be assuming some kind of hard takeoff scenario, which I discount as absurdly unlikely. That said, even in that scenario, I’m pretty optimistic about our white box alignment methods generalizing fine.
You can just have a model with the capabilities of the smartest human hacker that exfiltrates itself, hacks the 1-10% of the world’s computing power with the shittiest protection, distributes itself Rosetta@home-style, and bootstraps whatever takeover plan using sheer brute force. That said, I see no reason for capabilities to land exactly on the point “smartest human hacker”, because there is nothing special about that point; it could be 2x, 5x, or 10x that, without any need to become 1000000x within a second.
And I still don’t get why! I would like to see your theory of generalization in DL that allows for such a level of optimism; “gradient descent is powerful” simply doesn’t capture it.
My general thoughts about this post, divided into sections for easier explanation:
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#1__Even_if_controllable_AI_has_an__easy__technical_solution__I_d_still_be_pessimistic_about_AI_takeover
I kinda agree with this point, but I’d say a few things here:
You correctly mention that not all AI risk is solved by AI control being easy, because AI misuse can still be a huge factor, and I agree with this point, but there are still things that change if we grant AI is easy to control:
Most AI pause policies become pretty unjustifiable by default, at least without extra assumptions, and in general a lot of AI slowdown movements like PauseAI become closer to the nuclear case, which I’d argue is quite negative. That alone would change the dynamic of a lot of AI safety, especially its nascent activist wing.
Misuse focused policy probably looks less technical, and more normal, for example Know Your Customer laws or hashing could be extremely important if we’re worried about misuse of AI for say bioterrorism.
On the balance between offense and defense, I actually think that it depends on the domain, with cyber being the easiest case for defense, bio being the worst case for defense (though one that can be drastically improved), and other fields having more balance between defense and offense. However, I agree that bio is the main reason to worry that AI will be offense-advantaged, but this is an improvable outlier, without having to restrict AI all that much, or at all.
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#2__Black_and_white__box__thinking
I agree that these words should mostly be tabooed here, and I mostly agree with the section, with the caveat that the information we get from ML models is drastically better than what we get from any human: with humans, we only have behavioral analysis to infer their values, which is a literal black box, since we only get the behavior, not the causes of the behavior. We basically never get an intuitive explanation for why a human does something, except in trivial cases.
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#3__What_lessons_do_we_learn_from__human_alignment___such_as_it_is__
I agree with some of this, but I’d say Story 1 applies only very weakly, and that the majority (or supermajority) of value learning is online, for example via the self-learning / within-lifetime RL algorithms you describe, without relying on the prior. In essence, I agree with the claim that the genes need to impose a prior, which prevents pure blank-slatism from working. I disagree with the claim that this means genetics needs to impose a very strong prior without relying on the self-learning algorithms you describe for capabilities.
This post might help you understand why; the top comment (by karma) also adds helpful context for why a lot of the complexity of value needs to be learned, rather than baked in as a prior:
https://www.lesswrong.com/posts/CQAMdzA4MZEhNRtTp/human-values-and-biases-are-inaccessible-to-the-genome
The main implication for human alignment is that deceptive alignment mostly does not work, or is easy to make not work: the complexity of the aligned solution and the deceptive solution is very similar (a difference of at most roughly 1-1000 lines of code, or 1-1000 bits), so we need very little data to discriminate between the aligned solution and the deceptive solution. This makes deceptive alignment very easy to solve, and given the massive profit incentive for solving alignment, it may already be solved by default.
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4__Can_we_all_agree_in_advance_to_disavow_this_whole__AI_is_easy_to_control__essay_if_future_powerful_AI_is_trained_in_a_meaningfully_different_way_from_current_LLMs_
For me, the answer is no, because my points apply outside of LLMs, and they can be formulated as long as the prior doesn’t completely dominate the learning process, which can certainly apply to brain-like AGI or model based RL.
It’s odd that you understood me as talking about misuse. Well, I guess I’m not sure how you’re using the term “misuse”. If Person X doesn’t follow best practices when training an AI, and they wind up creating an out-of-control misaligned AI that eventually causes human extinction, and if Person X didn’t want human extinction (as most people don’t), then I wouldn’t call that “misuse”. Would you? I would call it a “catastrophic accident” or something like that. I did mention in the OP that some people think human extinction is perfectly fine, and I guess if Person X is one of those people, then it would be misuse. So I suppose I brought up both accidents and misuse.
People who I think are highly prone to not following best practices to keep AI under control, even if such best practices exist, include people like Yann LeCun, Larry Page, Rich Sutton, and Jürgen Schmidhuber, who are either opposed to AI alignment on principle, or are so bad at thinking about the topic of AI x-risk that they spout complete and utter nonsense (example). That’s not a problem solvable by Know Your Customer laws, right? These people (and many more) are not the customers, they are among the ones doing state-of-the-art AI R&D.
In general, the more people are technically capable of making an out-of-control AI agent, the more likely that one of them actually will, even if best practices exist to keep AI under control. People like to experiment with new approaches, etc., right? And I expect the number of such people to go up and up, as algorithms improve etc. See here.
If KYC laws aren’t the answer, what is? I don’t know. I’m not advocating for any particular policy here.
Perhaps, but I want to draw a distinction between “people train AI to do good things and aren’t able to control the AI for a variety of reasons, and thus humans are extinct, made into slaves, etc.” and “people train AI to do stuff like bioterrorism, explicitly gaining power, etc., and thus humans are extinct, made into slaves, etc.” The optimal responses look very different in a world where control is easy but preventing misuse is hard versus a world where controlling AI is hard in itself. AI safety actions as currently done are optimized far more for the case where controlling AI is hard or impossible; if that’s not the case, then pretty drastic changes would need to be made in how AI safety organizations do their work, especially their nascent activist wing, and they should focus on different policies.
I note that your example of them spouting nonsense only has the full force it does if we assume that controlling AI is hard, which is what we are debating right now.
Onto my point here: my fundamental claim is that there’s a counterforce to the dynamic you describe, where more and more people become able to make an out-of-control AI agent, and that counterforce is the profit motive.
Hear me out, this will actually make sense here.
Basically, the main reason the profit motive is positive for safety is that the negative externalities of AI being uncontrollable are far, far more internalized to the person making the AI, since they suffer severe losses in profitability without getting any profit from the AI. On top of that, they also profit from developing safe control techniques (assuming control isn’t very hard), since safe techniques will probably get used in government standards for releasing AI, and there are already at least some fairly severe barriers to any release of a misaligned AGI, at least assuming there’s no treacherous turn / deceptive alignment over weeks to months.
Jaime Sevilla has a short tweet on why this is the case, and I also responded to Linch making something like the points above:
https://archive.is/wPxUV
https://twitter.com/Jsevillamol/status/1722675454153252940
https://archive.is/3q0RG
https://twitter.com/SharmakeFarah14/status/1726351522307444992
You keep talking about “prior” but not mentioning “reward function”. I’m not sure why. For human children, do you think that there isn’t a reward function? Or there is a reward function but it’s not important? Or do you take the word “prior” to include reward function as a special case?
If it’s the latter, then I dispute that this is an appropriate use of the word “prior”. For example, you can train AlphaZero to be superhumanly skilled at winning at Go, or if you flip the reward function then you’ll train AlphaZero to be superhumanly skilled at losing at Go. The behavior is wildly different, but is the “prior” different? I would say no. It’s the same neural net architecture, with the same initialization and same regularization. After 0 bits of training data, the behavior is identical in each case. So we should say it’s the same “prior”, right?
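Here’s a minimal toy sketch of what I mean (made-up code, not AlphaZero itself): flipping the sign of the reward changes nothing about the architecture, initialization, or regularization, i.e. nothing about the “prior”, even though the resulting trained behavior would be wildly different.

```python
import numpy as np

def make_network(seed=0):
    """Same architecture, same initialization, same regularization,
    regardless of which reward we later train with."""
    rng = np.random.default_rng(seed)
    return {"W1": rng.normal(size=(361, 128)), "W2": rng.normal(size=(128, 1))}

def reward(game_outcome, sign=+1):
    """game_outcome: +1 if we won the game of Go, -1 if we lost, 0 for a draw.
    sign=+1 would train a superhuman winner; sign=-1 a superhuman loser."""
    return sign * game_outcome

# After 0 bits of training data the two networks are byte-for-byte identical,
# so by any reasonable definition they share the same "prior".
net_win, net_lose = make_network(seed=0), make_network(seed=0)
assert all(np.array_equal(net_win[k], net_lose[k]) for k in net_win)

# Only the reward functions differ; self-play training (omitted here) is what
# would turn that difference into wildly different trained preferences.
print(reward(+1, sign=+1), reward(+1, sign=-1))  # 1 -1
```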
(As I mentioned in the OP, on my models, there is a human innate reward function, and it’s absolutely critical to human prosocial behavior, and unfortunately nobody knows what that reward function is.)
So what I’m trying to get at here is essentially the question “how much can we offload the complexity of values to the learning system” rather than, say, directly specifying it via the genome. In essence, I’m focused on the a priori complexity of human values and of the human innate reward function, since this variable is often a key disagreement between optimists and pessimists on controlling AI, and in particular it matters for how likely deceptive alignment is to occur relative to actual alignment, which is a huge and popular threat model.
Re the reward function, the prior discussion also sort of applies here. If the reward function is learnable, or otherwise simple to hardcode, then other reward functions will probably work just as well without relying on the human one. And if it’s outright learnable by AI, then (conditional on the reward function being simple) it’s almost certainly going to be learned before anything else, in particular before a deceptively aligned algorithm, if it’s the simpler of the two; and if not, it’s only slightly more complex, so we can easily provide the very little data needed to distinguish between the two algorithms. That’s how I view the situation with the human innate reward function.
My crux is that this statement is probably false, conditional on the reward function either being very simple to hardcode (as in a few lines, say) or being learnable by the self-learning / within-lifetime RL / online learning algorithms you consider:
“The human innate reward function is absolutely critical to human prosocial behavior.”
Putting it another way, I deny that the innate reward function in humans is the special main driver, because most of that reward function has to be learned, and that learning could be replicated by brain-like AGI / model-based RL via online learning. Thus most of the complexity does not matter, and that probably implies that most complex prosocial behavior is fundamentally replicable by a brain-like AGI / model-based RL agent without it having to have the human innate reward function.
The innate function obviously has some things hard-coded a priori, and there is some complexity in the reward function, but not nearly as much as a lot of people think, since IMO a lot of the reward function / human prosocial values are fundamentally learned, and almost certainly replicable by a brain-like AGI paradigm even if it didn’t use the exact innate reward function humans use.
Some other generalized updates I made are below; this is quoted from a Discord I’m in, credit to TurnTrout for noticing this:
I find your text confusing. Let’s go step by step.
AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw
A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
If you change the reward function (e.g. flip the sign of the reward), the resulting trained model’s preferences would be wildly different.
By analogy:
The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
Do you agree with all that?
If so, then there’s no getting around that getting the right innate reward function is extremely important, right?
So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
I agree with this statement, because the sign change directly inverts the reward, meaning that what was previously rewarded is now penalized. But my view is that this is probably unrepresentative, and that brains / brain-like AGI are much more robust than you think (though not infinitely robust) to changes in their value/reward functions, due to the very simple reward function you pointed out.
So I basically disagree with this example representing a major problem with NN/Brain-Like AGI robustness.
To respond to this:
This doesn’t actually matter for my purposes, as I only need the existence of simple reward functions, as you claimed, to conclude that deceptive alignment is unlikely to happen; I’m leaving it to the people actually aligning AI, like Nora Belrose, to implement this ideal.
Essentially, I’m focusing on the implications of the existence of simple algorithms for values, and pointing out that various alignment challenges either go away or become far easier if we grant that there is a simple reward function for values, which is very much a contested position on LW.
So I think we basically agree that there is a simple reward function for values, but I think this implies some other big changes in alignment that drastically reduce the risk of AI catastrophe, mostly by ruling out deceptive alignment as an outcome that will happen; there are various other side benefits I haven’t enumerated because they would make this comment too long.
I have an actual model now that I’ve learned more, so to answer the question below:
To answer what algorithm exactly: it could well be the same algorithm the AI uses for its capabilities, like MCTS, AlphaZero’s algorithm, or a future AGI’s capability algorithms. But the point is that the algorithm matters less than the data, especially as the data gets larger and larger, so the really important question is how to make the dataset, and that’s answered in my comment below:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg#BxNLNXhpGhxzm7heg
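As a toy illustration of what I mean by “the values live in the dataset, the learning algorithm is generic” (a minimal sketch with made-up numbers, not a real alignment proposal), here is a plain Bradley-Terry-style reward model fit by ordinary gradient ascent, which recovers whatever values generated the preference labels:

```python
import numpy as np

# Toy preference dataset: pairs of outcomes plus a label for which one the
# labeler preferred. The "true values" and all numbers here are made up.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])              # hidden values generating the labels
A = rng.normal(size=(500, 3))                    # features of outcome A
B = rng.normal(size=(500, 3))                    # features of outcome B
p_true = 1 / (1 + np.exp(-(A - B) @ true_w))     # P(A preferred) under the true values
labels = (rng.random(500) < p_true).astype(float)

# Fit a linear reward model r(x) = w.x from the preferences (Bradley-Terry),
# using nothing but generic gradient ascent: the algorithm is boring, and the
# values come entirely from the data.
w = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(A - B) @ w))
    w += 0.5 * (A - B).T @ (labels - p) / len(labels)

print(np.round(w, 2))                            # approximately recovers true_w
```

Swap in a different preference dataset and the same loop learns a different reward function; that’s the sense in which the dataset, not the algorithm, carries the values.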
I also want to point out that, as it turns out, alignment generalizes farther than capabilities, for some pretty deep reasons given at the link below. The short answer: verifying that your values were satisfied is in many cases easier than actually executing those values out in the world, combined with values data being easier to learn than other capabilities data.
The link is given below:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
In essence, what I’m doing here is unifying the capabilities and value reward functions, and pointing out that with total control of the dataset and densely defined rewards, we can prevent a lot of misaligned objectives from appearing, since the algorithm is less important than the data.
I think the key crux is that all, or almost all, of the differences are mediated by searching for different data, and if you had the ability to totally control a sociopath’s data sources, they’d learn a different reward function, one way closer to what you want the reward function to be.
If you had the ability to control people’s data and reward functions to the extent that ML people can control their models’ today, you could trivially brainwash them into accepting almost arbitrary facts and moralities, and it would be one of the most used technologies in politics.
But for alignment, this is awesome news, because it lets us control what exactly is rewarded, and what their values are like.
I strongly agree with Section 1. Even if we had aligned superintelligence, how are we going to make sure no one runs an unaligned superintelligence? A pivotal act? If so, which one? Or does defense trump offense? If so, why? Or are we still going to regulate heavily? If so, wouldn’t the same regulation be able to stop superintelligence altogether?
Would love to see an argument landing at 1% p(doom) or lower, even if alignment would be easy.
The argument is probably something like: superintelligence allows robust monitoring of other people’s attempts at building superintelligences, and could likely resolve whatever prisoner’s dilemmas push people to build them. We don’t need an authoritarian regime.
I expect an ASI can denoise easily available data enough to confidently figure out who is trying to build a superintelligence, and either stop them through soft power/argumentation or implore them to make it aligned.
I think most problems are probably easy to solve with a truly aligned superintelligence. It may come up with good solutions we haven’t even thought of.
I think you may be right that this is what people think of. It seems pretty incompatible with any open source-ish vision of AGI. But what I’m most surprised at is that people call supervision by humans dystopian/authoritarian, but call the same supervision by an ASI (apparently able to see all your data, stop anyone from doing anything, subtly manipulate anyone, etc. etc.) a utopia. What am I missing here?
Personally, by the way, I imagine a regulation regime to look like regulating a few choke points in the hardware supply chain, plus potentially limits to the hardware or data a person can possess. This doesn’t require an authoritarian regime at all, it’s just regular regulation as we have in many domains already.
In any case, the point was: is something like this going to lead to <=1% x-risk? I think it doesn’t, and definitely not mixed with a democratic / open-source AGI vision.
If this is your definition of a dystopia, we already live in a dystopia. You can’t make nuclear bombs without being picked up by the FBI/CIA, and you’ll probably be arrested in the process. Making something illegal doesn’t define an authoritarian regime. Governments already try to stop international players from building nukes. It just lacks teeth because you can ultimately live with sanctions.
The other problem is that it’s way too easy to avoid surveillance or defect in a human regime. For example, you can run a decentralized training network and claim you are training good AI. It’s also unusually easy to regulate AI training right now: GPUs are easy to control because only Nvidia can make them, but this won’t always be true. It’s also much easier to hide GPU production than nuke production, because we need GPUs and CPUs for a ton of other useful things.
Theoretically, an ASI could probably correctly extrapolate attempts to use compute from your internet signals. Further, if you have the benefits you want from an ASI, you have much less reason to build a 2nd one that’s possibly unaligned. “The digital god says you can’t build it” probably sounds a lot more compelling than “Joe Biden says you can’t build it”.
This is a good question to ask, and my general answer is a combination of two points: defense doesn’t trump offense in all domains, but they are much more balanced than LWers think, with the exception of bio, and that domain mostly doesn’t have existentially risky products. Regulation is necessary for superintelligence, but I don’t think this is anywhere near true:
No, primarily because misuse/structural issues demand very different responses, and a lot of the policy making pretty much relies on the assumption that AI is hard to control.
Much more generally, I wish people would make distinctions between existential risk caused by lack of control vs existential risk caused by misuse vs mass death caused by structural forces, since each of these has very different causes, and that matters because the policy conclusions are very different, and sometimes even opposed to each other.
Open sourcing is a case in point. It’s negative in the cases where misuse or loss of control is the dominant risk factor, but turns into a positive if we instead assume structural forces are at work, like in dr_s’s story here:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
More generally, policies for one AI-risk scenario will not automatically work for other scenarios; it’s a case-by-case basis.
I mostly agree with what you say, just registering my disagreement/thoughts on some specific points. (Note that I haven’t yet read the page you’re responding to.)
Maybe? Depends on what exactly you mean by the word “might”, but it doesn’t seem obvious to me that this would need to be the case. My intuition from seeing the kinds of interpretability results we’ve seen so far is that within less than a decade we’d already have a pretty rigorous theory and toolkit for answering these kinds of questions. At least assuming that we don’t keep switching to LLM architectures that work based on entirely different mechanisms and make all of the previous interpretability work irrelevant.
If by “might” you mean something like “there’s at least a 10% probability that this could take decades to answer”, then sure, I’d agree with that. Now I haven’t actually thought about this specific question very much before seeing it pop up in your post, so I might radically revise my intuition if I thought about it more, but at least it doesn’t seem immediately obvious to me that I should assign “it would take decades of work to answer this” a very high probability.
I would assume the intuition to be something like “if they’re simple, then given the ability to experiment on minds and access AI internals, it will be relatively easy to figure out how to make the same drives manifest in an AI; the amount of (theory + trial and error) required for that will not be as high as it would be if the drives were intrinsically complex”.
There’s something to that, but this sounds too strong to me. If someone had hypothetically spent a year observing all of my behavior, having some sort of direct read access to what was happening in my mind, and also doing controlled experiments where they reset my memory and tested what happened with some different stimulus… it’s not like all of their models would become meaningless the moment I read the morning newspaper. If I had read morning newspapers before, they would probably have a pretty good model of what the likely range of updates for me would be.
Of course, if there was something very unexpected and surprising in the newspaper, that might cause a bigger update, but I expect that they would also have reasonably good models of the kinds of things that are likely to trigger major updates or significant emotional shifts in me. If they were at all competent, that’s specifically the kind of thing that I’d expect them to work on trying to find out!
And even if there was a major shift, I think it’s basically unheard of that literally everything about my thoughts and behavior would change. When I first understood the potentially transformative impact of AGI, it didn’t change the motor programs that determine how I walk or brush my teeth, nor did it significantly change what kinds of people I feel safe around (aside from some increase in trust toward other people who I felt “get it”). I think that human brains quite strongly preserve their behavior and prediction structures, just adjusting them somewhat when faced with new information. Most of the models and predictions you’ve made about an adult will tend to stay valid, though of course with children and younger people there’s much greater change.
In some sense yes, but it does also seem to me that prediction and desire do get conflated in humans in various ways, and that it would be misleading to say that the people in question want it. For example, I think about this post by @romeostevensit often:
It’s, of course, true that for an LLM, prediction is the only thing it can do, and that humans have a system of desires on top of that. But it looks to me like a lot of human behavior is just having LLM-ish predictive models of how someone like them would behave in a particular situation, which is also the reason why conceptual reframings like the ones you can get in therapy can be so powerful (“I wasn’t lazy after all, I just didn’t have the right tools for being productive” can drastically reorient many predictions you’re making of yourself and thus your behavior). (See also my post on human LLMs, which has more examples.)
While it’s obviously true that there is a lot of stuff operating in brains besides LLM-like prediction, such as mechanisms that promote specific predictive models over other ones, that seems to me to only establish that “the human brain is not just LLM-like prediction”, while you seem to be saying that “the human brain does not do LLM-like prediction at all”. (Of course, “LLM-like prediction” is a vague concept and maybe we’re just using it differently and ultimately agree.)
I disagree with whether that distinction matters:
I think technical discussions of AI safety depend on the AI-algorithm-as-a-whole; I think “does the algorithm have such-and-such component” is not that helpful a question.
So for example, here’s a nightmare-scenario that I think about often:
(step 1) Someone reads a bunch of discussions about LLM x-risk
(step 2) They come down on the side of “LLM x-risk is low”, and therefore (they think) it would be great if TAI is an LLM as opposed to some other type of AI
(step 3) So then they think to themselves: Gee, how do we make LLMs more powerful? Aha, they find a clever way to build an AI that combines LLMs with open-ended real-world online reinforcement learning or whatever.
Even if (step 2) is OK (which I don’t want to argue about here), I am very opposed to (step 3), particularly the omission of the essential part where they should have said “Hey wait a minute, I had reasons for thinking that LLM x-risk is low, but do those reasons apply to this AI, which is not an LLM of the sort that I’m used to, but rather it’s a combination of LLM + open-ended real-world online reinforcement learning or whatever?” I want that person to step back and take a fresh look at every aspect of their preexisting beliefs about AI safety / control / alignment from the ground up, as soon as any aspect of the AI architecture and training approach changes, even if there’s still an LLM involved. :)
I dunno, I wrote “invalid (or at least, open to question)”. I don’t think that’s too strong. Like, just because it’s “open to question”, doesn’t mean that, upon questioning it, we won’t decide it’s fine. I.e., it’s not that the conclusion is necessarily wrong, it’s that the original argument for it is flawed.
Of course I agree that the morning paper thing would probably be fine for humans, unless the paper somehow triggered an existential crisis, or I try a highly-addictive substance while reading it, etc. :)
Some relevant context is: I don’t think it’s realistic to assume that, in the future, AI models will be only slightly fine-tuned in a deployment-specific way. I think the relevant comparison is more like “can your values change over the course of years”, not “can your values change after reading the morning paper?”
Why do I think that? Well, let’s imagine a world where you could instantly clone an adult human. One might naively think that there would be no more on-the-job learning ever. Instead, (one might think), if you want a person to help with chemical manufacture, you open the catalog to find a human who already knows chemical manufacturing, and order a clone of them; and if you want a person to design widgets, you go to a different catalog page, and order a clone of a human widget design expert; so on.
But I think that’s wrong.
I claim there would be lots of demand to clone a generalist—a person who is generally smart and conscientious and can get things done, but not specifically an expert in metallurgy or whatever the domain is. And then, this generalist would be tasked with figuring out whatever domains and skills they didn’t already have.
Why do I think that? Because there are just too many possible specialties, and especially combinations of specialties, for a pre-screened clone-able human to already exist in each of them. Like, think about startup founders. They’re learning how to do dozens of things. Why don’t they outsource their office supply questions to an office supply expert, and their hiring questions to a hiring expert, etc.? Well, they do to some extent, but there are coordination costs, and more importantly the experts would lack all the context necessary to understand what the ultimate goals are. What are the chances that there’s a pre-screened clone-able human that knows about the specific combination of things that a particular application needs (rural Florida zoning laws AND anti-lock brakes AND hurricane preparedness AND …)?
So instead I expect that future AIs will eventually do massive amounts of figuring-things-out in a nearly infinite variety of domains, and moreover that the figuring out will never end. (Just as the startup founder never stops needing to learn new things, in order to succeed.) So I don’t like plans where the AI is tested in a standardized way, and then it’s assumed that it won’t change much in whatever one of infinitely many real-world deployment niches it winds up in.
What I do not get is how this disagreement on p(doom) leads to different policy proposals.
If ASI has a 99% probability of killing us all, it is the greatest x-risk we face today and we should obviously be willing to postpone ASI, and possibly the singularity (to the extent that in the far future, the diameter of the region of space we colonize at any given time will be a few hundred light years less than what it would be if we focused just on capabilities now).
If ASI has a 1% probability of killing us all, it is still the (debatably) greatest x-risk we face today and we should obviously be willing to postpone ASI etcetera.
To be persuaded that building ASI is the right call, one would either have to not care about the far future (for an individual alive today, a 99% chance of living in a Culture-esque utopia would probably be worth a 1% risk of dying slightly earlier) or be given a much lower p(doom) (e.g. “p(doom)=1e-20, all the x-risk comes from god / the simulators destroying the universe once humans develop ASI, and spending a few centuries on theological research is unlikely to change that” would recommend “just build the damn thing” as a strategy).
In one sentence, the main reason it matters is that once we drop the assumption of long-termism and impose a limit on how far into the future we care, a 1% probability gives you massively different policies than a 99% probability of doom, especially if we assume that the benefits and risks are mostly symmetrical. A 1% probability implies that AI should be regulated for tail risks, but a lot of policies, like say a single organization developing AGI or a broad pause, become negative EV under certain other assumptions. 99% obviously flips the script, and massive stoppage of AI, even at the risk of causing billions of deaths, is now positive EV.
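To make the sign flip concrete with made-up numbers (and with the far future capped in value): say building soon yields a benefit $B$ if things go well and a loss $C = 10B$ if they don’t. Then

$$\mathbb{E}[\text{build now}] = (1-p)\,B - p\,C,$$

which at $p = 0.01$ is roughly $0.99B - 0.1B > 0$, but at $p = 0.99$ is roughly $0.01B - 9.9B < 0$. The same policy flips from positive to negative EV purely because of $p$.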
And this gets worse once we introduce prospect theory, which roughly argues that we overestimate how much we should react to low-probability, high-impact events, because we anchor on misleadingly high probability numbers like 1%; thus we are likely to massively overestimate the probability of AI doom, conditional on the assumption that AI is easy to control being correct.
“Strong Evidence is Common” gives a way for very low or very high probabilities to arise, because each independent bit of evidence halves or doubles the odds:
https://www.lesswrong.com/posts/JD7fwtRQ27yc8NoqS/strong-evidence-is-common
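To spell out the arithmetic (a toy snippet, nothing AI-specific): n independent pieces of evidence, each worth one bit (a 2:1 likelihood ratio), multiply the odds by 2^n, which is how you get to extreme probabilities quickly.

```python
def posterior(prior_odds: float, bits: float) -> float:
    """Posterior probability after `bits` bits of independent evidence,
    where each bit multiplies the odds by 2 (for or against)."""
    odds = prior_odds * 2 ** bits
    return odds / (1 + odds)

print(posterior(1.0, 10))    # ~0.999 after 10 bits in favor
print(posterior(1.0, -10))   # ~0.001 after 10 bits against
```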
The relevant thing is how the probability both gets clearer and improves with further research enabled by a pause. Currently, as a civilization, we are at the startled non-sapient deer stage; that’s not a position from which to decide the future of the universe.
I can make the same argument for how probability gets clearer and improves with further research enabled by not pausing, and I actually think this is the case both in general and for this specific problem, so this argument doesn’t work.
Neither this post nor the essay it’s responding to is about policy proposals. So why talk about it? Two points:
As a general principle, if there are two groups who wildly disagree about the facts on the ground, but nevertheless (coincidentally) agree about what policies they favor, then I say they should still probably try to resolve their disagreements if possible, because it’s generally good to have accurate beliefs, e.g. what if both of them are wrong? And maybe that coincidence will not always be true anyway.
It’s not true that the only choice on offer is “Should we ever build ASI? Yes or no?” In fact, that choice (per se) is not on offer at all. What there is, is a gazillion conceivable laws that could be passed, all of which have a wide and idiosyncratic array of intended and unintended consequences. Beyond that, there are a gazillion individual decisions that need to be made, like what careers to pursue, what to donate to, whether to publish or not publish particular things, whether to pursue or not pursue particular lines of research, etc. etc. I find it extraordinarily unlikely that, if Person A thinks p(doom)=99% and Person B thinks p(doom)=1%, then they’re going to agree on all these gazillions of questions. (And empirically, it seems to be clearly not the case that the p(doom)=1% people and the p(doom)=99% people agree on questions of policy.)
Aligning human-level AGIs is important to the extent there is risk it doesn’t happen before it’s too late. Similarly with setting up a world where initially aligned human-level AGIs don’t soon disempower humans (as literal humans might in the shoes of these AGIs), or fail to protect the world from misused or misaligned AGIs or superintelligences.
Then there is a problem of aligning superintelligences, and of setting up a world where initially aligned superintelligences don’t cause disempowerment of humans down the line (whether that involves extinction or not). Humanity is a very small phenomenon compared to a society of superintelligences, remaining in control of it is a very unusual situation. (Humanity eventually growing up to become a society of superintelligences while holding off on creating a society of alien superintelligences in the meantime seems like a more plausible path to success.)
Solving any of these problems doesn’t diminish importance of the others, which remain as sources of possible doom, unless they too get solved before it’s too late. Urgency of all of these problems originates from the risk of succeeding in developing AGI. Tasking the first aligned AGIs with solving the rest of the problems caused by the technology that enables their existence seems like the only plausible way of keeping up, since by default all of this likely occurs in a matter of years (from development of first AGIs). Though economic incentives in AGI deployment risk escalating the problems faster than AGIs can implement solutions to them. Just as initial development of AGIs risks creating problems faster than humans can prepare for them.
I don’t confidently disagree with this statement, but it occurs to me that I haven’t tried it myself and haven’t followed it very closely, and have sometimes heard claims that there are promising methods.
A lot of people trying to come up with answers try to do it with mechanistic interpretability, but that probably isn’t very feasible. However, investigations based on ideas like neural tangent kernels seem plausibly more satisfying and feasible. Like if you show that the dataset contains a bunch of instances that’d push it towards saying apple rather than banana, and you then investigate where those data points come from and realize that there’s actually a pretty logical story for them, then that seems basically like success.
As an example, I remember a while ago there was some paper that claimed to have found a way to attribute NN outputs to training data points, and it claimed that LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions. I didn’t read the paper so I don’t know whether it’s legit, but that sort of thing seems quite plausibly feasible a lot of the time.
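A toy sketch of the kind of attribution I have in mind (gradient-similarity scoring on a linear model, loosely in the spirit of influence functions / TracIn; everything here is made up for illustration and is not the method from that paper):

```python
import numpy as np

# Train a toy linear regression by gradient descent, then score each training
# point by how aligned its loss gradient is with the gradient of a test
# prediction: a crude "which data pushed the model toward this output?"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                         # training inputs
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

w = np.zeros(5)
for _ in range(1000):                                 # plain least-squares GD
    w -= 0.05 * X.T @ (X @ w - y) / len(y)

x_test = rng.normal(size=5)
train_grads = (X @ w - y)[:, None] * X                # per-example d(loss_i)/dw
scores = train_grads @ x_test                         # x_test = d(prediction)/dw
top = np.argsort(-np.abs(scores))[:5]
print(top, np.round(scores[top], 2))                  # most influential training rows
```

If that kind of scoring points at a coherent cluster of training data, and the cluster has a sensible story behind it, that’s roughly the “logical story for where the behavior came from” I mean above.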
Perhaps you’re thinking of the recent influence function work from Anthropic?
I don’t think that this paper either shows or claims that “LLM power-seeking was mainly caused by sci-fi stories and by AI safety discussions”. But they do find that there are influential training examples from sci-fi stories and AI safety discussion when asking the model questions about topics like this.
There is one thing that I’m worried about in the future of LLMs. It’s the basic notion that the whole is not always just the sum of its parts, and it may have very different properties.
Many people feel safe because of the properties of LLMs and how they are trained, etc., and because we are nowhere close to AGI when it comes to the other kinds of systems that seem more dangerous. What they don’t realize is that the soonest AGI likely won’t be just a next, bigger LLM.
It will likely be an amalgamation of a few models and pieces of ordinary code, including a few LLMs of different sizes and capabilities, maybe not all exactly chat-like. It will have different properties than any one of its parts, and it will be different from a single LLM. It might be more brain-like when it comes to learning and memory. Maybe not in the sense that the weights of the LLMs will change, but some inner state will change, and some more basic parts will learn or remember solutions and structure them into more complex solutions (like we remember how to drive without consciously deciding on each muscle movement or even making higher-level decisions). It will have goals, priorities, strategies, and short-term tactics, and these will be processed at a higher level than any single LLM.
Why do I think that? Because it can already be seen on the horizon if you think about things like multimodal GPT-4, GPT Engineer, the multitude of projects adding long-term memory to GPT, and the scientific work where GPT writes code to bootstrap itself into doing complex tasks like achieving goals in Minecraft. If you extrapolate that, then AGI is likely, though initially maybe not a very fast or cheap one. It is likely to be built on top of LLMs without being simply an LLM.
This sentence is adjacent to my core concern regarding AI alignment, and why I’m not particularly reassured by the difficulty-of-superhuman-performance or return-on-compute reassurances regarding AGI: we don’t need superhuman AI to deal superhuman-seeming amounts of damage. Indeed, even today’s “perfectly sandboxed” models (in the sense that, according to the most reliable publicly available information, none of the most cutting-edge models are allowed direct read/write access to the systems that would let them plot and attain world domination or the destruction of humanity, or of specific nations’ interests) have the next-best thing: whenever a new technological lever emerges in the world, humans with malicious intentions are empowered to a much greater degree than those who want strictly the best[1] for everybody. There are also bit-flip attacks on aligned AI which are much harder to implement on humans.
Using “best” is fraught but we’ll pretend that “world best-aligned with a Pareto-optimal combination of each person’s expressed reflective preferences and revealed preferences, to the extent that those revealed preferences do not represent akrasia or views and preferences which the person isn’t comfortable expressing directly and publicly but does indeed have” is an adequate proxy to continue along this line of argument; the other option is developing a provably-correct theory of morality and politics which would take more time than this comment by 2-4 orders of magnitude.
Just use the Chebyshev (aka maximum or L∞) metric.
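For concreteness, that’s the metric

$$d_\infty(x, y) = \max_i \lvert x_i - y_i \rvert,$$

i.e. the distance between two preference profiles is their single largest coordinate-wise gap.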