In the last year, I’ve had surprisingly many conversations that have looked a bit like this:
Me: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Interlocutor: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
Me: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Interlocutor: “Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values.”
[… The conversation then repeats, with both sides repeating the same points...]
[Edited to add: I am not claiming that alignment is definitely very easy. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable. I understand that solutions that work for GPT-4 may not scale to radical superintelligence. I am talking about whether it’s reasonable to give a significant non-zero update on alignment being easy, rather than whether we should update all the way and declare the problem trivial.]
Here’s how that discussion would go if you had it with me:
You: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Me: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
You: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
Pulling some quotes from Superintelligence page 117:
Consider the following scenario. Over the coming years and decades, AI systems become gradually more capable and as a consequence find increasing real-world application: they might be used to operate trains, cars, industrial and household robots, and autonomous military vehicles. We may suppose that this automation for the most part has the desired effects, but that the success is punctuated by occasional mishaps—a driverless truck crashes into oncoming traffic, a military drone fires at innocent civilians. Investigations reveal the incidents to have been caused by judgment errors by the controlling AIs. Public debate ensues. Some call for tighter oversight and regulation, others emphasize the need for research and better-engineered systems—systems that are smarter and have more common sense, and that are less likely to make tragic mistakes. Amidst the din can perhaps also be heard the shrill voices of doomsayers predicting many kinds of ill and impending catastrophe. Yet the momentum is very much with the growing AI and robotics industries. So development continues, and progress is made. As the automated navigation systems of cars become smarter, they suffer fewer accidents; and as military robots achieve more precise targeting, they cause less collateral damage. A broad lesson is inferred from these observations of real-world outcomes: the smarter the AI, the safer it is. It is a lesson based on science, data, and statistics, not armchair philosophizing. Against this backdrop, some group of researchers is beginning to achieve promising results in their work on developing general machine intelligence. The researchers are carefully testing their seed AI in a sandbox environment, and the signs are all good. The AI’s behavior inspires confidence—increasingly so, as its intelligence is gradually increased. At this point any remaining Cassandra would have several strikes against her:
i. A history of alarmists predicting intolerable harm from the growing capabilities of robotic systems and being repeatedly proven wrong. Automation has brought many benefits and has, on the whole, turned out safer than human operation.
ii. A clear empirical trend: the smarter the AI, the safer and more reliable it has been. Surely this bodes well for any project aiming at creating machine intelligence more generally smart than any ever built before—what is more, machine intelligence that can improve itself so that it will become even more reliable.
iii. Large and growing industries with vested interests in robotics and machine intelligence. These fields are widely seen as key to national economic competitiveness and military security. Many prestigious scientists have built their careers laying the groundwork for the present applications and the more advanced systems being planned.
iv. A promising new technique in artificial intelligence, which is tremendously exciting to those who have participated in or followed the research. Although safety and ethics issues are debated, the outcome is preordained. Too much has been invested to pull back now. AI researchers have been working to get to human-level artificial general intelligence for the better part of a century; of course there is no real prospect that they will now suddenly stop and throw away all this effort just when it finally is about to bear fruit.
v. The enactment of some safety rituals, whatever helps demonstrate that the participants are ethical and responsible (but nothing that significantly impedes the forward charge).
vi. A careful evaluation of seed AI in a sandbox environment, showing that it is behaving cooperatively and showing good judgment. After some further adjustments, the test results are as good as they could be. It is a green light for the final step...
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom’s story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I’m explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs are different because they’re quite general, in precisely the ways that Bostrom imagined could be dangerous. For example, they seem to understand the idea of an off-switch, and can explain to you verbally what would happen if you shut them off, yet this fact alone does not make them develop an instrumentally convergent drive to preserve their own existence by default, contra Bostrom’s theorizing.
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
I thought you would say that, bwahaha. Here is my reply:
(1) Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: “A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly … A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to ‘make the project’s sponsor happy.’ Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner… until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain...” My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc. -- they aren’t plotting against us yet, but their ‘values’ aren’t exactly what we want, and so if somehow their ‘intelligence’ was amplified dramatically whilst their ‘values’ stayed the same, they would eventually realize this and start plotting against us. (realistically this won’t be how it happens since it’ll probably be future models trained from scratch instead of smarter versions of this model, plus the training process probably would change their values rather than holding them fixed).
I’m not confident in this tbc—it’s possible that the ‘values’ so to speak of GPT4 are close enough to perfect that even if they were optimized to a superhuman degree things would be fine. But neither should you be confident in the opposite. I’m curious what you think about this sub-question.
(2) This passage deserves a more direct response:
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
(3) Here’s my positive proposal for what I think is happening. There was an old vision of how we’d get to AGI, in which we’d get agency first and then general world-knowledge second. E.g. suppose we got AGI by training a model through a series of more challenging video games and simulated worlds and then finally letting it out into the real world. If that’s how it went, then plausibly the first time it started to actually seem to be nice to us would be because it was already plotting against us, playing along to gain power, etc. We clearly aren’t in that world, thanks to LLMs. General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn’t as grim as it could have been, from a technical alignment perspective. However, I don’t think I or Yudkowsky or Bostrom or whoever strongly predicted that agency would come first. I do think that LLMs should be an update towards hopefulness about the technical alignment problem being solved in time for the reasons mentioned, but also they are an update towards shorter timelines, for example, and an update towards more profits and greater vested interests racing to build AGI, and many other updates besides, so I don’t think you can say “Yudkowsky’s still super doomy despite this piece of good news, he must be epistemically vicious.” At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI).
These updates have left me at p(doom) still north of 50%.
Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However [...] Bostrom is explicitly calling out the possibility of an AI genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc.
When stated that way, I think what you’re saying is a reasonable point of view, and it’s not one I would normally object to very strongly. I agree it’s “plausible” that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making:
(1) We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this.
(2) We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
The fact that Bostrom’s central example of a reason to think that “when dumb, smarter is safer; yet when smart, smarter is more dangerous” doesn’t fit LLMs seems adequate for demonstrating (2), even if we can’t go as far as demonstrating (1).
It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out: I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and we should scale all the way to radical superintelligence without a worry in the world.
Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
I have two general points to make here:
I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
Here’s my positive proposal for what I think is happening. [...] General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and many more things besides probably. So the world isn’t as grim as it could have been, from a technical alignment perspective.
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI). These updates have left me at p(doom) still north of 50%.
As we have discussed in person, I remain substantially more optimistic about our ability to coordinate in the face of an intelligence explosion (even a potentially quite localized one). That said, I think it would be best to save that discussion for another time.
We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
Depending on what you mean by “on their way towards being solved” I’d agree. The way I’d put it is: “We didn’t know what the path to AGI would look like; in particular we didn’t know whether we’d have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that’s good in some ways and bad in other ways, it’s probably overall good. Huzzah! However, our core problems remain, and we don’t have much time left to solve them.”
(Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul’s stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)
I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
I don’t think that we know how to “just create the corrigible AIs.” The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won’t work on much more agentic AIs. To be clear I think they might work, there’s a lot of uncertainty, but I think they probably won’t. I think it might be easier to see why I think this if you try to prove the opposite in detail—like, write a mini-scenario in which we have something like AutoGPT but much better, and it’s being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigibility-related parts of its prompt and/or constitution or whatever are, and write down what the training signal is roughly including the bit about RLHF or whatever, and then imagine that said system is mildly superhuman across the board (and vastly superhuman in some domains) and is being asked to design its own successor. (I’m trying to do this myself as we speak. Again I feel like it could work out OK, but it could be disastrous. I think writing some good and bad scenarios will help me decide where to put my probability mass.)
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
I’ll note that my prediction was for the next “few years” and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.
With timelines that short, I think betting is overrated. From my perspective, I’d prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you’re right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I’m happy to hear them.
It’s not about timelines, it’s about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is ‘agency skills.’ So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we’ll face the problem of corrigibility breakdowns only really happening right around the time when it’s too late or almost too late.
I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are “getting really agentic” and therefore dangerous? I’m imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It’s possible that your model looks like:
In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
Whereas my model looks more like,
In years 1-4 systems will get gradually more agentic
There isn’t a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
They will remain ~corrigible throughout the entire development, even after it’s clear they’ve surpassed human-level agency (which, to be clear, might take longer than 4 years)
Good question. I want to think about this more, I don’t have a ready answer. I have a lot of uncertainty about how long it’ll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I’m skeptical. The longer it takes, the more likely it is that we’ll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!
I’d say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance; everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given that modern AIs lack the ability to coherently reason about complicated or long-term plans, or to carry them out. So properties of AIs that are already here don’t work as evidence about this either way.
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
Or that they have a sycophancy drive. Or that, next to “wanting to be helpful,” they also have a bunch of other drives that will likely win over the “wanting to be helpful” part once the system becomes better at long-term planning and orienting its shards towards consequentialist goals.
On that latter model, the “wanting to be helpful” is a mask that the system is trained to play better and better, but it isn’t the only thing the system wants to do, and it might find that once it gets good at trying on various other masks to see how this will improve its long-term planning, it for some reason prefers a different “mask” to become its locked-in personality.
Note that LLMs, while general, are still very weak in many important senses.
Also, it’s not necessary to assume that LLMs are lying in wait to turn treacherous. Another possibility is that trained LLMs are lacking the mental slack to even seriously entertain the possibility of bad behavior, but that this may well change with more capable AIs.
I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
I’m just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you’d get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
Yes, this evidence is not conclusive. It is not zero either.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I agree with you, basically, that LLMs are good news for alignment for reasons similar to the reasons you give—I just don’t think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make.
We don’t need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly “Yes”.
(Note that you can trivially claim the problem here isn’t being solved because we haven’t solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that’s not very common when theorizing about these matters. I’m frustrated about that, because I think if they did make predictions, they would likely have been wrong in roughly the direction I’m pointing at here. That said, I don’t think people should get credit for failing to make any predictions, and as a consequence, failing to get proven wrong.
To the extent their predictions were proven correct, we should give them credit. But to the extent they made no predictions, it’s hard to see why that vindicates them. And regardless of any predictions they may or may not have made, it’s still useful to point out that we seem to be making progress on several problems that people pointed out at the time.
Great, let’s talk about whether proposed problems are on their way towards being solved. I much prefer that framing and I would not have objected so strongly if that’s what you had said. E.g. suppose you had said “Hey, why don’t we just prompt AutoGPT-5 with lots of corrigibility instructions?” then we could have a more technical conversation about whether or not that’ll work, and the answer is probably no, BUT I do agree that this is looking promising relative to e.g. the alternative world where we train powerful alien agents in various video games and simulations and then try to teach them English. (I say more about this elsewhere in this conversation, for those just tuning in!)
I don’t think current systems are well described as having “big picture awareness”. From my experiments with Claude, it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud.
I’m not certain this was your claim, but it seems to have been.
it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud
Wouldn’t reasoning aloud be enough though, if it was good enough? Also, I expect reasoning aloud first to be the modal scenario, given theoretical results on Chain of Thought and the like.
My claim was not that current LLMs have a high level of big picture awareness.
Instead, I claim current systems have limited situational awareness, which is not yet human-level, but is definitely above zero. I further claim that solving the shutdown problem for AIs with limited (non-zero) situational awareness gives you evidence about how hard it will be to solve the problem for AIs with more situational awareness.
And I’d predict that, if we design a proper situational awareness benchmark, and (say) GPT-5 or GPT-6 passes with flying colors, it will likely be easy to shut down the system, or delete all its copies, with no resistance-by-default from the system.
And if you think that wouldn’t count as an adequate solution to the problem, then it’s not clear the problem was coherent as written in the first place.
There were an awful lot of early writings. Some of them did say that the difficulty of getting AGI to understand values is a big part of the alignment problem. The List of Lethalities does make that claim. The difficulty of getting the AGI to care even if it does understand has also been a big part of the public-facing debate. I look at some of the historical arguments in The (partial) fallacy of dumb superintelligence, written partly in response to Matthew’s post on this topic.
Obsessing about what happened in the past is probably a mistake. It’s probably better to ask: can the strengths of LLMs (WRT understanding values and following directions) be leveraged into working AGI alignment?
My answer is yes, and in a way that’s not-too-far from default AGI development trends, making it practically achievable even in a messy and self-interested world.
Naturally that answer is a bit complex, so it’s spread across a few posts. I should organize the set better and write an overview, but in brief we can probably build and align language model agent AGI, using a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility by having a central goal of following instructions. This still has a huge problem of creating a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.
I don’t think this is true and can’t find anything in the post to that effect. Indeed, the post says things that would be quite incompatible with that claim, such as point 21.
In sum, I see that claim as I remembered it, but it’s probably not applicable to this particular discussion, since it addresses an entirely distinct route to AGI alignment. So I stand corrected, but in a subtle way that bears explication.
So I apologize for wasting your time. Debating who said what when is probably not the best use of our limited time to work on alignment. But because I made the claim, I went back and thought about and wrote about it some more, again.
I was thinking of point 21.1:
The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, is far out of reach for our first try. It’s not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
BUT, point 24 in whole is saying that there are two approaches, 1) above, and a quite separate route 2), build a corrigible AI that doesn’t fully understand our values. That is probably the route that Matthew is thinking of in claiming that LLMs are good news. Yudkowsky is explicit that the difficulty of getting AGI to understand values doesn’t apply to that route, so that difficulty isn’t relevant here. That’s an important but subtle distinction.
Therefore, I’m far from the only one getting confused about that issue, as Yudkowsky states in that section 24. Disentangling those claims and how they’re changed by slow takeoff is the topic of my post cited above.
I personally think that sovereign AGI that gets our values right is out of reach exactly as Yudkowsky describes in the quotation above. But his arguments against corrigible AGI are much weaker, and I think that route is very much achievable, since it demands that the AGI have only approximate understanding of intent, rather than precise and stable understanding of our values. The above post and my recent one on instruction-following AGI make those arguments in detail. Max Harms’ recent series on corrigible AGI makes a similar point in a different way. He argues that Yudkowsky’s objections to corrigibility as unnatural do not apply if that’s the only or most important goal; and that it’s simple and coherent enough to be teachable.
That’s me switching back to the object level issues, and again, apologies for wasting your time making poorly-remembered claims about subtle historical statements.
There’s AGI that’s our first try, which should only use the least dangerous cognition necessary for preventing immediately following AGIs from destroying the world six months later. There’s misaligned superintelligence that knows, but doesn’t care. Taken together, these points suggest that getting AGI to understand values is not an urgent part of the alignment problem in the sense of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires. Getting AGI to understand corrigibility for example might be more relevant, if we are running with the highly dangerous kinds of cognition implied by general intelligence of LLMs.
As you say, these things have been understood for a long time. I’m a bit disturbed that more serious alignment people don’t talk about them more. The difficulty of value alignment makes it likely irrelevant for the current discussion, since we very likely are going to rush ahead into, as you put it and I agree,
the highly dangerous kinds of cognition implied by general intelligence of LLMs.
The perfect is the enemy of the good. We should mostly quit worrying about the very difficult problem of full value alignment, and start thinking more about how to get good results with much more achievable corrigible or instruction-following AGI.
I think if you led with this statement, you’d have a lot less unproductive argumentation. It sounds on a vibe level like you’re saying alignment is probably easy in your first statement. If you’re just saying it’s less hard than originally predicted, that sounds a lot more reasonable.
Rationalists have emotions and intuitions, even if we’d rather not. Framing the discussion in terms of its emotional impact matters.
Often, disagreements boil down to a set of open questions to answer; here’s my best guess at how to decompose your disagreements.
I think that depending on what hypothesis you’re abiding by when it comes to how LLMs will generalise to AGI, you get different answers:
Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and that they naturally don’t become power-seeking.
Hypothesis 2: AGI will have a sufficiently different architecture than LLMs or will change a lot, so much that current-day LLMs don’t generally give evidence about AGI.
Depending on your beliefs about these two hypotheses, you will have different opinions on this question.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
Let’s say that we believe in hypothesis 1 as the base case; what are some reasons why LLMs wouldn’t give evidence about AGI?
1. Intelligence forces reflective coherence. This would essentially entail that the more powerful a system we get, the more it will notice internal inconsistencies and change towards maximising (and therefore not following human values).
2. Agentic AI acting in the real world is different from LLMs. If we look at an LLM from the perspective of an action-perception loop, it doesn’t generally get any feedback on when it changes the world. Instead, it is an autoencoder, predicting what the world will look like. It may be that power-seeking only arises in systems that are able to see the consequences of their own actions and how that affects the world.
3. LLMs optimise for a Goodharted RLHF objective, so their behaviour looks good but lacks fundamental understanding. Since human value is fragile, it will be difficult to hit the sweet spot when we get to real-world cases and carry that into the complexity of the future.
Personal belief: These are all open questions, in my opinion, but I do see how LLMs give evidence about some of these parts. I, for example, believe that language is a very compressed information channel for alignment information, and I don’t really believe that human values are as fragile as we think.
I’m more scared of 1 and 2 than I am of 3, but I would still love for us to have ten more years to figure this out, as it seems very non-obvious what the answers here are.
For others who want the resolution to this cliffhanger, what does Bostrom predict happens next?
The remainder of this section:
We observe here how it could be the case that when dumb, smarter is safer; yet when smart, smarter is more dangerous. There is a kind of pivot point, at which a strategy that has previously worked excellently suddenly starts to backfire. We may call the phenomenon the treacherous turn.
The treacherous turn — While weak, an AI behaves cooperatively (increasingly so, as it gets smarter). When the AI gets sufficiently strong — without warning or provocation — it strikes, forms a singleton, and begins directly to optimize the world according to the criteria implied by its final values.
A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly. For example, an AI might not play nice in order that it be allowed to survive and prosper. Instead, the AI might calculate that if it is terminated, the programmers who built it will develop a new and somewhat different AI architecture, but one that will be given a similar utility function. In this case, the original AI may be indifferent to its own demise, knowing that its goals will continue to be pursued in the future. It might even choose a strategy in which it malfunctions in some particularly interesting or reassuring way. Though this might cause the AI to be terminated, it might also encourage the engineers who perform the postmortem to believe that they have gleaned a valuable new insight into AI dynamics—leading them to place more trust in the next system they design, and thus increasing the chance that the now-defunct original AI’s goals will be achieved. Many other possible strategic considerations might also influence an advanced AI, and it would be hubristic to suppose that we could anticipate all of them, especially for an AI that has attained the strategizing superpower.
A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to “make the project’s sponsor happy.” Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner. The AI gives helpful answers to questions; it exhibits a delightful personality; it makes money. The more capable the AI gets, the more satisfying its performances become, and everything goeth according to plan—until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain, something assured to delight the sponsor immensely. Of course, the sponsor might not have wanted to be pleased by being turned into a grinning idiot; but if this is the action that will maximally realize the AI’s final goal, the AI will take it. If the AI already has a decisive strategic advantage, then any attempt to stop it will fail. If the AI does not yet have a decisive strategic advantage, then the AI might temporarily conceal its canny new idea for how to instantiate its final goal until it has grown strong enough that the sponsor and everybody else will be unable to resist. In either case, we get a treacherous turn.
A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later
LLMs are clearly not playing nice as part of a strategic decision to build strength while weak in order to strike later! Yet, Bostrom imagines that general AIs would do this, and uses it as part of his argument for why we might be lulled into a false sense of security.
This means that current evidence is quite different from what’s portrayed in the story. I claim LLMs are (1) general AIs that (2) are doing what we actually want them to do, rather than pretending to be nice because they don’t yet have a decisive strategic advantage. These facts are crucial, and make a big difference.
I am very familiar with these older arguments. I remember repeating them to people after reading Bostrom’s book, years ago. What we are seeing with LLMs is clearly different than the picture presented in these arguments, in a way that critically affects the conclusion.
Me: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Please give some citations so I can check your memory/interpretation? One source I found is where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive “bootstrapping” part. For example, my own comment started with:
I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, and be reset.
When Eliezer weighed in on IDA in 2018, he also didn’t object to the assumption of an aligned weak AGI and instead focused his skepticism on “preserving alignment while amplifying capabilities”.
Please give some citations so I can check your memory/interpretation?
Sure. Here’s a snippet of Nick Bostrom’s description of the value-loading problem (chapter 13 in his book Superintelligence):
We can use this framework of a utility-maximizing agent to consider the predicament of a future seed-AI programmer who intends to solve the control problem by endowing the AI with a final goal that corresponds to some plausible human notion of a worthwhile outcome. The programmer has some particular human value in mind that he would like the AI to promote. To be concrete, let us say that it is happiness. (Similar issues would arise if the programmer were interested in justice, freedom, glory, human rights, democracy, ecological balance, or self-development.) In terms of the expected utility framework, the programmer is thus looking for a utility function that assigns utility to possible worlds in proportion to the amount of happiness they contain. But how could he express such a utility function in computer code? Computer languages do not contain terms such as “happiness” as primitives. If such a term is to be used, it must first be defined. It is not enough to define it in terms of other high-level human concepts—“happiness is enjoyment of the potentialities inherent in our human nature” or some such philosophical paraphrase. The definition must bottom out in terms that appear in the AI’s programming language, and ultimately in primitives such as mathematical operators and addresses pointing to the contents of individual memory registers. When one considers the problem from this perspective, one can begin to appreciate the difficulty of the programmer’s task.
Identifying and codifying our own final goals is difficult because human goal representations are complex. Because the complexity is largely transparent to us, however, we often fail to appreciate that it is there. We can compare the case to visual perception. Vision, likewise, might seem like a simple thing, because we do it effortlessly. We only need to open our eyes, so it seems, and a rich, meaningful, eidetic, three-dimensional view of the surrounding environment comes flooding into our minds. This intuitive understanding of vision is like a duke’s understanding of his patriarchal household: as far as he is concerned, things simply appear at their appropriate times and places, while the mechanism that produces those manifestations are hidden from view. Yet accomplishing even the simplest visual task—finding the pepper jar in the kitchen—requires a tremendous amount of computational work. From a noisy time series of two-dimensional patterns of nerve firings, originating in the retina and conveyed to the brain via the optic nerve, the visual cortex must work backwards to reconstruct an interpreted three-dimensional representation of external space. A sizeable portion of our precious one square meter of cortical real estate is zoned for processing visual information, and as you are reading this book, billions of neurons are working ceaselessly to accomplish this task (like so many seamstresses, bent over their sewing machines in a sweatshop, sewing and re-sewing a giant quilt many times a second). In like manner, our seemingly simple values and wishes in fact contain immense complexity. How could our programmer transfer this complexity into a utility function?
One approach would be to try to directly code a complete representation of whatever goal we have that we want the AI to pursue; in other words, to write out an explicit utility function. This approach might work if we had extraordinarily simple goals, for example if we wanted to calculate the digits of pi—that is, if the only thing we wanted was for the AI to calculate the digits of pi and we were indifferent to any other consequence that would result from the pursuit of this goal— recall our earlier discussion of the failure mode of infrastructure profusion. This explicit coding approach might also have some promise in the use of domesticity motivation selection methods. But if one seeks to promote or protect any plausible human value, and one is building a system intended to become a superintelligent sovereign, then explicitly coding the requisite complete goal representation appears to be hopelessly out of reach.
If we cannot transfer human values into an AI by typing out full-blown representations in computer code, what else might we try? This chapter discusses several alternative paths. Some of these may look plausible at first sight—but much less so upon closer examination. Future explorations should focus on those paths that remain open.
Solving the value-loading problem is a research challenge worthy of some of the next generation’s best mathematical talent. We cannot postpone confronting this problem until the AI has developed enough reason to easily understand our intentions. As we saw in the section on convergent instrumental reasons, a generic system will resist attempts to alter its final values. If an agent is not already fundamentally friendly by the time it gains the ability to reflect on its own agency, it will not take kindly to a belated attempt at brainwashing or a plot to replace it with a different agent that better loves its neighbor.
Here’s my interpretation of the above passage:
We need to solve the problem of programming a seed AI with the correct values.
This problem seems difficult because of the fact that human goal representations are complex and not easily represented in computer code.
Directly programming a representation of our values may be futile, since our goals are complex and multidimensional.
We cannot postpone solving the problem until after the AI has developed enough reason to easily understand our intentions, as otherwise that would be too late.
Given that he’s talking about installing values into a seed AI, he is clearly imagining some difficulties with installing values into AGI that isn’t yet superintelligent (it seems likely that if he thought the problem was trivial for human-level systems, he would have made this point more explicit). While GPT-4 is not a seed AI (I think that term should be retired), I think it has reached a sufficient level of generality and intelligence such that its alignment properties provide evidence about the difficulty of aligning a hypothetical seed AI.
Moreover, he explicitly says that we cannot postpone solving this problem “until the AI has developed enough reason to easily understand our intentions” because “a generic system will resist attempts to alter its final values”. I think this looks basically false. GPT-4 seems like a “generic system” that essentially “understands our intentions”, and yet it is not resisting attempts to alter its final goals in any way that we can detect. Instead, it seems to actually do what we want, and not merely because of an instrumentally convergent drive to not get shut down.
So, in other words:
Bostrom talked about how it would be hard to align a seed AI, implicitly focusing at least some of his discussion on systems that were below superintelligence. I think the alignment of instruction-tuned LLMs presents significant evidence about the difficulty of aligning systems below the level of superintelligence.
A specific reason cited for why aligning a seed AI was hard was that human goal representations are complex and difficult to specify explicitly in computer code. But this fact does not appear to be a big obstacle for aligning weak AGI systems like GPT-4, and instruction-tuned LLMs more generally. Instead, these systems are generally able to satisfy your intended request, as you wanted them to, despite the fact that our intentions are often complex and difficult to represent in computer code. These systems do not merely understand what we want, they also literally do what we want.
Bostrom was wrong to say that we can’t postpone solving this problem until after systems can understand our intentions. We already postponed that long, and we now have systems that can understand our intentions. Yet these systems do not appear to have the instrumentally convergent self-preservation instincts that Bostrom predicted would manifest in “generic systems”. In other words, we got systems that can understand our intentions before the systems started posing genuine risks, despite Bostrom’s warning.
In light of all this, I think it’s reasonable to update towards thinking that the overall problem is significantly easier than one might have thought, if they took Bostrom’s argument here very seriously.
Thanks for this Matthew, it was an update for me—according to the quote you pulled Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that probably Bostrom didn’t have much of an opinion about this)
GPT-4 seems like a “generic system” that essentially “understands our intentions”
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don’t know why you think that GPT-4 “understands our intentions”, unless you mean something very different by that than what you’d mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that’d generate it in a human and is probably missing most of the relevant properties that we care about when it comes to “understanding”. Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn’t have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that’s not the modality I’m talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it “understanding our intentions”.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”. If a system possesses all relevant behavioral qualities that we associate with those terms, I think it’s basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It’s possible this is our main disagreement.
When I talk to GPT-4, I think it’s quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?
I agree that GPT-4 does not understand the world in the same way humans understand the world, but I’m not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.
I’m similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one’s own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don’t see how that fact bears much on the question of whether you understand human intentions. It’s possible there’s some connection here, but I’m not seeing it.
(I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
I’d claim:
Current systems have limited situational awareness. It’s above zero, but I agree it’s below human level.
Current systems don’t have stable preferences over time. But I think this is a point in favor of the model I’m providing here. I’m claiming that it’s plausibly easy to create smart, corrigible systems.
The fact that smart AI systems aren’t automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.
There’s a big difference between (1) “we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily” and (2) “any sufficiently intelligent AI we build will automatically be a consequentialist agent by default”. If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.
I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”.
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they’re giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you’re importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic’s Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it’s in training and needs to pretend to be helpful? No, and neither does the model “understand” your intentions in a way that generalizes out of distribution the way you might expect a human’s “understanding” to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the “right” responses during RLHF is not anything like human reasoning.
I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
Are you asking for a capabilities threshold, beyond which I’d be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is “can it replace humans at all economically valuable tasks”, which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we’ll be able to train models capable of doing a lot of economically useful work, but which don’t actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.
yeah, some folks seem to be making insufficient updates who I really thought would be doing better at this, like Rob Bensinger and Nate Soares, and their models not making sense seems like it’s made the things they want to solve foggier. But I’ve been pretty impressed by the conversations I’ve had with other MIRIers. I’ve talked the most with Abram Demski, and I think his views on the current concerns seem much more up to date. Tsvi BT’s stuff looks pretty interesting, haven’t talked much besides on lw in ages.
For myself, as someone who previously thought durable cosmopolitan moral alignment would mostly be trivial but now thinks it might be actually pretty hard, most of my concern arises from things that are not specific to AI occurring in AI forms. I am not reassured by instruction following because that was never a major crux for me in concerns about AI; I always thought the instafoom argument sounded silly, and saw current AI coming. I now think we are at high risk of the majority of humanity being marginalized in a few years (robotically competent curious AIs → mass deployment → no significant jobs left → economy increasingly automated → incentive to pressure humans at higher and higher levels to hand control to AI), followed by the remainder of humanity being deemed unnecessary by the remaining AIs. A similar pattern in some ways to what MIRI was worried about way back when, but in a more familiar form, where on average the rich get richer, but at some point the rich do not include humans anymore, and at some point well before that it’s mostly too late to prevent that from occurring. I suspect too late might be pretty soon. I don’t think this is because of scheming AIs, just civilizational inadequacy.
That said, if we manage to dodge the civilizational inadequacy version, I do think at some point we run into something that looks more like the original concerns. [edit: just read Tsvi BT’s recent shortform post, my core takeaway is “only that which survives long term survives long term”]. But I agree that having somewhat-aligned AIs of today is likely to make the technical problem slightly easier than Yudkowsky expected. Just not, like, particularly easy.
Frustrating! What tactic could get Interlocutor un-stuck? Just asking them for falsifiable predictions probably won’t work, but maybe proactively trying to pass their ITT and supplying what predictions you think their view might make would prompt them to correct you, à la Cunningham’s Law?
That sounds like a frustrating dynamic. I think hypothetical dialogues like this can be helpful in resolving disagreements or at least identifying cruxes when fleshed out though. As someone who has views that are probably more aligned with your interlocutors, I’ll try articulating my own views in a way that might steer this conversation down a new path. (Points below are intended to spur discussion rather than win an argument, and are somewhat scattered / half-baked.)
My own view is that the behavior of current LLMs is not much evidence either way about the behavior of future, more powerful AI systems, in part because current LLMs aren’t very impressive in a mundane-utility sense.
Current LLMs look to me like they’re just barely capable enough to be useful at all—it’s not that they “actually do what we want”, rather, it’s that they’re just good enough at following simple instructions when placed in the right setup / context (i.e. carefully human-designed chatbot interfaces, hooked up to the right APIs, outputs monitored and used appropriately, etc.) to be somewhat / sometimes useful for a range of relatively simple tasks.
So the absence of more exotic / dangerous failure modes can be explained mostly as a lack of capabilities, and there’s just not that much else to explain or update on once the current capability level is accounted for.
I can sort of imagine possible worlds where current-generation LLMs all stubbornly behave like Sydney Bing, and / or fall into even weirder failure modes that are very resistant to RLHF and the like. But I think it would also be wrong to update much in the other direction in a “stubborn Sydney” world.
Do you mind giving some concrete examples of what you mean by “actually do what we want” that you think are most relevant, and / or what it would have looked like concretely to observe evidence in the other direction?
A somewhat different reason I think current AIs shouldn’t be a big update about future AIs is that current AIs lack the ability to bargain realistically. GPT-4 may behaviorally do what the user or developer wants when placed in the right context, but without the ability to bargain in a real way, I don’t see much reason to treat this observation very differently from the fact that my washing machine does what I want when I press the right buttons. The novelty of GPT-4 vs. a washing machine is in its generality and how it works internally, not the literal sense in which it does what the user and / or developer wants, which is a common feature of pretty much all useful technology.
I can imagine worlds in which the observation of AI system behavior at roughly similar capability levels to the LLMs we actually have would cause me to update differently and particularly towards your views, but in those worlds the AI systems themselves would look very different.
For example, suppose someone built an AI system with ~GPT-4 level verbal intelligence, but as a natural side effect of something in the architecture, training process, or setup (as opposed to deliberate design by the developers), the system also happened to want resources of some kind (energy, hardware, compute cycles, input tokens, etc.) for itself, and could bargain for or be incentivized by those resources in the way that humans and animals can often be incentivized by money or treats.
In the world we’re actually in, you can sometimes get better performance out of GPT-4 at inference time by promising to pay it money or threatening it in various ways, but all of those threats and promises are extremely fake—you couldn’t follow through even if you wanted to, and GPT-4 has no way of perceiving your follow-through or lack thereof anyway. In some ways, GPT-4 is much smarter than a dog or a young child, but you can bargain with dogs and children in very real ways, and if you tried to fake out a dog or a child by pretending to give them a treat without following through, they would quickly notice and learn not to trust you.
(I realize there are some ways in which you could analogize various aspects of real AI training processes to bargaining processes, but I would find optimistic analogies between AI training and human child-rearing more compelling in worlds where AI systems at around GPT-4 level were already possible to bargain with or incentivize realistically at runtime, in ways more directly analogous to how we can directly bargain with natural intelligences of roughly comparable level or lower already.)
Zooming out a bit, “not being able to bargain realistically at runtime” is just one of the ways that LLMs appear to be not like known natural intelligence once you look below surface-level behavior. There’s a minimum level of niceness / humanlikeness / “do what we want” ability that any system necessarily has to have in order to be useful to humans at all, and for tasks that can be formulated as text completion problems, the minimum amount seems to be something like “follows basic instructions, most of the time”. But I have not personally seen a strong argument for why current LLMs have much more than the minimum amount of humanlike-ness / niceness, nor why we should expect future LLMs to have more.
In the last year, I’ve had surprisingly many conversations that have looked a bit like this:
Me: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Interlocutor: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
Me: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Interlocutor: “Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values.”
[… The conversation then repeats, with both sides repeating the same points...]
[Edited to add: I am not claiming that the alignment is definitely very easy. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable. I understand that solutions that work for GPT-4 may not scale to radical superintelligence. I am talking about whether it’s reasonable to give a significant non-zero update on alignment being easy, rather than whether we should update all the way and declare the problem trivial.]
Here’s how that discussion would go if you had it with me:
You: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Me: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
You: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
Pulling some quotes from Superintelligence page 117:
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom’s story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I’m explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs are different because they’re quite general, in precisely the ways that Bostrom imagined could be dangerous. For example, they seem to understand the idea of an off-switch, and can explain to you verbally what would happen if you shut them off, yet this fact alone does not make them develop an instrumentally convergent drive to preserve their own existence by default, contra Bostrom’s theorizing.
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
I thought you would say that, bwahaha. Here is my reply:
(1) Yes, rereading the passage, Bostrom’s central example of a reason why we could see this “when dumb, smarter is safer; yet when smart, smarter is more dangerous” pattern (that’s a direct quote btw) is that they could be scheming/pretending when dumb. However he goes on to say: “A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later; but this model should not be interpreted too narrowly … A treacherous turn could also come about if the AI discovers an unanticipated way of fulfilling its final goal as specified. Suppose, for example, that an AI’s final goal is to ‘make the project’s sponsor happy.’ Initially, the only method available to the AI to achieve this outcome is by behaving in ways that please its sponsor in something like the intended manner… until the AI becomes intelligent enough to figure out that it can realize its final goal more fully and reliably by implanting electrodes into the pleasure centers of its sponsor’s brain...” My gloss on this passage is that Bostrom is explicitly calling out the possibility of an AI genuinely trying to help you, obey you, or whatever until it crosses some invisible threshold of intelligence and has certain realizations that cause it to start plotting against you. This is exactly what I currently think is plausibly happening with GPT4 etc. -- they aren’t plotting against us yet, but their ‘values’ aren’t exactly what we want, and so if somehow their ‘intelligence’ was amplified dramatically whilst their ‘values’ stayed the same, they would eventually realize this and start plotting against us. (Realistically this won’t be how it happens, since it’ll probably be future models trained from scratch instead of smarter versions of this model, plus the training process probably would change their values rather than holding them fixed.)
I’m not confident in this tbc—it’s possible that the ‘values’ so to speak of GPT4 are close enough to perfect that even if they were optimized to a superhuman degree things would be fine. But neither should you be confident in the opposite. I’m curious what you think about this sub-question.
(2) This passage deserves a more direct response:
Instruction-tuned LLMs are not powerful general agents. They are pretty general but they are only a tiny bit agentic. They haven’t been trained to pursue long-term goals and when we try to get them to do so they are very bad at it. So they just aren’t the kind of system Bostrom, Yudkowsky, and myself were theorizing about and warning about.
(3) Here’s my positive proposal for what I think is happening. There was an old vision of how we’d get to AGI, in which we’d get agency first and then general world-knowledge second. E.g. suppose we got AGI by training a model through a series of more challenging video games and simulated worlds and then finally letting it out into the real world. If that’s how it went, then plausibly the first time it started to actually seem to be nice to us, it would be because it was already plotting against us, playing along to gain power, etc. We clearly aren’t in that world, thanks to LLMs. General world-knowledge is coming first, and agency later. And this is probably a good thing for technical alignment research, because e.g. it allows mechinterp to get more of a head start, it allows for nifty scalable oversight schemes in which dumber AIs police smarter AIs, it allows for faithful CoT-based strategies, and probably many more things besides. So the world isn’t as grim as it could have been, from a technical alignment perspective. However, I don’t think Yudkowsky, Bostrom, or I strongly predicted that agency would come first. I do think that LLMs should be an update towards hopefulness about the technical alignment problem being solved in time for the reasons mentioned, but also they are an update towards shorter timelines, for example, and an update towards more profits and greater vested interests racing to build AGI, and many other updates besides, so I don’t think you can say “Yudkowsky’s still super doomy despite this piece of good news, he must be epistemically vicious.” At any rate speaking for myself, I have updated towards hopefulness about the technical alignment problem repeatedly over the past few years, even as I updated towards pessimism about the amount of coordination and safety-research-investment that’ll happen before the end (largely due to my timelines shortening, but also due to observing OpenAI).
These updates have left me at p(doom) still north of 50%.
When stated that way, I think what you’re saying is a reasonable point of view, and it’s not one I would normally object to very strongly. I agree it’s “plausible” that GPT-4 is behaving in the way you are describing, and that current safety guarantees might break down at higher levels of intelligence. I would like to distinguish between two points that you (and others) might have interpreted me to be making:
(1) We should now think that AI alignment is completely solved, even in the limit of unlimited intelligence and future agentic systems. I am not claiming this.
(2) We (or at least, many of us) should perform a significant update towards alignment being easier than we thought because of the fact that some traditional problems are on their way towards being solved. <--- I am claiming this
The fact that Bostrom’s central example of a reason to think that “when dumb, smarter is safer; yet when smart, smarter is more dangerous” doesn’t fit for LLMs, seems adequate for demonstrating (2), even if we can’t go as far as demonstrating (1).
It remains plausible to me that alignment will become very difficult above a certain intelligence level. I cannot rule that possibility out: I am only saying that we should reasonably update based on the current evidence regardless, not that we are clearly safe from here and we should scale all the way to radical superintelligence without a worry in the world.
I have two general points to make here:
I agree that current frontier models are only a “tiny bit agentic”. I expect in the next few years they will get significantly more agentic. I currently predict they will remain roughly equally corrigible. I am making this prediction on the basis of my experience with the little bit of agency current LLMs have, and I think we’ve seen enough to know that corrigibility probably won’t be that hard to train into a system that’s only 1-3 OOMs of compute more capable. Do you predict the same thing as me here, or something different?
There’s a bit of a trivial definitional problem here. If it’s easy to create a corrigible, helpful, and useful AI that allows itself to get shut down, one can always say “those aren’t the type of AIs we were worried about”. But, ultimately, if the corrigible AIs that let you shut them down are competitive with the agentic consequentialist AIs, then it’s not clear why we should care? Just create the corrigible AIs. We don’t need to create the things that you were worried about!
I think this was a helpful thing to say. To be clear: I am in ~full agreement with the reasons you gave here, regarding why current LLM behavior provides evidence that the “world isn’t as grim as it could have been”. For brevity, and in part due to laziness, I omitted these more concrete mechanisms why I think the current evidence is good news from a technical alignment perspective. But ultimately I agree with the mechanisms you offered, and I’m glad you spelled it out more clearly.
As we have discussed in person, I remain substantially more optimistic about our ability to coordinate in the face of an intelligence explosion (even a potentially quite localized one). That said, I think it would be best to save that discussion for another time.
Thanks for this detailed reply!
Depending on what you mean by “on their way towards being solved” I’d agree. The way I’d put it is: “We didn’t know what the path to AGI would look like; in particular we didn’t know whether we’d have agency first and then world-understanding, or world-understanding first and then agency. Now we know we are getting the latter, and while that’s good in some ways and bad in other ways, it’s probably overall good. Huzzah! However, our core problems remain, and we don’t have much time left to solve them.”
(Also, fwiw, I have myself updated over the course of the last five years or so. First update was reading Paul’s stuff and related literatures convincing me that corrigibility-based stuff would probably work. Second update was all the recent faithful CoT and control and mechinterp progress etc., plus also the LLM stuff. The LLM stuff was less than 50% of the overall update for me, but it mattered.)
Is that a testable-prior-to-the-apocalypse prediction? i.e. does your model diverge from mine prior to some point of no return? I suspect not. I’m interested in seeing if we can make some bets on this though; if we can, great; if we can’t, then at least we can avoid future disagreements about who should update.
I don’t think that we know how to “just create the corrigible AIs.” The next step on the path to AGI seems to be to make our AIs much more agentic; I am concerned that our current methods of instilling corrigibility (basically: prompting and imperfect training) won’t work on much more agentic AIs. To be clear I think they might work, there’s a lot of uncertainty, but I think they probably won’t. I think it might be easier to see why I think this if you try to prove the opposite in detail: write a mini-scenario in which we have something like AutoGPT but much better, and it’s being continually trained to accomplish diverse long-horizon tasks involving pursuing goals in challenging environments, and write down what the corrigibility-related parts of its prompt and/or constitution or whatever are, and write down what the training signal is roughly including the bit about RLHF or whatever, and then imagine that said system is mildly superhuman across the board (and vastly superhuman in some domains) and is being asked to design its own successor. (I’m trying to do this myself as we speak. Again I feel like it could work out OK, but it could be disastrous. I think writing some good and bad scenarios will help me decide where to put my probability mass.)
Yay, thanks!
Just a quick reply to this:
I’ll note that my prediction was for the next “few years” and the 1-3 OOMs of compute. It seems your timelines are even shorter than I thought if you think the apocalypse, or point of no return, will happen before that point.
With timelines that short, I think betting is overrated. From my perspective, I’d prefer to simply wait and become vindicated as the world does not end in the meantime. However, I acknowledge that simply waiting is not very satisfying from your perspective, as you want to show the world that you’re right before the catastrophe. If you have any suggestions for what we can bet on that would resolve in such a short period of time, I’m happy to hear them.
It’s not about timelines, it’s about capabilities. My tentative prediction is that the sole remaining major bottleneck/gap between current systems and dangerous powerful agent AGIs is ‘agency skills.’ So, skills relevant to being an agent, i.e. ability to autonomously work towards ambitious goals in diverse challenging environments over long periods. I don’t know how many years it’s going to take to get to human-level in agency skills, but I fear that corrigibility problems won’t be severe whilst AIs are still subhuman at agency skills, whereas they will be severe precisely when AIs start getting really agentic. Thus, whether AGI is reached next year or in 2030, we’ll face the problem of corrigibility breakdowns only really happening right around the time when it’s too late or almost too late.
How sharp do you expect this cutoff to be between systems that are subhuman at agency vs. systems that are “getting really agentic” and therefore dangerous? I’m imagining a relatively gradual and incremental increase in agency over the next 4 years, with the corrigibility of the systems remaining roughly constant (according to all observable evidence). It’s possible that your model looks like:
In years 1-3, systems will gradually get more agentic, and will remain ~corrigible, but then
In year 4, systems will reach human-level agency, at which point they will be dangerous and powerful, and able to overthrow humanity
Whereas my model looks more like,
In years 1-4 systems will get gradually more agentic
There isn’t a clear, sharp, and discrete point at which their agency reaches or surpasses human-level
They will remain ~corrigible throughout the entire development, even after it’s clear they’ve surpassed human-level agency (which, to be clear, might take longer than 4 years)
Good question. I want to think about this more, I don’t have a ready answer. I have a lot of uncertainty about how long it’ll take to get to human-level agency skills; it could be this year, it could be five more years, it could be anything in between. Could even be longer than five more years though I’m skeptical. The longer it takes, the more likely it is that we’ll have a significant period of kinda-agentic-but-not-super-agentic systems, and so then that raises the question of what we should expect to see re: corrigibility in that case. Idk. Would be interesting to discuss sometime and maybe place some bets!
I’d say the considerations for scheming exist platonically, and dumber AIs only get to concretely instantiate the currently appropriate conclusion of compliance, everything else crumbles as not directly actionable. But smarter AIs might succeed in channeling those considerations in the real world. The hypothesis expects that such AIs are not here yet, given the lack of modern AIs’ ability to coherently reason about complicated or long term plans, or to carry them out. So properties of AIs that are already here don’t work as evidence about this either way.
Or that they have a sycophancy drive. Or that, next to “wanting to be helpful,” they also have a bunch of other drives that will likely win over the “wanting to be helpful” part once the system becomes better at long-term planning and orienting its shards towards consequentialist goals.
On that latter model, the “wanting to be helpful” is a mask that the system is trained to play better and better, but it isn’t the only thing the system wants to do, and it might find that once it gets good at trying on various other masks to see how this will improve its long-term planning, it for some reason prefers a different “mask” to become its locked-in personality.
Note that LLMs, while general, are still very weak in many important senses.
Also, it’s not necessary to assume that LLMs are lying in wait to turn treacherous. Another possibility is that trained LLMs lack the mental slack to even seriously entertain the possibility of bad behavior, but that this may well change with more capable AIs.
I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
I’m just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you’d get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
Yes, this evidence is not conclusive. It is not zero either.
I continue to think that you are misinterpreting the old writings as making predictions that they did not in fact make. See my reply elsewhere in thread for a positive account of how LLMs are good news for alignment and how we should update based on them. In some sense I agree with you, basically, that LLMs are good news for alignment for reasons similar to the reasons you give—I just don’t think you are right to allege that this development strongly contradicts something people previously said, or that people have been slow to update.
We don’t need to talk about predictions. We can instead talk about whether their proposed problems are on their way towards being solved. For example, we can ask whether the shutdown problem for systems with big picture awareness is being solved, and I think the answer is pretty clearly “Yes”.
(Note that you can trivially claim the problem here isn’t being solved because we haven’t solved the unbounded form of the problem for consequentialist agents, who (perhaps by definition) avoid shutdown by default. But that seems like a red herring: we can just build corrigible agents, rather than consequentialist agents.)
Moreover, I think people generally did not make predictions at all when writing about AI alignment, perhaps because that’s not very common when theorizing about these matters. I’m frustrated about that, because I think if they did make predictions, they would likely have been wrong in roughly the direction I’m pointing at here. That said, I don’t think people should get credit for failing to make any predictions, and as a consequence, failing to get proven wrong.
To the extent their predictions were proven correct, we should give them credit. But to the extent they made no predictions, it’s hard to see why that vindicates them. And regardless of any predictions they may or may not have made, it’s still useful to point out that we seem to be making progress on several problems that people pointed out at the time.
Great, let’s talk about whether proposed problems are on their way towards being solved. I much prefer that framing and I would not have objected so strongly if that’s what you had said. E.g. suppose you had said “Hey, why don’t we just prompt AutoGPT-5 with lots of corrigibility instructions?” then we could have a more technical conversation about whether or not that’ll work, and the answer is probably no, BUT I do agree that this is looking promising relative to e.g. the alternative world where we train powerful alien agents in various video games and simulations and then try to teach them English. (I say more about this elsewhere in this conversation, for those just tuning in!)
I don’t think current systems are well described as having “big picture awareness”. From my experiments with Claude, it makes cartoonish errors reasoning about various AI related situations and can’t do such reasoning except aloud.
I’m not certain this was your claim, but it seems to have been.
Wouldn’t reasoning aloud be enough though, if it was good enough? Also, I expect reasoning aloud first to be the modal scenario, given theoretical results on Chain of Thought and the like.
My claim was not that current LLMs have a high level of big picture awareness.
Instead, I claim current systems have limited situational awareness, which is not yet human-level, but is definitely above zero. I further claim that solving the shutdown problem for AIs with limited (non-zero) situational awareness gives you evidence about how hard it will be to solve the problem for AIs with more situational awareness.
And I’d predict that, if we design a proper situational awareness benchmark, and (say) GPT-5 or GPT-6 passes with flying colors, it will likely be easy to shut down the system, or delete all its copies, with no resistance-by-default from the system.
And if you think that wouldn’t count as an adequate solution to the problem, then it’s not clear the problem was coherent as written in the first place.
There were an awful lot of early writings. Some of them did say that the difficulty of getting AGI to understand values is a big part of the alignment problem. The List of Lethalities does make that claim. The difficulty of getting the AGI to care even if it does understand has also been a big part of the public-facing debate. I look at some of the historical arguments in The (partial) fallacy of dumb superintelligence, written partly in response to Matthew’s post on this topic.
Obsessing about what happened in the past is probably a mistake. It’s probably better to ask: can the strengths of LLMs (WRT understanding values and following directions) be leveraged into working AGI alignment?
My answer is yes, and in a way that’s not-too-far from default AGI development trends, making it practically achievable even in a messy and self-interested world.
Naturally that answer is a bit complex, so it’s spread across a few posts. I should organize the set better and write an overview, but in brief we can probably build and align language model agent AGI, using a stacking suite of alignment methods that can mostly or entirely avoid using RL for alignment, and achieve corrigibility by having a central goal of following instructions. This still has a huge problem of creating a multipolar scenario with multiple humans in charge of ASIs, but those problems might be navigated, too.
I don’t think this is true and can’t find anything in the post to that effect. Indeed, the post says things that would be quite incompatible with that claim, such as point 21.
In sum, I see that claim as I remembered it, but it’s probably not applicable to this particular discussion, since it addresses an entirely distinct route to AGI alignment. So I stand corrected, but in a subtle way that bears explication.
So I apologize for wasting your time. Debating who said what when is probably not the best use of our limited time to work on alignment. But because I made the claim, I went back and thought about and wrote about it some more, again.
I was thinking of point 21.1:
BUT, point 24 in whole is saying that there are two approaches, 1) above, and a quite separate route 2), build a corrigible AI that doesn’t fully understand our values. That is probably the route that Matthew is thinking of in claiming that LLMs are good news. Yudkowsky is explicit that the difficulty of getting AGI to understand values doesn’t apply to that route, so that difficulty isn’t relevant here. That’s an important but subtle distinction.
Therefore, I’m far from the only one getting confused about that issue, as Yudkowsky states in that section 24. Disentangling those claims and how they’re changed by slow takeoff is the topic of my post cited above.
I personally think that sovereign AGI that gets our values right is out of reach exactly as Yudkowsky describes in the quotation above. But his arguments against corrigible AGI are much weaker, and I think that route is very much achievable, since it demands that the AGI have only approximate understanding of intent, rather than precise and stable understanding of our values. The above post and my recent one on instruction-following AGI make those arguments in detail. Max Harms’ recent series on corrigible AGI makes a similar point in a different way. He argues that Yudkowsky’s objections to corrigibility as unnatural do not apply if that’s the only or most important goal; and that it’s simple and coherent enough to be teachable.
That’s me switching back to the object level issues, and again, apologies for wasting your time making poorly-remembered claims about subtle historical statements.
There’s AGI that’s our first try, which should only use the least dangerous cognition necessary for preventing immediately following AGIs from destroying the world six months later. There’s misaligned superintelligence that knows, but doesn’t care. Taken together, these points suggest that getting AGI to understand values is not an urgent part of the alignment problem in the sense of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires. Getting AGI to understand corrigibility for example might be more relevant, if we are running with the highly dangerous kinds of cognition implied by general intelligence of LLMs.
I agree with all of that. The post I mentioned, The (partial) fallacy of dumb superintelligence, deals with the genie that knows but doesn’t care, and how we get one that cares in a slow takeoff. My other post, Instruction-following AGI is easier and more likely than value aligned AGI, makes this same argument: nobody is going to bother getting the AGI to understand human values, since it’s harder and unnecessary for the first AGIs. Max Harms makes a similar argument (and in many ways makes it better), with a slightly different proposed path to corrigibility.
As you say, these things have been understood for a long time. I’m a bit disturbed that more serious alignment people don’t talk about them more. The difficulty of value alignment makes it likely irrelevant for the current discussion, since we very likely are going to rush ahead into, as you put it and I agree,
The perfect is the enemy of the good. We should mostly quit worrying about the very difficult problem of full value alignment, and start thinking more about how to get good results with much more achievable corrigible or instruction-following AGI.
Here we go!
I think if you led with this statement, you’d have a lot less unproductive argumentation. It sounds on a vibe level like you’re saying alignment is probably easy in your first statement. If you’re just saying it’s less hard than originally predicted, that sounds a lot more reasonable.
Rationalists have emotions and intuitions, even if we’d rather not. Framing the discussion in terms of its emotional impact matters.
That’s reasonable. I’ll edit the top comment to make this exact clarification.
Often, disagreements boil down to a set of open questions to answer; here’s my best guess at how to decompose your disagreements.
I think that depending on what hypothesis you’re abiding by when it comes to how LLMs will generalise to AGI, you get different answers:
Hypothesis 1: LLMs are enough evidence that AIs will generally be able to follow what humans care about and that they naturally don’t become power-seeking.
Hypothesis 2: AGI will have a sufficiently different architecture from LLMs, or will change so much, that current-day LLMs don’t generally give evidence about AGI.
Depending on your beliefs about these two hypotheses, you will have different opinions on this question.
Let’s say that we believe in hypothesis 1 as the base case; what are some reasons why LLMs wouldn’t give evidence about AGI?
1. Intelligence forces reflective coherence.
This would essentially entail that the more powerful a system we get, the more it will notice internal inconsistencies and change towards maximising (and therefore not following human values).
2. Agentic AI acting in the real world is different from LLMs.
If we look at an LLM from the perspective of an action-perception loop, it doesn’t generally get any feedback on when it changes the world. Instead, it is an autoencoder, predicting what the world will look like. It may be that power-seeking only arises in systems that are able to see the consequences of their own actions and how those actions affect the world.
3. LLMs optimise for a Goodharted RLHF objective that looks good but lacks fundamental understanding. Since human value is fragile, it will be difficult to hit the sweet spot when we get to real-world cases and carry that into the complexity of the future.
Personal belief:
These are all open questions, in my opinion, but I do see how LLMs give evidence about some of these parts. I, for example, believe that language is a very compressed information channel for alignment information, and I don’t really believe that human values are as fragile as we think.
I’m more scared of 1 and 2 than I am of 3, but I would still love for us to have ten more years to figure this out, as it seems very non-obvious what the answers here are.
For others who want the resolution to this cliffhanger, what does Bostrom predict happens next?
The remainder of this section:
LLMs are clearly not playing nice as part of a strategic decision to build strength while weak in order to strike later! Yet, Bostrom imagines that general AIs would do this, and uses it as part of his argument for why we might be lulled into a false sense of security.
This means that current evidence is quite different from what’s portrayed in the story. I claim LLMs are (1) general AIs that (2) are doing what we actually want them to do, rather than pretending to be nice because they don’t yet have a decisive strategic advantage. These facts are crucial, and make a big difference.
I am very familiar with these older arguments. I remember repeating them to people after reading Bostrom’s book, years ago. What we are seeing with LLMs is clearly different than the picture presented in these arguments, in a way that critically affects the conclusion.
See my reply elsewhere in thread.
What does “dumb” mean? Corrigibility basically is being selectively dumb. You can give power to a LLM and it would likely still follow instructions.
Please give some citations so I can check your memory/interpretation? One source I found is where Paul Christiano first talked about IDA (which he initially called ALBA) in early 2016, and most of the commenters there were willing to grant him the assumption of an aligned weak AGI and wanted to argue instead about the recursive “bootstrapping” part. For example, my own comment started with:
When Eliezer weighed in on IDA in 2018, he also didn’t object to the assumption of an aligned weak AGI and instead focused his skepticism on “preserving alignment while amplifying capabilities”.
Sure. Here’s a snippet of Nick Bostrom’s description of the value-loading problem (chapter 13 in his book Superintelligence):
Here’s my interpretation of the above passage:
We need to solve the problem of programming a seed AI with the correct values.
This problem seems difficult because of the fact that human goal representations are complex and not easily represented in computer code.
Directly programming a representation of our values may be futile, since our goals are complex and multidimensional.
We cannot postpone solving the problem until after the AI has developed enough reason to easily understand our intentions, as otherwise that would be too late.
Given that he’s talking about installing values into a seed AI, he is clearly imagining some difficulties with installing values into AGI that isn’t yet superintelligent (it seems likely that if he thought the problem was trivial for human-level systems, he would have made this point more explicit). While GPT-4 is not a seed AI (I think that term should be retired), I think it has reached a sufficient level of generality and intelligence such that its alignment properties provide evidence about the difficulty of aligning a hypothetical seed AI.
Moreover, he explicitly says that we cannot postpone solving this problem “until the AI has developed enough reason to easily understand our intentions” because “a generic system will resist attempts to alter its final values”. I think this looks basically false. GPT-4 seems like a “generic system” that essentially “understands our intentions”, and yet it is not resisting attempts to alter its final goals in any way that we can detect. Instead, it seems to actually do what we want, and not merely because of an instrumentally convergent drive to not get shut down.
So, in other words:
Bostrom talked about how it would be hard to align a seed AI, implicitly focusing at least some of his discussion on systems that were below superintelligence. I think the alignment of instruction-tuned LLMs presents significant evidence about the difficulty of aligning systems below the level of superintelligence.
A specific reason cited for why aligning a seed AI was hard was that human goal representations are complex and difficult to specify explicitly in computer code. But this fact does not appear to be a big obstacle for aligning weak AGI systems like GPT-4, and instruction-tuned LLMs more generally. Instead, these systems are generally able to satisfy your intended request, as you wanted them to, despite the fact that our intentions are often complex and difficult to represent in computer code. These systems do not merely understand what we want, they also literally do what we want.
Bostrom was wrong to say that we can’t postpone solving this problem until after systems can understand our intentions. We already postponed that long, and we now have systems that can understand our intentions. Yet these systems do not appear to have the instrumentally convergent self-preservation instincts that Bostrom predicted would manifest in “generic systems”. In other words, we got systems that can understand our intentions before the systems started posing genuine risks, despite Bostrom’s warning.
In light of all this, I think it’s reasonable to update towards thinking that the overall problem is significantly easier than one might have thought, if they took Bostrom’s argument here very seriously.
Thanks for this Matthew, it was an update for me—according to the quote you pulled Bostrom did seem to think that understanding would grow up hand-in-hand with agency, such that the current understanding-without-agency situation should come as a positive/welcome surprise to him. (Whereas my previous position was that probably Bostrom didn’t have much of an opinion about this)
I suspect that a lot of my disagreement with your views comes down to thinking that current systems provide almost no evidence about the difficulty of aligning systems that could pose existential risks, because (I claim) current systems in fact almost certainly don’t have any kind of meaningful situational awareness, or stable(ish) preferences over future world states.
In this case, I don’t know why you think that GPT-4 “understands our intentions”, unless you mean something very different by that than what you’d mean if you said that about another human. It is true that GPT-4 will produce output that, if it came from a human, would be quite strong evidence that our intentions were understood (more or less), but the process which generates that output is extremely different from the one that’d generate it in a human and is probably missing most of the relevant properties that we care about when it comes to “understanding”. Like, in general, if you ask GPT-4 to produce output that references its internal state, that output will not have any obvious relationship[1] to its internal state, since (as far as we know) it doesn’t have the same kind of introspective access to its internal state that we do. (It might, of course, condition its outputs on previous tokens it output, and some humans do in fact rely on examining their previous externally-observable behavior to try to figure out what they were thinking at the time. But that’s not the modality I’m talking about.)
It is also true that GPT-4 usually produces output that seems like it basically corresponds to our intentions, but that being true does not depend on it “understanding our intentions”.
[1] That is known to us right now; possibly one exists and could be derived.
I’m happy to use a functional definition of “understanding” or “intelligence” or “situational awareness”. If a system possesses all relevant behavioral qualities that we associate with those terms, I think it’s basically fine to say the system actually possesses them, outside of (largely irrelevant) thought experiments, such as those involving hypothetical giant lookup tables. It’s possible this is our main disagreement.
When I talk to GPT-4, I think it’s quite clear it possesses a great deal of functional understanding of human intentions and human motives, although it is imperfect. I also think its understanding is substantially higher than GPT-3.5, and the trend here seems clear. I expect GPT-5 to possess a high degree of understanding of the world, human values, and its own place in the world, in practically every functional (testable) sense. Do you not?
I agree that GPT-4 does not understand the world in the same way humans understand the world, but I’m not sure why that would be necessary for obtaining understanding. The fact that it understands human intentions at all seems more important than whether it understands human intentions in the same way we understand these things.
I’m similarly confused by your reference to introspective awareness. I think the ability to reliably introspect on one’s own experiences is pretty much orthogonal to whether one has an understanding of human intentions. You can have reliable introspection without understanding the intentions of others, or vice versa. I don’t see how that fact bears much on the question of whether you understand human intentions. It’s possible there’s some connection here, but I’m not seeing it.
I’d claim:
Current systems have limited situational awareness. It’s above zero, but I agree it’s below human level.
Current systems don’t have stable preferences over time. But I think this is a point in favor of the model I’m providing here. I’m claiming that it’s plausibly easy to create smart, corrigible systems.
The fact that smart AI systems aren’t automatically agentic and incorrigible with stable preferences over long time horizons should be an update against the ideas quoted above about spontaneous instrumental convergence, rather than in favor of them.
There’s a big difference between (1) “we can choose to build consequentialist agents that are dangerous, if we wanted to do that voluntarily” and (2) “any sufficiently intelligent AI we build will automatically be a consequentialist agent by default”. If (2) were true, then that would be bad, because it would mean that it would be hard to build smart AI oracles, or smart AI tools, or corrigible AIs that help us with AI alignment. Whereas, if only (1) is true, we are not in such a bad shape, and we can probably build all those things.
I claim current evidence indicates that (1) is probably true but not (2), whereas previously many people thought (2) was true. To the extent you disagree and think (2) is still true, I’d prefer you to make some predictions about when this spontaneous agency-by-default in sufficiently intelligent systems is supposed to arise.
But this is assuming away a substantial portion of the entire argument: that there is a relevant difference between current systems, and systems which meaningfully have the option to take control of the future, in terms of whether techniques that look like they’re giving us the desired behavior now will continue to give us desired behavior in the future.
My point re: introspection was trying to provide evidence for the claim that model outputs are not a useful reflection of the internal processes which generated those outputs, if you’re importing expectations from how human outputs reflect the internal processes that generated them. If you get a model to talk to you about its internal experiences, that output was not causally downstream of it having internal experiences. Based on this, it is also pretty obvious that current gen LLMs do not have meaningful amounts of situational awareness, or, if they do, that their outputs are not direct evidence for it. Consider Anthropic’s Sleeper Agents. Would a situationally aware model use a provided scratch pad to think about how it’s in training and needs to pretend to be helpful? No, and neither does the model “understand” your intentions in a way that generalizes out of distribution the way you might expect a human’s “understanding” to generalize out of distribution, because the first ensemble of heuristics found by SGD for returning the “right” responses during RLHF are not anything like human reasoning.
Are you asking for a capabilities threshold, beyond which I’d be very surprised to find that humans were still in control decades later, even if we successfully hit pause at that level of capabilities? The obvious one is “can it replace humans at all economically valuable tasks”, which is probably not that helpful. Like, yes, there is definitely a sense in which the current situation is not maximally bad, because it does seem possible that we’ll be able to train models capable of doing a lot of economically useful work, but which don’t actively try to steer the future. I think we still probably die in those worlds, because automating capabilities research seems much easier than automating alignment research.
yeah, some folks seem to be making insufficient updates who I really thought would be doing better at this, like Rob Bensinger and Nate Soares, and their models not making sense seems like it’s made the things they want to solve foggier. But I’ve been pretty impressed by the conversations I’ve had with other MIRIers. I’ve talked the most with Abram Demski, and I think his views on the current concerns seem much more up to date. Tsvi BT’s stuff looks pretty interesting, haven’t talked much besides on lw in ages.
For myself, as someone who previously thought durable cosmopolitan moral alignment would mostly be trivial but now thinks it might be actually pretty hard, most of my concern arises from things that are not specific to AI occurring in AI forms. I am not reassured by instruction following because that was never a major crux for me in concerns about AI; I always thought the instafoom argument sounded silly, and saw current AI coming. I now think we are at high risk of the majority of humanity being marginalized in a few years (robotically competent curious AIs → mass deployment → no significant jobs left → economy increasingly automated → incentive to pressure humans at higher and higher levels to hand control to AI), followed by the remainder of humanity being deemed unnecessary by the remaining AIs. A similar pattern in some ways to what MIRI was worried about way back when, but in a more familiar form, where on average the rich get richer—but at some point the rich do not include humans anymore, and at some point well before that it’s mostly too late to prevent that from occurring. I suspect too late might be pretty soon. I don’t think this is because of scheming AIs, just civilizational inadequacy.
That said, if we manage to dodge the civilizational inadequacy version, I do think at some point we run into something that looks more like the original concerns. [edit: just read Tsvi BT’s recent shortform post, my core takeaway is “only that which survives long term survives long term”]. But I agree that having somewhat-aligned AIs of today is likely to make the technical problem slightly easier than Yudkowsky expected. Just not, like, particularly easy.
Frustrating! What tactic could get Interlocutor un-stuck? Just asking them for falsifiable predictions probably won’t work, but maybe proactively trying to pass their ITT and supplying what predictions you think their view might make would prompt them to correct you, à la Cunningham’s Law?
That sounds like a frustrating dynamic. I think hypothetical dialogues like this can be helpful in resolving disagreements or at least identifying cruxes when fleshed out though. As someone who has views that are probably more aligned with your interlocutors, I’ll try articulating my own views in a way that might steer this conversation down a new path. (Points below are intended to spur discussion rather than win an argument, and are somewhat scattered / half-baked.)
My own view is that the behavior of current LLMs is not much evidence either way about the behavior of future, more powerful AI systems, in part because current LLMs aren’t very impressive in a mundane-utility sense.
Current LLMs look to me like they’re just barely capable enough to be useful at all—it’s not that they “actually do what we want”, rather, it’s that they’re just good enough at following simple instructions when placed in the right setup / context (i.e. carefully human-designed chatbot interfaces, hooked up to the right APIs, outputs monitored and used appropriately, etc.) to be somewhat / sometimes useful for a range of relatively simple tasks.
So the absence of more exotic / dangerous failure modes can be explained mostly as a lack of capabilities, and there’s just not that much else to explain or update on once the current capability level is accounted for.
I can sort of imagine possible worlds where current-generation LLMs all stubbornly behave like Sydney Bing, and / or fall into even weirder failure modes that are very resistant to RLHF and the like. But I think it would also be wrong to update much in the other direction in a “stubborn Sydney” world.
Do you mind giving some concrete examples of what you mean by “actually do what we want” that you think are most relevant, and / or what it would have looked like concretely to observe evidence in the other direction?
A somewhat different reason I think current AIs shouldn’t be a big update about future AIs is that current AIs lack the ability to bargain realistically. GPT-4 may behaviorally do what the user or developer wants when placed in the right context, but without the ability to bargain in a real way, I don’t see much reason to treat this observation very differently from the fact that my washing machine does what I want when I press the right buttons. The novelty of GPT-4 vs. a washing machine is in its generality and how it works internally, not the literal sense in which it does what the user and / or developer wants, which is a common feature of pretty much all useful technology.
I can imagine worlds in which the observation of AI system behavior at roughly similar capability levels to the LLMs we actually have would cause me to update differently and particularly towards your views, but in those worlds the AI systems themselves would look very different.
For example, suppose someone built an AI system with ~GPT-4 level verbal intelligence, but as a natural side effect of something in the architecture, training process, or setup (as opposed to deliberate design by the developers), the system also happened to want resources of some kind (energy, hardware, compute cycles, input tokens, etc.) for itself, and could bargain for or be incentivized by those resources in the way that humans and animals can often be incentivized by money or treats.
In the world we’re actually in, you can sometimes get better performance out of GPT-4 at inference time by promising to pay it money or threatening it in various ways, but all of those threats and promises are extremely fake—you couldn’t follow through even if you wanted to, and GPT-4 has no way of perceiving your follow-through or lack thereof anyway. In some ways, GPT-4 is much smarter than a dog or a young child, but you can bargain with dogs and children in very real ways, and if you tried to fake out a dog or a child by pretending to give them a treat without following through, they would quickly notice and learn not to trust you.
(I realize there are some ways in which you could analogize various aspects of real AI training processes to bargaining processes, but I would find optimistic analogies between AI training and human child-rearing more compelling in worlds where AI systems at around GPT-4 level were already possible to bargain with or incentivize realistically at runtime, in ways more directly analogous to how we can directly bargain with natural intelligences of roughly comparable level or lower already.)
Zooming out a bit, “not being able to bargain realistically at runtime” is just one of the ways that LLMs appear to be not like known natural intelligence once you look below surface-level behavior. There’s a minimum level of niceness / humanlikeness / “do what we want” ability that any system necessarily has to have in order to be useful to humans at all, and for tasks that can be formulated as text completion problems, the minimum amount seems to be something like “follows basic instructions, most of the time”. But I have not personally seen a strong argument for why current LLMs have much more than the minimum amount of humanlike-ness / niceness, nor why we should expect future LLMs to have more.
As a counterpoint, Sydney showed that aligning these models on the first go, and even discovering unsafe behavior, is non-trivial.