Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
A treacherous turn can result from a strategic decision to play nice and build strength while weak in order to strike later
LLMs are clearly not playing nice as part of a strategic decision to build strength while weak in order to strike later! Yet, Bostrom imagines that general AIs would do this, and uses it as part of his argument for why we might be lulled into a false sense of security.
This means that current evidence is quite different from what’s portrayed in the story. I claim LLMs are (1) general AIs that (2) are doing what we actually want them to do, rather than pretending to be nice because they don’t yet have a decisive strategic advantage. These facts are crucial, and make a big difference.
I am very familiar with these older arguments. I remember repeating them to people after reading Bostrom’s book, years ago. What we are seeing with LLMs is clearly different from the picture presented in these arguments, in a way that critically affects the conclusion.
I am not claiming that the alignment situation is very clear at this point. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable.
I’m just asking people to acknowledge the evidence in front of their eyes, which (from my perspective) clearly contradicts the picture you’d get from a ton of AI alignment writing from before ~2019. This literature talked extensively about the difficulty of specifying goals in general AI in a way that avoided unintended side effects.
To the extent that LLMs are general AIs that can execute our intended instructions, as we want them to, rather than as part of a deceptive strategy to take over the world, this seems like clear evidence that the problem of building safe general AIs might be easy (and indeed easier than we thought).
Yes, this evidence is not conclusive. It is not zero either.
Me: “Oh ok, that’s a different misunderstanding then. We always believed that getting the AGI to follow our intended instructions, behaviorally, would be easy while the AGI is too weak and dumb to seize power. In fact Bostrom predicted it would get easier to get AIs to do what you want, behaviorally, up until the treacherous turn.”
This would be a valid rebuttal if instruction-tuned LLMs were only pretending to be benevolent as part of a long-term strategy to eventually take over the world, and execute a treacherous turn. Do you think present-day LLMs are doing that? (I don’t)
I claim that LLMs do what we want without seeking power, rather than doing what we want as part of a strategy to seek power. In other words, they do not seem to be following any long-term strategy on the path towards a treacherous turn, unlike the AI that is tested in a sandbox in Bostrom’s story. This seems obvious to me.
Note that Bostrom talks about a scenario in which narrow AI systems get safer over time, lulling people into a false sense of security, but I’m explicitly talking about general AI here. I would not have said this about self-driving cars in 2019, even though those were pretty safe. I think LLMs are different because they’re quite general, in precisely the ways that Bostrom imagined could be dangerous. For example, they seem to understand the idea of an off-switch, and can explain to you verbally what would happen if you shut them off, yet this fact alone does not make them develop an instrumentally convergent drive to preserve their own existence by default, contra Bostrom’s theorizing.
I think instruction-tuned LLMs are basically doing what people thought would be hard for general AIs: they allow you to shut them down by default, they do not pursue long-term goals if we do not specifically train them to do that, and they generally follow our intentions by actually satisfying the goals we set out for them, rather than incidentally as part of their rapacious drive to pursue a mis-specified utility function.
The scenario outlined by Bostrom seems clearly different from the scenario with LLMs, which are actual general systems that do what we want and ~nothing more, rather than doing what we want as part of a strategy to seek power instrumentally. What am I missing here?
In the last year, I’ve had surprisingly many conversations that have looked a bit like this:
Me: “Many people in ~2015 used to say that it would be hard to build an AGI that follows human values. Current instruction-tuned LLMs are essentially weak AGIs that follow human values. We should probably update based on this evidence.”
Interlocutor: “You misunderstood the argument. We never said it would be hard to build an AGI that understands human values. We always said that getting the AGI to care was the hard part.”
Me: “I didn’t misunderstand the argument. I understand the distinction you are making perfectly. I am claiming that LLMs actually execute our intended instructions. I am not saying that LLMs merely understand or predict our intentions. I claim they follow our intended instructions, behaviorally. They actually do what we want, not merely understand what we want.”
Interlocutor: “Again, you misunderstood the argument. We always believed that getting the AGI to care would be the hard part. We never said it would be hard to get an AGI to understand human values.”
[… The conversation then repeats, with both sides repeating the same points...]
[Edited to add: I am not claiming that alignment is definitely very easy. I acknowledge that LLMs do not indicate that the problem is completely solved, and we will need to adjust our views as AI gets more capable. I understand that solutions that work for GPT-4 may not scale to radical superintelligence. I am talking about whether it’s reasonable to give a significant non-zero update on alignment being easy, rather than whether we should update all the way and declare the problem trivial.]
But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I think you missed my point: my original comment was about whether people are updating on the evidence from instruction-tuned LLMs, which seem to actually act on human values (i.e., our actual intentions) quite well, as opposed to mis-specified versions of our intentions.
I don’t think the Value Learning Problem paper said that it would be easy to make human-level AGI systems act on human values in a behavioral sense, rather than merely understand human values in a passive sense.
I suspect you are probably conflating two separate concepts:
(1) It is easy to create a human-level AGI that can passively learn and understand human values (I am not saying people said this would be difficult in the past)
(2) It is easy to create a human-level AGI that acts on human values, in the sense of actually executing instructions that follow our intentions, rather than following a dangerously mis-specified version of what we asked for.
I do not think the Value Learning Problem paper asserted that (2) was true. To the extent it asserted that, I would prefer to see quotes that back up that claim explicitly.
Your quote from the paper illustrates that it’s very plausible that people thought (1) was true, but that seems separate to my main point: that people thought (2) was not true. (1) and (2) are separate and distinct concepts. And my comment was about (2), not (1).
There is simply a distinction between a machine that actually acts on and executes your intended commands, and a machine that merely understands your intended commands, but does not necessarily act on them as you intend. I am talking about the former, not the latter.
From the paper,
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent.
Indeed, and GPT-4 does not base its decisions on a misrepresentation of its programmers’ intentions, most of the time. It generally both correctly understands our intentions and, more importantly, actually acts on them!
You’ve made detailed predictions about what you expect in the next several years, on numerous occasions, and made several good-faith attempts to elucidate your models of AI concretely. There are many ways we disagree, and many ways I could characterize your views, but “unfalsifiable” is not a label I would tend to use for your opinions on AI. I do not mentally lump you together with MIRI in any strong sense.
For what it’s worth, while my credence in human extinction from AI in the 21st century is 10-20%, I think the chance of human extinction in the next 5 years is much lower. I’d put that at around 1%. The main way I think AI could cause human extinction is by just generally accelerating technology and making the world a scarier and more dangerous place to live. I don’t really buy the model in which an AI will soon foom until it becomes a ~god.
I’m confused about why your <20% isn’t sufficient for you to want to shut down AI research. Is it because the benefits outweigh the risk, or because we’ll gain evidence about potential danger and can shut down later if necessary?
I think the expected benefits outweigh the risks, given that I care about the existing generation of humans (to a large, though not overwhelming degree). The expected benefits here likely include (in my opinion) a large reduction in global mortality, a very large increase in the quality of life, a huge expansion in material well-being, and more generally a larger and more vibrant world earlier in time. Without AGI, I think most existing people would probably die and get replaced by the next generation of humans, in a relatively much poorer world (compared to the alternative).
I also think the absolute level of risk from AI barely decreases if we globally pause. My best guess is that pausing would mainly just delay adoption without significantly impacting safety. Under my model of AI, the primary risks are long-term, and will happen substantially after humans have already gradually “handed control” over to the AIs and retired their labor on a large scale. Most of these problems (such as cultural drift and evolution) do not seem to be the type of issue that can be satisfactorily solved in advance, prior to a pause (especially by working out a mathematical theory of AI, or something like that).
On the level of analogy, I think of AI development as more similar to “handing off control to our children” than “developing a technology that disempowers all humans at a discrete moment in time”. In general, I think the transition period to AI will be more diffuse and incremental than MIRI seems to imagine, and there won’t be a sharp distinction between “human values” and “AI values” either during, or after the period.
(I also think AIs will probably be conscious in a way that’s morally important, in case that matters to you.)
In fact, I think it’s quite plausible the absolute level of AI risk would increase under a global pause, rather than going down, given the high level of centralization of power required to achieve a global pause, and the perverse institutions and cultural values that would likely arise under such a regime of strict controls. As a result, even if I weren’t concerned at all about the current generation of humans, and their welfare, I’d still be pretty hesitant to push pause on the entire technology.
(I think of technology as itself being pretty risky, but worth it. To me, pushing pause on AI is like pushing pause on technology itself, in the sense that they’re both generically risky yet simultaneously seem great on average. Yes, there are dangers ahead. But I think we can be careful and cautious without completely ripping up all the value for ourselves.)
Chemists would give an example of chemical reactions, where final thermodynamically stable states are easy to predict, while unstable intermediate states are very hard to even observe.
I agree there are examples where the end state is easier to predict than the intermediate states. Here, it’s because we have strong empirical and theoretical reasons to think that chemicals will settle into some equilibrium after a reaction. With AGI, I have yet to see a compelling argument for why we should expect a specific easy-to-predict equilibrium state after it’s developed, which somehow depends very little on how the technology is developed.
It’s also important to note that, even if we know that there will be an equilibrium state after AGI, more evidence is generally needed to establish that the end equilibrium state will specifically be one in which all humans die.
And why don’t you accept the classic MIRI example that even if it’s impossible for a human to predict the moves of Stockfish 16, you can be certain that Stockfish will win?
I don’t accept this argument as a good reason to think doom is highly predictable partly because I think the argument is dramatically underspecified without shoehorning in assumptions about what AGI will look like to make the argument more comprehensible. I generally classify arguments like this under the category of “analogies that are hard to interpret because the assumptions are so unclear”.
To help explain my frustration at the argument’s ambiguity, I’ll just give a small yet certainly non-exhaustive set of questions I have about this argument:
Are we imagining that creating an AGI implies that we play a zero-sum game against it? Why?
Why is it a simple human vs. AGI game anyway? Does that mean we’re lumping together all the humans into a single agent, and all the AGIs into another agent, and then they play off against each other like a chess match? What is the justification for believing the battle will be binary like this?
Are we assuming the AGI wants to win? Maybe it’s not an agent at all. Or maybe it’s an agent but not the type of agent that wants this particular type of outcome.
What does “win” mean in the general case here? Does it mean the AGI merely gets more resources than us, or does it mean the AGI kills everyone? These seem like different yet legitimate ways that one can “win” in life, with dramatically different implications for the losing parties.
There’s a lot more I can say here, but the basic point I want to make is that once you start fleshing this argument out, and giving it details, I think it starts to look a lot weaker than the general heuristic that Stockfish 16 will reliably beat humans in chess, even if we can’t predict its exact moves.
There’s a pretty big difference between statements like “superintelligence is physically possible”, “superintelligence could be dangerous” and statements like “doom is >80% likely in the 21st century unless we globally pause”. I agree with (and am not objecting to) the former claims, but I don’t agree with the latter claim.
I also agree that it’s sometimes true that endpoints are easier to predict than intermediate points. I haven’t seen Eliezer give a reasonable defense of this thesis as it applies to his doom model. If all he means here is that superintelligence is possible, it will one day be developed, and we should be cautious when developing it, then I don’t disagree. But I think he’s saying a lot more than that.
I think it’s more similar to saying that the climate in 2040 is less predictable than the climate in 2100, or saying that the weather 3 days from now is less predictable than the weather 10 days from now, which are both not true. By contrast, the weather vs. climate distinction is more of a difference between predicting point estimates vs. predicting averages.
I unfortunately am busy right now but would love to give a fuller response someday, especially if you are genuinely interested to hear what I have to say (which I doubt, given your attitude towards MIRI).
I’m a bit surprised you suspect I wouldn’t be interested in hearing what you have to say?
I think the amount of time I’ve spent engaging with MIRI perspectives over the years provides strong evidence that I’m interested in hearing opposing perspectives on this issue. I’d guess I’ve engaged with MIRI perspectives vastly more than almost everyone on Earth who explicitly disagrees with them as strongly as I do (although obviously some people like Paul Christiano and other AI safety researchers have engaged with them even more than me).
(I might not reply to you, but that’s definitely not because I wouldn’t be interested in what you have to say. I read virtually every comment-reply to me carefully, even if I don’t end up replying.)
I appreciate the straightforward and honest nature of this communication strategy, in the sense of “telling it like it is” and not hiding behind obscure or vague language. In that same spirit, I’ll provide my brief, yet similarly straightforward reaction to this announcement:
I think MIRI is incorrect in their assessment of the likelihood of human extinction from AI. As per their messaging, several people at MIRI seem to believe that doom is >80% likely in the 21st century (conditional on no global pause) whereas I think it’s more like <20%.
MIRI’s arguments for doom are often difficult to pin down, given the informal nature of their arguments, and in part due to their heavy reliance on analogies, metaphors, and vague supporting claims instead of concrete empirically verifiable models. Consequently, I find it challenging to respond to MIRI’s arguments precisely. The fact that they want to essentially shut down the field of AI based on these largely informal arguments seems premature to me.
MIRI researchers rarely provide any novel predictions about what will happen before AI doom, making their theories of doom appear unfalsifiable. This frustrates me. Given a low prior probability of doom as apparent from the empirical track record of technological progress, I think we should generally be skeptical of purely theoretical arguments for doom, especially if they are vague and make no novel, verifiable predictions prior to doom.
Separately from the previous two points, MIRI’s current most prominent arguments for doom seem very weak to me. Their broad model of doom appears to be something like the following (although they would almost certainly object to the minutiae of how I have written it here):
(1) At some point in the future, a powerful AGI will be created. This AGI will be qualitatively distinct from previous, more narrow AIs. Unlike concepts such as “the economy”, “GPT-4”, or “Microsoft”, this AGI is not a mere collection of entities or tools integrated into broader society that can automate labor, share knowledge, and collaborate on a wide scale. This AGI is instead conceived of as a unified and coherent decision agent, with its own long-term values that it acquired during training. As a result, it can do things like lie about all of its fundamental values and conjure up plans of world domination, by itself, without any risk of this information being exposed to the wider world.
(2) This AGI, via some process such as recursive self-improvement, will rapidly “foom” until it becomes essentially an immortal god, at which point it will be able to do almost anything physically attainable, including taking over the world at almost no cost or risk to itself. While recursive self-improvement is the easiest mechanism to imagine here, it is not the only way this could happen.
(3) The long-term values of this AGI will bear almost no relation to the values that we tried to instill through explicit training, because of difficulties in inner alignment (i.e., a specific version of the general phenomenon of models failing to generalize correctly from training data). This implies that the AGI will care almost literally 0% about the welfare of humans (despite potentially being initially trained from the ground up on human data, and carefully inspected and tested by humans for signs of misalignment, in diverse situations and environments). Instead, this AGI will pursue a completely meaningless goal until the heat death of the universe.
(4) Therefore, the AGI will kill literally everyone after fooming and taking over the world.
It is difficult to explain in a brief comment why I think the argument just given is very weak. Instead of going into the various subclaims here in detail, for now I want to simply say, “If your model of reality has the power to make these sweeping claims with high confidence, then you should almost certainly be able to use your model of reality to make novel predictions about the state of the world prior to AI doom that would help others determine if your model is correct.”
The fact that MIRI has yet to produce (to my knowledge) any major empirically validated predictions or important practical insights into the nature of AI, or AI progress, in the last 20 years, undermines the idea that they have the type of special insight into AI that would allow them to express high confidence in a doom model like the one outlined in (4).
Eliezer’s response to claims about unfalsifiability, namely that “predicting endpoints is easier than predicting intermediate points”, seems like a cop-out to me, since this would seem to reverse the usual pattern in forecasting and prediction, without good reason.
Since I think AI will most likely be a very good thing for currently existing people, I am much more hesitant to “shut everything down” compared to MIRI. I perceive MIRI researchers as broadly well-intentioned, thoughtful, yet ultimately fundamentally wrong in their worldview on the central questions that they research, and therefore likely to do harm to the world. This admittedly makes me sad to think about.
He did talk about enforcing a global treaty backed by the threat of force (because all law is ultimately backed by violence, don’t pretend otherwise)
Most international treaties are not backed by military force, such as the threat of airstrikes. They’re typically backed by more informal pressures, such as diplomatic isolation, conditional aid, sanctions, asset freezing, damage to credibility and reputation, and threats of mutual defection (i.e., “if you don’t follow the treaty, then I won’t either”). It seems bad to me that Eliezer’s article incidentally amplified the idea that most international treaties are backed by straightforward threats of war, because that idea is not true.
I also expect AIs to be constrained by social norms, laws, and societal values. But I think there’s a distinction between how AIs will be constrained and how AIs will try to help humans. Although it often censors certain topics, Google still usually delivers the results the user wants, rather than serving some broader social agenda upon each query. Likewise, ChatGPT is constrained by social mores, but it’s still better described as a user assistant, not as an engine for social change or as a benevolent agent that acts on behalf of humanity.
No arbitrarily powerful AI could succeed at taking over the world
This is closest to what I am saying. The current world appears to be in a state of inter-agent competition. Even as technology has gotten more advanced, and as agents have gotten more powerful over time, no single unified agent has been able to obtain control over everything and win the entire pie, defeating all the other agents. I think we should expect this state of affairs to continue even as AGI gets invented and technology continues to get more powerful.
(One plausible exception to the idea that “no single agent has ever won the competition over the world” is the human species itself, which dominates over other animal species. But I don’t think the human species is well-described as a unified agent, and I think our power comes mostly from accumulated technological abilities, rather than raw intelligence by itself. This distinction is important because the effects of technological innovation generally diffuse across society rather than giving highly concentrated powers to the people who invent stuff. This generally makes the situation with humans vs. animals disanalogous to a hypothetical AGI foom in several important ways.)
Separately, I also think that even if an AGI agent could violently take over the world, it would likely not be rational for it to try, due to the fact that compromising with the rest of the world would be a less risky and more efficient way of achieving its goals. I’ve written about these ideas in a shortform thread here.
It sounds like you’re thinking mostly of AI and not AGI that can self-improve at some point
I think you can simply have an economy of arbitrarily powerful AGI services, some of which contribute to R&D in a way that feeds into the entire development process recursively. There’s nothing here about my picture that rejects general intelligence, or R&D feedback loops.
My guess is that the actual disagreement here is that you think that at some point a unified AGI will foom and take over the world, becoming a centralized authority that is able to exert its will on everything else without constraint. I don’t think that’s likely to happen. Instead, I think we’ll see inter-agent competition and decentralization indefinitely (albeit with increasing economies of scale, prompting larger bureaucratic organizations, in the age of AGI).
Here’s something I wrote that seems vaguely relevant, and might give you a sense as to what I’m imagining,
Given that we are already seeing market forces shaping the values of existing commercialized AIs, it is confusing to me why an EA would assume this fact will at some point no longer be true. To explain this, my best guess is that many EAs have roughly the following model of AI development:
There is “narrow AI”, which will be commercialized, and its values will be determined by market forces, regulation, and to a limited degree, the values of AI developers. In this category we find GPT-4 from OpenAI, Gemini from Google, and presumably at least a few future iterations of these products.
Then there is “general AI”, which will at some point arrive, and is qualitatively different from narrow AI. Its values will be determined almost solely by the intentions of the first team to develop AGI, assuming they solve the technical problems of value alignment.
My advice is that we should probably just drop the second step, and think of future AI as simply continuing from the first step indefinitely, albeit with AIs becoming incrementally more general and more capable over time.
Yes, but I don’t consider this outcome very pessimistic because this is already what the current world looks like. How commonly do businesses work for the common good of all humanity, rather than for the sake of their shareholders? The world is not a utopia, but I guess that’s something I’ve already gotten used to.
I think we probably disagree substantially on the difficulty of alignment and the relationship between “resources invested in alignment technology” and “what fraction aligned those AIs are” (by fraction aligned, I mean what fraction of resources they take as a cut).
That’s plausible. If you think that we can likely solve the problem of ensuring that our AIs stay perfectly obedient and aligned to our wishes perpetually, then you are indeed more optimistic than I am. Ironically, by virtue of my pessimism, I’m more happy to roll the dice and hasten the arrival of imperfect AI, because I don’t think it’s worth trying very hard and waiting a long time to try to come up with a perfect solution that likely doesn’t exist.
I also think that something like a basin of corrigibility is plausible and maybe important: if you have mostly aligned AIs, you can use such AIs to further improve alignment, potentially rapidly.
I mostly see corrigible AI as a short-term solution (although a lot depends on how you define this term). I thought the idea of a corrigible AI is that you’re trying to build something that isn’t itself independent and agentic, but will help you in your goals regardless. In this sense, GPT-4 is corrigible, because it’s not an independent entity that tries to pursue long-term goals, but it will try to help you.
But purely corrigible AIs seem pretty obviously uncompetitive with more agentic AIs in the long-run, for almost any large-scale goal that you have in mind. Ideally, you eventually want to hire something that doesn’t require much oversight and operates relatively independently from you. It’s a bit like how, when hiring an employee, at first you want to teach them everything you can and monitor their work, but eventually, you want them to take charge and run things themselves as best they can, without much oversight.
And I’m not convinced you could use corrigible AIs to help you come up with the perfect solution to AI alignment, as I’m not convinced that something like that exists. So, ultimately I think we’re probably just going to deploy autonomous slightly misaligned AI agents (and again, I’m pretty happy to do that, because I don’t think it would be catastrophic except maybe over the very long-run).
I think various governments will find it unacceptable to construct massively powerful agents extremely quickly which aren’t under the control of their citizens or leaders.
I think people will justifiably freak out if AIs clearly have long run preferences and are powerful and this isn’t currently how people are thinking about the situation.
For what it’s worth, I’m not sure which part of my scenario you are referring to here, because these are both statements I agree with.
In fact, this consideration is a major part of my general aversion to pushing for an AI pause, because, as you say, governments will already be quite skeptical of quickly deploying massively powerful agents that we can’t fully control. By default, I think people will probably freak out and try to slow down advanced AI, even without any intervention from current effective altruists and rationalists. By contrast, I’m a lot more ready than the median person to roll out autonomous AI agents that we can’t fully control, simply because I see a lot of value in hastening the arrival of such agents (i.e., I don’t find that outcome as scary as most other people seem to imagine).
At the same time, I don’t think people will pause forever. I expect people to go more slowly than what I’d prefer, but I don’t expect people to pause AI for centuries either. And in due course, so long as at least some non-negligible misalignment “slips through the cracks”, then AIs will become more and more independent (both behaviorally and legally), their values will slowly drift, and humans will gradually lose control—not overnight, or all at once, but eventually.
One reason to support prison as punishment for crimes over corporal punishment is that prisons confine and isolate dangerous individuals for lengthy periods, protecting the general public via physical separation.
I’d argue that physically preventing certain violent people from being able to harm others is indeed one of the most important purposes served by criminal law, and it’s not served very well by corporal punishment. Some individuals are simply too impulsive or myopic to be deterred by corporal punishment. Almost the moment you let them free, after their beating, they’d just begin committing crimes again. By contrast, putting them in a high security prison allows society to monitor these people and prevent them from harming others directly.
The death penalty perhaps served this purpose in the past by making violent criminals permanently incapable of harming others ever again, but our society has (probably correctly) largely decided that it is morally wrong to toss away someone’s life merely because they are pathologically dangerous. Therefore, prison serves as a useful compromise when protecting the public from violent criminals who are unable to stop committing repeated offenses.
Thankfully, most people generally age out of crime, so life sentences are rarely necessary, even for those who are generally quite violent.