Communications @ MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer’s. (Though we agree about an awful lot.)
Rob Bensinger
MIRI’s 2024 End-of-Year Update
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)
This paragraph doesn’t seem like an honest summary to me. Eliezer’s position in the dialogue, as I understood it, was:
The journey is a lot harder to predict than the destination. Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”. Eliezer isn’t claiming to have secret insights about the detailed year-to-year or month-to-month changes in the field; if he thought that, he’d have been making those near-term tech predictions already back in 2010, 2015, or 2020 to show that he has this skill.
From Eliezer’s perspective, Paul is claiming to know a lot about the future trajectory of AI, and not just about the endpoints: Paul thinks progress will be relatively smooth and continuous, and thinks it will get increasingly smooth and continuous as time passes and more resources flow into the field. Eliezer, by contrast, expects the field to get choppier as time passes and we get closer to ASI.
A way to bet on this, which Eliezer repeatedly proposed but wasn’t able to get Paul to do very much, would be for Paul to list out a bunch of concrete predictions that Paul sees as “yep, this is what smooth and continuous progress looks like”. Then, even though Eliezer doesn’t necessarily have a concrete “nope, the future will go like X instead of Y” prediction, he’d be willing to bet against a portfolio of Paul-predictions: when you expect the future to be more unpredictable, you’re willing to at least weakly bet against any sufficiently ambitious pool of concrete predictions.
(Also, if Paul generated a ton of predictions like that, an occasional prediction might indeed make Eliezer go “oh wait, I do have a strong prediction on that question in particular; I didn’t realize this was one of our points of disagreement”. I don’t think this is where most of the action is, but it’s at least a nice side-effect of the person-who-thinks-this-tech-is-way-more-predictable spelling out predictions.)
Eliezer was also more interested in trying to reach mutual understanding of the views on offer, as opposed to “let’s bet on things immediately, never mind the world-views”. But insofar as Paul really wanted to have the bets conversation instead, Eliezer sank an awful lot of time into trying to find operationalizations Paul and he could bet on, over many hours of conversation.
If your end-point take-away from that (even after actual bets were in fact made, and tons of different high-level predictions were sketched out) is “wow how dare Eliezer be so unwilling to make bets on anything”, then I feel a lot less hope that world-models like Eliezer’s (“long-term outcome is more predictable than the detailed year-by-year tech pathway”) are going to be given a remotely fair hearing.
(Also, in fairness to Paul, I’d say that he spent a bunch of time working with Eliezer to try to understand the basic methodologies and foundations for their perspectives on the world. I think both Eliezer and Paul did an admirable job going back and forth between the thing Paul wanted to focus on and the thing Eliezer wanted to focus on, letting us look at a bunch of different parts of the elephant. And I don’t think it was unhelpful for Paul to try to identify operationalizations and bets, as part of the larger discussion; I just disagree with TurnTrout’s summary of what happened.)
If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!?
Quoting the abstract of MIRI’s “The Value Learning Problem” paper (emphasis added):
Autonomous AI systems’ programmed goals can easily fall short of programmers’ intentions. Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. We discuss early ideas on how one might design smarter-than-human AI systems that can inductively learn what to value from labeled training data, and highlight questions about the construction of systems that model and act upon their operators’ preferences.
And quoting from the first page of that paper:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.1 Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
I won’t weigh in on how many LessWrong posts at the time were confused about where the core of the problem lies. But “The Value Learning Problem” was one of the seven core papers in which MIRI laid out our first research agenda, so I don’t think “we’re centrally worried about things that are capable enough to understand what we want, but that don’t have the right goals” was in any way hidden or treated as minor back in 2014-2015.
I also wouldn’t say “MIRI predicted that NLP will largely fall years before AI can match e.g. the best human mathematicians, or the best scientists”, and if we saw a way to leverage that surprise to take a big bite out of the central problem, that would be a big positive update.
I’d say:
MIRI mostly just didn’t make predictions about the exact path ML would take to get to superintelligence, and we’ve said we didn’t expect this to be very predictable because “the journey is harder to predict than the destination”. (Cf. “it’s easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be”.)
Back in 2016-2017, I think various people at MIRI updated to median timelines in the 2030-2040 range (after having had longer timelines before that), and our timelines haven’t jumped around a ton since then (though they’ve gotten a little bit longer or shorter here and there).
So in some sense, qualitatively eyeballing the field, we don’t feel surprised by “the total amount of progress the field is exhibiting”, because it looked in 2017 like the field was just getting started, there was likely an enormous amount more you could do with 2017-style techniques (and variants on them) than had already been done, and there was likely to be a lot more money and talent flowing into the field in the coming years.
But “the total amount of progress over the last 7 years doesn’t seem that shocking” is very different from “we predicted what that progress would look like”. AFAIK we mostly didn’t have strong guesses about that, though I think it’s totally fine to say that the GPT series is more surprising to the circa-2017 MIRI than a lot of other paths would have been.
(Then again, we’d have expected something surprising to happen here, because it would be weird if our low-confidence visualizations of the mainline future just happened to line up with what happened. You can expect to be surprised a bunch without being able to guess where the surprises will come from; and in that situation, there’s obviously less to be gained from putting out a bunch of predictions you don’t particularly believe in.)
Pre-deep-learning-revolution, we made early predictions like “just throwing more compute at the problem without gaining deep new insights into intelligence is less likely to be the key thing that gets us there”, which was falsified. But that was a relatively high-level prediction; post-deep-learning-revolution we haven’t claimed to know much about how advances are going to be sequenced.
We have been quite interested in hearing from others about their advance prediction record: it’s a lot easier to say “I personally have no idea what the qualitative capabilities of GPT-2, GPT-3, etc. will be” than to say “… and no one else knows either”, and if someone has an amazing track record at guessing a lot of those qualitative capabilities, I’d be interested to hear about their further predictions. We’re generally pessimistic that “which of these specific systems will first unlock a specific qualitative capability?” is particularly predictable, but this claim can be tested via people actually making those predictions.
But the benefit of a Pause is that you use the extra time to do something in particular. Why wouldn’t you want to fiscally sponsor research on problems that you think need to be solved for the future of Earth-originating intelligent life to go well?
MIRI still sponsors some alignment research, and I expect we’ll sponsor more alignment research directions in the future. I’d say MIRI leadership didn’t have enough aggregate hope in Agent Foundations in particular to want to keep supporting it ourselves (though I consider its existence net-positive).
My model of MIRI is that our main focus these days is “find ways to make it likelier that a halt occurs” and “improve the world’s general understanding of the situation in case this helps someone come up with a better idea”, but that we’re also pretty open to taking on projects in all four of these quadrants, if we find something that’s promising and that seems like a good fit at MIRI (or something promising that seems unlikely to occur if it’s not housed at MIRI):
| | AI alignment work | Non-alignment work |
|---|---|---|
| High-EV absent a pause | | |
| High-EV given a pause | | |
one positive feature it does have, it proposes to rely on a multitude of “limited weakly-superhuman artificial alignment researchers” and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable.
I don’t find this convincing. I think the target “dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment” is narrow (or just nonexistent, using the methods we’re likely to have on hand).
Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where checking is particularly difficult.
It’s also important to keep in mind that on Leopold’s model (and my own), these problems need to be solved under a ton of time pressure. To maintain a lead, the USG in Leopold’s scenario will often need to figure out some of these “under what circumstances can we trust this highly novel system and believe its alignment answers?” issues in a matter of weeks or perhaps months, so that the overall alignment project can complete in a very short window of time. This is not a situation where we’re imagining having a ton of time to develop mastery and deep understanding of these new models. (Or mastery of the alignment problem sufficient to verify when a new idea is on the right track or not.)
You and Leopold seem to share the assumption that huge GPU farms or equivalently strong compute are necessary for superintelligence.
Nope! I don’t assume that.
I do think that it’s likely the first world-endangering AI is trained using more compute than was used to train GPT-4; but I’m certainly not confident of that prediction, and I don’t think it’s possible to make reasonable predictions (given our current knowledge state) about how much more compute might be needed.
(“Needed” for the first world-endangeringly powerful AI humans actually build, that is. I feel confident that you can in principle build world-endangeringly powerful AI with far less compute than was used to train GPT-4; but the first lethally powerful AI systems humans actually build will presumably be far from the limits of what’s physically possible!)
But what would happen if one effectively closes that path? There will be huge selection pressure to look for alternative routes, to invest more heavily in those algorithmic breakthroughs which can work with modest GPU power or even with CPUs.
Agreed. This is why I support humanity working on things like human enhancement and (plausibly) AI alignment, in parallel with working on an international AI development pause. I don’t think that a pause on its own is a permanent solution, though if we’re lucky and the laws are well-designed I imagine it could buy humanity quite a few decades.
I hope people will step back from solely focusing on advocating for policy-level prescriptions (as none of the existing policy-level prescriptions look particularly promising at the moment) and invest some of their time in continuing object-level discussions of AI existential safety without predefined political ends.
FWIW, MIRI does already think of “generally spreading reasonable discussion of the problem, and trying to increase the probability that someone comes up with some new promising idea for addressing x-risk” as a top organizational priority.
The usual internal framing is some version of “we have our own current best guess at how to save the world, but our idea is a massive longshot, and not the sort of basket humanity should put all its eggs in”. I think “AI pause + some form of cognitive enhancement” should be a top priority, but I also consider it a top priority for humanity to try to find other potential paths to a good future.
As a start, you can prohibit sufficiently large training runs. This isn’t a necessary-and-sufficient condition, and doesn’t necessarily solve the problem on its own, and there’s room for debate about how risk changes as a function of training resources. But it’s a place to start, when the field is mostly flying blind about where the risks arise; and choosing a relatively conservative threshold makes obvious sense when failing to leave enough safety buffer means human extinction. (And when algorithmic progress is likely to reduce the minimum dangerous training size over time, whatever it is today—also a reason the cap is likely to need to lower over time to some extent, until we’re out of the lethally dangerous situation we currently find ourselves in.)
Alternatively, they either don’t buy the perils or believes there’s a chance the other chance may not?
If they “don’t buy the perils”, and the perils are real, then Leopold’s scenario is falsified and we shouldn’t be pushing for the USG to build ASI.
If there are no perils at all, then sure, Leopold’s scenario and mine are both false. I didn’t mean to imply that our two views are the only options.
Separately, Leopold’s model of “what are the dangers?” is different from mine. But I don’t think the dangers Leopold is worried about are dramatically easier to understand than the dangers I’m worried about (in the respective worlds where our worries are correct). Just the opposite: the level of understanding you need to literally solve alignment for superintelligences vastly exceeds the level you need to just be spooked by ASI and not want it to be built. Which is the point I was making; not “ASI is axiomatically dangerous”, but “this doesn’t count as a strike against my plan relative to Leopold’s, and in fact Leopold is making a far bigger ask of government than I am on this front”.
Nuclear war essentially has a localized p(doom) of 1
I don’t know what this means. If you’re saying “nuclear weapons kill the people they hit”, I don’t see the relevance; guns also kill the people they hit, but that doesn’t make a gun strategically similar to a smarter-than-human AI system.
Yep, I had in mind AI Forecasting: One Year In.
Why? 95% risk of doom isn’t certainty, but seems obviously more than sufficient.
For that matter, why would the USG want to build AGI if they considered it a coinflip whether this will kill everyone or not? The USG could choose the coinflip, or it could choose to try to prevent China from putting the world at risk without creating that risk itself. “Sit back and watch other countries build doomsday weapons” and “build doomsday weapons yourself” are not the only two options.
Leopold’s scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don’t have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence, and the lives of all its citizens on a half-baked mad science initiative, when it could simply work with its allies to block the tech’s development and maintain the status quo at minimal risk.
Success in this scenario requires a weird combination of USG prescience with self-destructiveness: enough foresight to see what’s coming, but paired with a weird compulsion to race to build the very thing that puts its existence at risk, when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.
Responding to Matt Reardon’s point on the EA Forum:
Leopold’s implicit response as I see it:
Convincing all stakeholders of high p(doom) such that they take decisive, coordinated action is wildly improbable (“step 1: get everyone to agree with me” is the foundation of many terrible plans and almost no good ones)
Still improbable, but less wildly, is the idea that we can steer institutions towards sensitivity to risk on the margin and that those institutions can position themselves to solve the technical and other challenges ahead
Maybe the key insight is that both strategies walk on a knife’s edge. While Moore’s law, algorithmic improvement, and chip design hum along at some level, even a little breakdown in international willpower to enforce a pause/stop can rapidly convert to catastrophe. Spending a lot of effort to get that consensus also has high opportunity cost in terms of steering institutions in the world where the effort fails (and it is very likely to fail). [...]
Three high-level reasons I think Leopold’s plan looks a lot less workable:
It requires major scientific breakthroughs to occur on a very short time horizon, including unknown breakthroughs that will manifest to solve problems we don’t understand or know about today.
These breakthroughs need to come in a field that has not been particularly productive or fast in the past. (Indeed, forecasters have been surprised by how slowly safety/robustness/etc. have progressed in recent years, and simultaneously surprised by the breakneck speed of capabilities.)
It requires extremely precise and correct behavior by a giant government bureaucracy that includes many staff who won’t be the best and brightest in the field — inevitably, many technical and nontechnical people in the bureaucracy will have wrong beliefs about AGI and about alignment.
The “extremely precise and correct behavior” part means that we’re effectively hoping to be handed an excellent bureaucracy that will rapidly and competently solve a thirty-digit combination lock requiring the invention of multiple new fields and the solving of a variety of thorny and poorly-understood technical problems — in many cases, on Leopold’s view, in a space of months or weeks. This seems… not like how the real world works.
It also separately requires that various guesses about the background empirical facts all pan out. Leopold can do literally everything right and get the USG fully on board and get the USG doing literally everything correctly by his lights — and then the plan ends up destroying the world rather than saving it because it just happened to turn out that ASI was a lot more compute-efficient to train than he expected, resulting in the USG being unable to monopolize the tech and unable to achieve a sufficiently long lead time.
My proposal doesn’t require anything qualitatively like that kind of success. It requires governments to coordinate on banning things. Plausibly, it requires governments to overreact to a weird, scary, and publicly controversial new tech to some degree, since it’s unlikely that governments will exactly hit the target we want. This is not a particularly weird ask; governments ban things (and coordinate or copy-paste each other’s laws) all the time, in far less dangerous and fraught areas than AGI. This is “trying to get the international order to lean hard in a particular direction on a yes-or-no question where there’s already a lot of energy behind choosing ‘no’”, not “solving a long list of hard science and engineering problems in a matter of months and getting a bureaucracy to output the correct long string of digits to nail down all the correct technical solutions and all the correct processes to find those solutions”.
The CCP’s current appetite for AGI seems remarkably small, and I expect them to be more worried that an AGI race would leave them in the dust (and/or put their regime at risk, and/or put their lives at risk), than excited about the opportunity such a race provides. Governments around the world currently, to the best of my knowledge, are nowhere near advancing any frontiers in ML.
From my perspective, Leopold is imagining a future problem into being (“all of this changes”) and then trying to find a galaxy-brained incredibly complex and assumption-laden way to wriggle out of this imagined future dilemma, when the far easier and less risky path would be to not have the world powers race in the first place, have them recognize that this technology is lethally dangerous (something the USG chain of command, at least, would need to fully internalize on Leopold’s plan too), and have them block private labs from sending us over the precipice (again, something Leopold assumes will happen) while not choosing to take on the risk of destroying themselves (nor permitting other world powers to unilaterally impose that risk).
(Though he also has an incentive to not die.)
Response to Aschenbrenner’s “Situational Awareness”
As is typical for Twitter, we also signal-boosted a lot of other people’s takes. Some non-MIRI people whose social media takes I’ve recently liked include Wei Dai, Daniel Kokotajlo, Jeffrey Ladish, Patrick McKenzie, Zvi Mowshowitz, Kelsey Piper, and Liron Shapira.
The stuff I’ve been tweeting doesn’t constitute an official MIRI statement — e.g., I don’t usually run these tweets by other MIRI folks, and I’m not assuming everyone at MIRI agrees with me or would phrase things the same way. That said, some recent comments and questions from me and Eliezer:
May 17: Early thoughts on the news about OpenAI’s crazy NDAs.
May 24: Eliezer flags that GPT-4o can now pass one of Eliezer’s personal ways of testing whether models are still bad at math.
May 29: My initial reaction to hearing Helen’s comments on the TED AI podcast. Includes some follow-on discussion of the ChatGPT example, etc.
May 30: A conversation between me and Emmett Shear about the version of events he’d tweeted in November. (Plus a comment from Eliezer.)
May 30: Eliezer signal-boosting a correction from Paul Graham.
June 4: Eliezer objects to Aschenbrenner’s characterization of his timelines argument as open-and-shut “believing in straight lines on a graph”.
Every protest I’ve witnessed seemed to be designed to annoy and alienate its witnesses, making it as clear as possible that there was no way to talk to these people, that their minds were on rails. I think most people recognize that as cult shit and are alienated by that.
In the last year, I’ve seen a Twitter video of an AI risk protest (I think possibly in continental Europe?) that struck me as extremely good: calm, thoughtful, accessible, punchy, and sensible-sounding statements and interview answers. If I find the link again, I’ll add it here as a model of what I think a robustly good protest can look like!
A leftist friend once argued that protest is not really a means, but a reward, a sort of party for those who contributed to local movementbuilding. I liked that view.
I wouldn’t recommend making protests purely this. A lot of these protests are getting news coverage and have a real chance of either intriguing/persuading or alienating potential allies; I think it’s worth putting thought into how to hit the “intriguing/persuading” target, regardless of whether this is “normal” for protests.
But I like the idea of “protest as reward” as an element of protests, or as a focus for some protests. :)
Could we talk about a specific expert you have in mind, who thinks this is a bad strategy in this particular case?
AI risk is a pretty weird case, in a number of ways: it’s highly counter-intuitive, not particularly politically polarized / entrenched, seems to require unprecedentedly fast and aggressive action by multiple countries, is almost maximally high-stakes, etc. “Be careful what you say, try to look normal, and slowly accumulate political capital and connections in the hope of swaying policymakers long-term” isn’t an unconditionally good strategy, it’s a strategy adapted to a particular range of situations and goals. I’d be interested in actually hearing arguments for why this strategy is the best option here, given MIRI’s world-model.
(Or, separately, you could argue against the world-model, if you disagree with us about how things are.)
?
I didn’t cross-post it, but I’ve poked EY about the title!