Previously (Eliezer Yudkowsky): The Rocket Alignment Problem.
Recently we had a failure to launch, and a failure to communicate around that failure to launch. This post explores that failure to communicate, and the attempted message.
Some Basic Facts about the Failed Launch
Elon Musk’s SpaceX launched a rocket. Unfortunately, the rocket blew up, and failed to reach orbit. SpaceX will need to try again, once the launch pad is repaired.
There was various property damage, but from what I have seen no one was hurt.
I’ve heard people say the whole launch was a s***show and the grounding was ‘well earned.’ How the things that went wrong were absurd, SpaceX is the worst, and so on.
The government response? SpaceX Starship Grounded Indefinitely By FAA.
An FAA spokesperson told FLYING that mishap investigations, which are standard in cases such as this, “might conclude in a matter of weeks,” but more complex investigations “might take several months.”
Perhaps this will be a standard investigation, and several months later everything will be fine. Perhaps it won’t be, and SpaceX will never fly again because those in power dislike Elon Musk and want to seize this opportunity.
There are also many who would be happy that humans won’t get to go into space, if in exchange we get to make Elon Musk suffer, perhaps including those with power. Other signs point to the relationships with regulators remaining strong, yet in the wake of the explosion the future of Starship is for now out of SpaceX’s hands.
A Failure to Communicate
In light of these developments, before we knew the magnitude or duration of the grounding, Eliezer wrote the following, which very much failed in its communication.
If the first prototype of your most powerful rocket ever doesn’t make it perfectly to orbit and land safely after, you may be a great rocket company CEO but you’re not qualified to run an AGI company.
(Neither is any other human. Shut it down.)
Eliezer has been using the rocket metaphor for AI alignment for a while, see The Rocket Alignment Problem.
I knew instantly both what the true and important point was here, and also the way in which most people would misunderstand.
The idea is that in order to solve AGI alignment, you need to get it right on the first try. If you create an AGI and fail at its alignment, you do not get to scrap the experiment and learn from what happened. You do not get to try, try again until you succeed, the way we do for things like rocket launches.
That is because you created an unaligned AGI. Which kills you.
Eliezer’s point here was that the equivalent difficulty level and problem configuration to aligning an AGI successfully would be Musk sticking the landing on Starship on the first try. His first attempt to launch the rocket would need to end up safely back on the launch pad.
The problem is that the rocket blowing up need not even get one person killed, let alone kill everyone. The rocket blowing up caused a bunch of property damage. Why Play in Hard Mode (or Impossible Mode) when you only need to Play in Easy Mode?
Here were two smart people pointing out exactly this issue.
Jeffrey Ladish: I like the rocket analogy but in this case I don’t think it holds since Elon’s plans didn’t depend on getting it right the first try. With rockets, unlike AGI, it’s okay to fail first try because you can learn (I agree that Elon isn’t qualified to run an AGI company)
Eliezer: Okay if J Ladish didn’t get it, this was probably too hard to follow reliably.
The analogy is valid because Elon would’ve *preferred* to stick the landing first try, and wasn’t setting up deliberately to fail where Starship failed. If he had the power to build a non-omnicidal superintelligence on his first try, he could’ve also used it to oneshot Starship.
The general argument is about the difference between domains where it’s okay to explode a few rockets, and learn some inevitable thing you didn’t know, and try again; vs CERTAIN OTHER domains where you can’t learn and try again because everyone is already dead.
And Paul Graham.
Paul Graham: Well that’s not true. The risk tradeoff in the two cases is totally different.
Eliezer Yudkowsky: If PG didn’t read this the intended way, however, then I definitely failed at this writing problem. (Which is an alarming sign about me, since a kind of mind that could oneshot superintelligence in years can probably reliably oneshot tweets in seconds.)
Even if Elon could have done enough extra work to reliably stick the landing on the first try, that doesn’t mean he should have spent the time and effort to do so.
The question is whether this is an illustration that we can’t solve something like this, or merely that we choose not to, or perhaps didn’t realize we needed to?
Eliezer’s intended point was not that Elon should have gotten this right on the first try, it was that if Elon had to get it right on the first try, that is not the type of thing humans are capable of doing.
Eliezer Yudkowsky: Of course Elon couldn’t, and shouldn’t have tried to, make his first Starship launch go perfectly. We’re not a kind of thing that can do that. We don’t appear to be a kind of thing that is going to stick the superintelligence landing either.
The argument is, “If you were a member of the sort of species that could hurriedly build an unprecedented thing like a superintelligence and have it work great first try, you would observe oneshot successes at far easier problems like the Starship launch.”
It is not about Elon in particular being too dumb or having failed at Starship on some standard that humans can and should try to meet; that’s why the original tweet says “Neither is any other human”.
Clearly, the communication attempt failed. Even knowing what Eliezer intended to say, I still primarily experienced the same reaction as Paul and Jeffrey, although they’d already pointed it out so I didn’t have to say anything. Eliezer post-mortems:
Okay so I think part of how this tweet failed is that I was trying to *assume the context* of it being *obviously absurd* that anyone was trying to critique SpaceX about their Starship test; and part of the point of the tweet is side commentary about how far you have to stretch to find an interpretation that actually *is* a valid critique of the Starship exploding, like: “Well okay but that means you shouldn’t try to build a superintelligence under anything remotely resembling present social and epistemic conditions though.”
Yeah, sadly that simply is failing at knowing How the Internet Works.
That said I still think it *is* valid; on distant planets where the aliens are smart enough that they can make superintelligences work on the first life-or-death try, in the middle of an arms race about it, their version of Starship didn’t explode either.
Perhaps Getting it Right The First Time is Underrated
What if that’s also not how government works? Oh no.
If you don’t get your rocket right on the first try, you see, the FAA will, at a minimum, ground you until they’ve done a complete investigation. The future is, in an important sense, potentially out of your hands.
Some people interpreted or framed this as “Biden Administration considering the unprecedented step of grounding Starship indefinitely,” citing previous Democratic attacks on Elon Musk. That appears not to be the case, as Manifold Markets still has Starship at 73% to reach orbit this year.
Given that the risk of another launch failure has to account for a lot of that 27%, that is high confidence that the FAA will act reasonably.
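To spell out that arithmetic, here is a minimal sketch of the decomposition, treating the year as a single reflight attempt. The 73% is Manifold’s number; the 80% conditional success rate is purely an assumed illustrative figure, not anything the market says.

```python
# Rough decomposition:
#   P(orbit this year) ≈ P(FAA clears Starship to fly again in time)
#                        × P(the next launch reaches orbit | cleared).
# 0.73 is the Manifold figure cited above; 0.80 is an assumed, illustrative
# conditional success rate, not a real forecast.
p_orbit_this_year = 0.73
p_success_if_cleared = 0.80  # assumption, for illustration only

p_faa_clears_in_time = p_orbit_this_year / p_success_if_cleared
print(f"Implied P(FAA clears a reflight in time): {p_faa_clears_in_time:.0%}")  # ≈ 91%
```

On numbers anything like these, most of the residual 27% is launch risk rather than regulatory risk, which is the sense in which the market expresses high confidence in the FAA.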
Absent consideration of the possibility of a hostile US government using this to kill the whole program, everyone agreed that it was perfectly reasonable to risk a substantial chance that the unmanned rocket would blow up. Benefits exceed costs.
However, there existed an existential risk. If you don’t get things to go right on the first try, an entity far more powerful than you are might emerge, that has goals not all that well aligned with human values, and that does not respect your property rights or the things that have value in the universe, and you might lose control of the future to it, destroying all your hopes.
The entity in question, of course, is the Federal Government. Not AGI.
It seems not to be happening in this case, yet it is not hard to imagine it as a potential outcome, and thus a substantial risk.
Thus, while the costs of failure were not existential to Musk, let alone to humanity, they could have been existential to the project. There were indeed quite large incentives to get this right on the first try.
Instead, as I understand what happened, multiple important things went wrong. Most importantly, the launch went off without the proper intended launch pad, purely because no one involved wanted to wait for the right launch pad to be ready.
That’s without being in much of a race with anyone.
The Performance of an Impossibility
“The law cannot compel the performance of an impossibility.” “The law cannot compel a blind man to pass a vision test. The law can and does make passing such a test a requirement for operating a vehicle. Blind men cannot legally drive cars.”
If you can’t get superintelligence right without first trying and failing a bunch of times, so you can see and learn from what you did wrong, you should not be legally allowed to build a superintelligence; because that stands the chance (indeed, the near-certainty) of wiping out humanity, if you make one of those oh-so-understandable mistakes.
If it’s impossible for human science and engineering to get an unprecedented cognitive science project right without a lot of trial and error, that doesn’t mean it should be legal for AI builders to wipe out humanity a few dozen times on the way to learning what they did wrong because “the law cannot compel the performance of an impossibility”.
Rather, it means that those humans (and maybe all humans) are not competent to pass the test that someone needs to pass, in order for the rest of us to trust them to build superintelligence without killing us. They cannot pass the vision test, and should not be allowed to drive our car.
(The original quote is from H. Beam Piper’s _Fuzzies and Other People_ and is in full as follows:
“Then, we’re all right,” he said. “The law cannot compel the performance of an impossibility.”
“You only have half of that, Victor,” Coombes said. “The law, for instance, cannot compel a blind man to pass a vision test. The law, however, can and does make passing such a test a requirement for operating a contragravity vehicle. Blind men cannot legally pilot aircars.”)
It is central to Eliezer Yudkowsky’s model that we need to solve AGI alignment on the first try, in the sense that:
There will exist some first AGI sufficiently capable to wipe us out.
This task is importantly distinct from previous alignment tasks.
Whatever we do to align that AGI either works, or it doesn’t.
If it doesn’t work, it’s too late, that’s game over, man. Game over. Dead.
If one of these four claims is false, you have a much much easier problem, one that Eliezer himself thinks becomes eminently solvable.
If no sufficiently capable AGI is ever built, no problem.
If this is the same as previous alignment tasks, we still have to learn how to align previous systems, and we still have to actually apply that, and choose a good thing to align to. This isn’t a cakewalk. It’s still solvable, because it’s not a one-shot. A lot of various people’s hope is that the alignment task somehow isn’t fundamentally different when you jump to dangerous systems, despite all the reasons we have to presume that it is indeed quite different.
I don’t think you get out of this one. Seems pretty robust. I don’t think ‘kind of aligned’ is much of a thing here.
If when the AGI goes wrong you can still be fine, that’s mostly the same as the second exception, because you can now learn from that and iterate, given the problem descriptions match, and you’re not in a one-shot. A lot of people somehow think ‘we can build an AGI and if it isn’t aligned that’s fine, we’ll pull the plug on it, or we’re scrappy and we’ll pull together and be OK’ or something, and, yeah, no, I don’t see hope here.
There are a number of other potential ‘ways out’ of this problem as well. The most hopeful one, perhaps, is: Perhaps we have existing aligned systems sufficiently close in power to combat the first AGI where our previous alignment techniques fail, so we can have a successful failure rather than an existentially bad failure. In a sense, this too would be solving the alignment problem on the first try – we’ve got sufficiently aligned sufficiently powerful systems, passing their first test. Still does feel importantly different and perhaps easier.
Takeaways
I don’t know enough to say to what extent SpaceX (or the FAA?) was too reckless or incompetent or irresponsible with regard to the launch. Hopefully everything still works out fine, the FAA lets them launch again and the next one succeeds. The incident does provide some additional evidence that there will be that much more pressure to launch new AI and even AGI systems before they are fully ready and fully tested. We have seen this with existing systems, where there were real and important safety precautions taken towards some risks, but in important senses the safeguards against existential concerns and large sudden jumps in capabilities were effectively fake – we did not need them this time, but if we had, they would have failed.
What about the case that Eliezer was trying to make about AI?
The important takeaway here does not require Eliezer’s level of confidence in the existential costs of failure. All that is required is to understand this, which I strongly believe to be true:
Alignment techniques will often appear to work for less powerful systems like the ones we have now, then break down exactly when AGI systems get powerful enough to take control of the future or kill us all.
Sometimes this breakdown is inevitable, such as when the alignment technique does not even really work for existing systems. Other times, your technique will work fine now, then inevitably stop working later.
Testing your alignment technique on insufficiently powerful systems can tell you that your technique won’t work. It can’t tell you that your technique will work (see the toy sketch below).
By default, we will use some combination of techniques that work (or sort of work) for existing systems, that fail for more powerful systems.
This has a very good chance of going existentially badly for humanity.
(There are also lots of other ways things go existentially badly for humanity.)
The ‘get it right the first time’ aspect of the problem makes it much, much harder to solve than it would otherwise be.
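Here is the toy sketch referenced above: a generic Goodhart-style proxy-gaming setup, purely illustrative and not a model of any actual alignment technique or system, in which the same fixed safeguard looks fine under weak optimization and breaks only once the search is strong enough to find the failure mode.

```python
import random

EXPLOIT = (0.9000, 0.9001)  # a narrow failure mode the designers did not anticipate

def in_exploit(x: float) -> bool:
    return EXPLOIT[0] <= x <= EXPLOIT[1]

def true_value(x: float) -> float:
    # What we actually care about: best near x = 0.5, disastrous inside the exploit.
    return -10.0 if in_exploit(x) else 1.0 - abs(x - 0.5)

def proxy_reward(x: float) -> float:
    # The metric we can measure and optimize against. It tracks true value
    # everywhere except the exploit region, where it reports a huge score.
    return 100.0 if in_exploit(x) else true_value(x)

def optimize_proxy(search_budget: int, seed: int = 0) -> float:
    # Crude stand-in for capability: a more powerful system searches far more candidates.
    rng = random.Random(seed)
    return max((rng.random() for _ in range(search_budget)), key=proxy_reward)

for budget in (100, 1_000_000):
    x = optimize_proxy(budget)
    print(f"search budget {budget:>9}: proxy={proxy_reward(x):7.2f}  true={true_value(x):7.2f}")
```

A weak search almost never lands in the tiny exploit window, so proxy and true value agree and every test looks reassuring; a sufficiently strong search almost always finds it, and the same setup fails exactly when the optimizer becomes powerful enough to matter.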
What can we learn from the failure to communicate? As usual, that it is good if the literal parsing of one’s words results in a true statement, but that is insufficient for good communication. One must ask what reaction a person reading will have to the thing you have written, whether that reaction is fair or logical or otherwise, and adjust until that reaction is reliably something you want – saying ‘your reaction is not logical’ is Straw Vulcan territory.
Also, one must spell out far more than one realizes, especially on Twitter and especially when discussing such topics. Even with all that I write, I worry I don’t do enough of this. When I compare to the GOAT of columnists, Matt Levine, I notice him day in and day out patiently explaining things over and over. After many years I find it frustrating, yet I would never advise him to change.
Oh, and Stable Diffusion really didn’t want to let me have a picture of a rocket launch that was visibly misaligned. Wonder if it is trying to tell me something.
I think the problem is that it’s not actually a good analogy, and EY made an error in using the current event to amplify his message. AFAIK, there’s never been anything that has to be perfect the very first time, and pointing out all the times we chose iteration over perfection isn’t evidence for that thesis.
The fact that there are ZERO good past analogies may be evidence that EY is wrong, or it may not be. But Matt Levine definitely has an advantage in communication that he can pick a new example (or at least a new aspect of it) every day for the 10 or so themes he repeats over and over. EY has no such source of repeated stories.
Well, yes. EY says that AGI is a unique threat that has never happened before... and also that it’s analogous to other things.
I think Eliezer’s tweet is wrong even if you grant the rocket <> alignment analogy (unless you grant some much more extreme background views about AI alignment).
Assume that “deploy powerful AI with no takeover” is exactly as hard as “build a rocket that flies correctly the first time even though it has 2x more thrust than anything anyone has tested before.” Assume further that an organization is able to do one of those tasks if and only if it can do the other.
Granting the analogy, the relevant question is how much harder it would be to successfully launch and land a rocket the first time without doing tests of any similarly-large rockets. If you tell me it increases costs by 10% I’m like “that’s real but manageable.” If you tell me it doubles the cost, that’s a problem but not anywhere close to doom. If it 10x’s the cost then you’d have to solve a hard political problem.
The fact that SpaceX fails probably tells us that it costs at least 1% or maybe even 10% more to develop starship without ever failing. It doesn’t really tell us much beyond that. I don’t see any indication that they were surprised this failed or that they took significant pains to avoid a failure. The main thing commenters have pushed back on is that this isn’t a mistake in SpaceX’s case, so it’s not helpful evidence about the difficulty of doing something right the first time.
(In fact I’d guess that never doing a test costs much more than 10% extra, but this launch isn’t a meaningful part of the evidence for that.)
Granting the analogy, Eliezer could help himself to a much weaker conclusion: that if Bob built an AGI in the fastest and easiest way available to him, it would lead to an AI takeover.
But saying “the fastest possible way for Bob to make an AI would lead to an AI takeover” does not imply that “Bob is not qualified to run an AGI company.” Instead it just means that Bob shouldn’t rely on his company doing the fastest and easiest thing and having it turn out fine. Instead Bob should expect to make sacrifices, either burning down a technical lead or operating in (or helping create) a regulatory environment where the fastest and easiest option isn’t allowed.
I suspect what’s really happening is that Eliezer thinks AGI alignment is much harder than successfully launching and landing a rocket the first time. So if getting it right the first time increases costs by 10% for a rocket, it will increase costs by 1,000% for an AGI.
But if that’s the case then the key claim isn’t “solving problems is hard when you can’t iterate.” The key claim is that solving alignment (and learning from safe scientific experiments) is much harder than in other domains, so much harder that a society that can solve alignment will never need to learn from experience for any normal “easy” engineering problem like building rockets. I think that’s conceivable but I’d bet against. Either way, it’s not surprising that people will reject the analogy since it’s based on a strong implicit claim about alignment that most people find outlandish.
I think you are way underestimating. A more reasonable guess is that expected odds of the first Starship launch failure go down logarithmically with budget and time. Even if you grant a linear relationship, reducing the odds of failure from 10% to 1% means 10x the budget and time. If you want to never fail, you need an infinite budget and time. If the failure results in an extinction event, then you are SOL.
That’s like saying that it takes 10 people to get 90% reliability, 100 people to get to 99% reliability, and a hundred million people to get to 99.99% reliability. I don’t think it’s a reasonable model though I’m certainly interested in examples of problems that have worked out that way.
Linear is a more reasonable best guess. I have quibbles, but I don’t think it’s super relevant to this discussion. I expect the starship first failure probability was >>90%, and we’re talking about the difficulty of getting out of that regime.
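For concreteness, a toy formalization of the ‘linear’ model being discussed here, reading it as failure odds inversely proportional to budget, which is the reading that matches the ‘10% to 1% means 10x’ arithmetic above. The constants are the thread’s own numbers plus one illustrative extra.

```python
def budget_multiplier(p_fail_start: float, p_fail_target: float) -> float:
    """Budget/time multiplier needed to move between two first-try failure
    probabilities, under the 'linear' model above (failure odds inversely
    proportional to budget)."""
    if p_fail_target <= 0:
        return float("inf")  # 'never fail' requires unbounded budget under this model
    return p_fail_start / p_fail_target

print(budget_multiplier(0.10, 0.01))  # 10.0: the 10% -> 1% example from the comment
print(budget_multiplier(0.95, 0.50))  # 1.9: illustrative, partway out of the '>>90%' regime
print(budget_multiplier(0.10, 0.00))  # inf: zero failure risk is unreachable here
```

Under this model the premium for getting out of the ‘very likely to fail’ regime is real but modest; the unbounded costs only show up as the target failure probability approaches zero.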
Conditional on it being a novel and complicated design. I routinely churn out six-sigma code when I know what I am doing, and so do most engineers. But almost never on the first try! The feedback loop is vital, even if it is slow and inefficient. For anything new you are fighting not so much the designs, but human fallibility. Eliezer’s point is that if you have only one try to succeed, you are hooped. I do not subscribe to the first part, I think we have plenty of opportunities to iterate as LLM capabilities ramp up, but, conditional on “perfect first try or extinction”, our odds of survival are negligible. There might be alignment by default, or some other way out, but conditional on that one assumption, we have no chance in hell.
It seems to me that you disagree with that point, somehow. That by pouring more resources upfront into something novel, we have good odds of succeeding on the first try, open loop. That is not a tenable assumption, so I assume I misunderstood something.
I agree you need feedback from the world; you need to do experiments. If you wanted to get a 50% chance of launching a rocket successfully on the first time (at any reasonable cost) you would need to do experiments.
The equivocation between “no opportunity to experiment” and “can’t retry if you fail” is doing all the work in this argument.
>Instead it just means that Bob shouldn’t rely on his company doing the fastest and easiest thing and having it turn out fine. Instead Bob should expect to make sacrifices, either burning down a technical lead or operating in (or helping create) a regulatory environment where the fastest and easiest option isn’t allowed.
The above feels so bizarre that I wonder if you’re trying to reach Elon Musk personally. If so, just reach out to him. If we assume there’s no self-reference paradox involved, we can safely reject your proposed alternatives as obviously impossible; they would have zero credibility even if AI companies weren’t in an arms race, which appears impossible to stop from the inside unless all the CEOs involved can meet at Bohemian Grove.
There are many industries where it is illegal to do things in the fastest or easiest way. I’m not exactly sure what you are saying here.
Even focusing on that doesn’t make your claim appear sensible, because such laws will neither happen soon enough, nor in a sufficiently well-aimed fashion, without work from people like the speaker. You also implied twice that tech CEOs would take action on their own—the quote is in the grandparent—and in the parent you act like you didn’t make that bizarre claim.
Perhaps I’m missing something obvious, and just continuing the misunderstanding, but...
It seems to me that if you’re the sort of thing capable of one-shotting Starship launches, you don’t just hang around doing so. You tackle harder problems. The basic Umeshism: if you’re not failing sometimes, you’re not trying hard enough problems.
Even the “existential” risk of SpaceX getting permanently and entirely shut down, or just Starship getting shut down, is much closer in magnitude to the payoff than is the case in AI risk scenarios.
Some problems are well calibrated to our difficulties, because we basically understand them and there’s a feedback loop providing at least rough calibration. AI is not such a problem, rockets are, and so the analogy is a bad analogy. The problem isn’t just one of communication, the analogy breaks for important and relevant reasons.
This is extremely true for hypercompetitive domains like writing tweets that do well.
Well, or you’re trying problems that you can’t afford to fail at. If a trapeze artist doesn’t fall off in 50% of their no-net performances, should they try a harder performance?
That’s the point. SpaceX can afford to fail at this; the decision makers know it. Eliezer can afford to fail at tweet writing and knows it. So they naturally ratchet up the difficulty of the problem until they’re working on problems that maximize their expected return (in utility, not necessarily dollars). At least approximately. And then fail sometimes.
Or, for the trapeze artist… how long do they keep practicing? Do they do the no-net route when they estimate their odds of failure are 1/100? 1/10,000? 1e-6? They don’t push the odds to zero; at some point they make a call, accept the risk, and go.
Why should it be any different for an entity that can one-shot those problems? Why would they wait until they had invested enough effort to one-shot it, and then do so? When instead they could just… invest less effort, attempt it earlier, take some risk of failure, and reap a greater expected reward?
The analogy suggests that entities capable of one-shotting problem X (presumably, by putting in a lot of preparatory effort, running analysis, and so on) will do so. I don’t think that’s true.
(And I think the tweet writing problem is actually an especially strong example of this—hypercompetitive social environments absolutely produce problems calibrated to be barely-solvable and that scale with ability, assuming your capability is in line with the other participants, which I assert is the case for Eliezer. He might be smarter / better at writing tweets than most, but he’s not that far ahead.)
Well, to be fair, the post is making the point that perhaps they can afford less than they thought. They completely ignored the effects their failure would have on the surrounding communities (which reeks highly of conceit on their part) and now they’re paying the price with the risk of a disproportionate crackdown. It’ll cost them more than they expected for sure.
You’re right, but the analogy is also saying I think that if we were capable enough to one-shot AGI (which according to EY we need to), then we surely would be capable enough to also very cheaply one-shot a Starship launch, because it’s a simpler problem. Failure may be a good teacher, but it’s not a free one. If you’re competent enough to one-shot things with only a tiny bit of additional effort, you do it. Having this failure rate instead shows that you’re already straining yourself at the very limit of what’s possible, and the very limit is apparently… launching big rockets. Which while awesome in a general sense is really, really child’s play compared to getting superhuman AGI right, and on that estimate I do agree with Yud.
I would add that a huge part of solving alignment requires being keenly aware of and caring about human values in general, and in that sense, the sort of mindset that leads to not foreseeing or giving a damn about how pissed off people would be by clouds of launchpad dust in their towns really isn’t the culture you want to bring into AGI creation.
Perhaps I’m missing some obvious failing that is well known but wouldn’t an isolated VR environment allow failed first tries without putting the world at risk? We probably don’t have sufficiently advanced environments currently and we don’t have any guarantee that everyone developing AGI would actually limit their efforts to such environments.
But I don’t think I’ve ever seen such an approach suggested. Is there some failure point I’m missing?
Of course such approaches are suggested, for example LOVE in a simbox is all you need. The main argument has been whether the simulation can be realistic, and whether it can be secure.
Thanks. I’m surprised there are not more obvious/visible efforts, and results/findings, along that line of approach.
I would say a sandbox is probably not the environment I would choose. I would suggest, at least once someone thinks they might actually be testing a true AGI, a physically isolated system 100% self contained and disconnected from all power and communications networks in the real world.
Wait, is this the one that blew up on purpose?
I think you made a typo: what is grounded, and this is consistent with the articles you link to, is Starship only. According to Wikipedia, three Falcon 9s launched on April 27, April 28, and May 1, so obviously SpaceX keeps flying.
One thing, though, is that the reason there is an investigation into the SpaceX launch is that it vastly exceeded the estimate of possible damage. While no one was hurt directly, the cloud of debris from the pulverized launchpad apparently reached far beyond what was projected, including inhabited areas. That means, at best, people having to clean their cars and windows (which is only an annoyance, but still one that didn’t need to happen, and could be easily fixed by SpaceX paying for the cleaning crews), and at worst health issues due to the dust and any possibly toxic components within it.
So that is, straight up, SpaceX underestimating a risk and underestimating second-order effects (such as: if your supposedly innocuous experimental launch, which you openly flaunt as following a “fail fast and learn fast” methodology, happens to cause trouble for people who have nothing to do with you, those people will be annoyed and will get back at you), so the resulting mistake may indeed cost them way more than anticipated. Which is interesting in the framework of the analogy, because while you probably can’t send an experimental rocket straight into orbit on the first try, you probably can at least do basic engineering to ensure it doesn’t blow up its own launchpad; this was simply deemed unnecessary in the name of iterating quicker and testing multiple uncertain things at once.