MIRI announces new “Death With Dignity” strategy
tl;dr: It’s obvious at this point that humanity isn’t going to solve the alignment problem, or even try very hard, or even go out with much of a fight. Since survival is unattainable, we should shift the focus of our efforts to helping humanity die with slightly more dignity.
Well, let’s be frank here. MIRI didn’t solve AGI alignment and at least knows that it didn’t. Paul Christiano’s incredibly complicated schemes have no chance of working in real life before DeepMind destroys the world. Chris Olah’s transparency work, at current rates of progress, will at best let somebody at DeepMind give a highly speculative warning about how the current set of enormous inscrutable tensors, inside a system that was recompiled three weeks ago and has now been training by gradient descent for 20 days, might possibly be planning to start trying to deceive its operators.
Management will then ask what they’re supposed to do about that.
Whoever detected the warning sign will say that there isn’t anything known they can do about that. Just because you can see the system might be planning to kill you, doesn’t mean that there’s any known way to build a system that won’t do that. Management will then decide not to shut down the project—because it’s not certain that the intention was really there or that the AGI will really follow through, because other AGI projects are hard on their heels, because if all those gloomy prophecies are true then there’s nothing anybody can do about it anyways. Pretty soon that troublesome error signal will vanish.
When Earth’s prospects are that far underwater in the basement of the logistic success curve, it may be hard to feel motivated about continuing to fight, since doubling our chances of survival will only take them from 0% to 0%.
That’s why I would suggest reframing the problem—especially on an emotional level—to helping humanity die with dignity, or rather, since even this goal is realistically unattainable at this point, die with slightly more dignity than would otherwise be counterfactually obtained.
Consider the world if Chris Olah had never existed. It’s then much more likely that nobody will even try and fail to adapt Olah’s methodologies to try and read complicated facts about internal intentions and future plans, out of whatever enormous inscrutable tensors are being integrated a million times per second, inside of whatever recently designed system finished training 48 hours ago, in a vast GPU farm that’s already helpfully connected to the Internet.
It is more dignified for humanity—a better look on our tombstone—if we die after the management of the AGI project was heroically warned of the dangers but came up with totally reasonable reasons to go ahead anyways.
Or, failing that, if people made a heroic effort to do something that could maybe possibly have worked to generate a warning like that but couldn’t actually in real life because the latest tensors were in a slightly different format and there was no time to readapt the methodology. Compared to the much less dignified-looking situation if there’s no warning and nobody even tried to figure out how to generate one.
Or take MIRI. Are we sad that it looks like this Earth is going to fail? Yes. Are we sad that we tried to do anything about that? No, because it would be so much sadder, when it all ended, to face our ends wondering if maybe solving alignment would have just been as easy as buckling down and making a serious effort on it—not knowing if that would’ve just worked, if we’d only tried, because nobody had ever even tried at all. It wasn’t subjectively overdetermined that the (real) problems would be too hard for us, before we made the only attempt at solving them that would ever be made. Somebody needed to try at all, in case that was all it took.
It’s sad that our Earth couldn’t be one of the more dignified planets that makes a real effort, correctly pinpointing the actual real difficult problems and then allocating thousands of the sort of brilliant kids that our Earth steers into wasting their lives on theoretical physics. But better MIRI’s effort than nothing. What were we supposed to do instead, pick easy irrelevant fake problems that we could make an illusion of progress on, and have nobody out of the human species even try to solve the hard scary real problems, until everybody just fell over dead?
This way, at least, some people are walking around knowing why it is that if you train with an outer loss function that enforces the appearance of friendliness, you will not get an AI internally motivated to be friendly in a way that persists after its capabilities start to generalize far out of the training distribution...
To be clear, nobody’s going to listen to those people, in the end. There will be more comforting voices that sound less politically incongruent with whatever agenda needs to be pushed forward that week. Or even if that ends up not so, this isn’t primarily a social-political problem, of just getting people to listen. Even if DeepMind listened, and Anthropic knew, and they both backed off from destroying the world, that would just mean Facebook AI Research destroyed the world a year(?) later.
But compared to being part of a species that walks forward completely oblivious into the whirling propeller blades, with nobody having seen it at all or made any effort to stop it, it is dying with a little more dignity, if anyone knew at all. You can feel a little incrementally prouder to have died as part of a species like that, if maybe not proud in absolute terms.
If there is a stronger warning, because we did more transparency research? If there’s deeper understanding of the real dangers and those come closer to beating out comfortable nonrealities, such that DeepMind and Anthropic really actually back off from destroying the world and let Facebook AI Research do it instead? If they try some hopeless alignment scheme whose subjective success probability looks, to the last sane people, more like 0.1% than 0? Then we have died with even more dignity! It may not get our survival probabilities much above 0%, but it would be so much more dignified than the present course looks to be!
Now of course the real subtext here, is that if you can otherwise set up the world so that it looks like you’ll die with enough dignity—die of the social and technical problems that are really unavoidable, after making a huge effort at coordination and technical solutions and failing, rather than storming directly into the whirling helicopter blades as is the present unwritten plan -
- heck, if there was even a plan at all -
- then maybe possibly, if we’re wrong about something fundamental, somehow, somewhere -
- in a way that makes things easier rather than harder, because obviously we’re going to be wrong about all sorts of things, it’s a whole new world inside of AGI -
- although, when you’re fundamentally wrong about rocketry, this does not usually mean your rocket prototype goes exactly where you wanted on the first try while consuming half as much fuel as expected; it means the rocket explodes earlier yet, and not in a way you saw coming, being as wrong as you were -
- but if we get some miracle of unexpected hope, in those unpredicted inevitable places where our model is wrong -
- then our ability to take advantage of that one last hope, will greatly depend on how much dignity we were set to die with, before then.
If we can get on course to die with enough dignity, maybe we won’t die at all...?
In principle, yes. Let’s be very clear, though: Realistically speaking, that is not how real life works.
It’s possible for a model error to make your life easier. But you do not get more surprises that make your life easy, than surprises that make your life even more difficult. And people do not suddenly become more reasonable, and make vastly more careful and precise decisions, as soon as they’re scared. No, not even if it seems to you like their current awful decisions are weird and not-in-the-should-universe, and surely some sharp shock will cause them to snap out of that weird state into a normal state and start outputting the decisions you think they should make.
So don’t get your heart set on that “not die at all” business. Don’t invest all your emotion in a reward you probably won’t get. Focus on dying with dignity—that is something you can actually obtain, even in this situation. After all, if you help humanity die with even one more dignity point, you yourself die with one hundred dignity points! Even if your species dies an incredibly undignified death, for you to have helped humanity go down with even slightly more of a real fight, is to die an extremely dignified death.
“Wait, dignity points?” you ask. “What are those? In what units are they measured, exactly?”
And to this I reply: Obviously, the measuring units of dignity are over humanity’s log odds of survival—the graph on which the logistic success curve is a straight line. A project that doubles humanity’s chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity.
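To make the arithmetic of that unit concrete, here is a minimal sketch in Python (my own illustration, not anything from the post; the function name `dignity_bits_gained` is an invented label): on the log-odds scale, doubling humanity’s odds of survival is worth exactly one bit, even when both probabilities still round to 0%.

```python
import math

def dignity_bits_gained(p_before: float, p_after: float) -> float:
    """Bits of 'dignity' gained by moving survival probability from
    p_before to p_after, measured on the log-odds scale.
    One bit corresponds to doubling the odds of survival."""
    def log2_odds(p: float) -> float:
        return math.log2(p / (1.0 - p))
    return log2_odds(p_after) - log2_odds(p_before)

# Doubling the odds of survival is one bit, even when both
# probabilities still round to "0%":
print(dignity_bits_gained(1e-6, 2e-6))    # ~1.0 bit
print(dignity_bits_gained(0.001, 0.002))  # ~1.0014 bits (odds double, not quite the raw probability)
```

The point of using log odds rather than raw probability is that the same multiplicative improvement counts the same whether you start from one-in-a-million or one-in-a-thousand, which is why "doubling our chances from 0% to 0%" still registers as progress on this scale.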
But if enough people can contribute enough bits of dignity like that, wouldn’t that mean we didn’t die at all? Yes, but again, don’t get your hopes up. Don’t focus your emotions on a goal you’re probably not going to obtain. Realistically, we find a handful of projects that contribute a few more bits of counterfactual dignity; get a bunch more not-specifically-expected bad news that makes the first-order object-level situation look even worse (where to second order, of course, the good Bayesians already knew that was how it would go); and then we all die.
With a technical definition in hand of what exactly constitutes dignity, we may now consider some specific questions about what does and doesn’t constitute dying with dignity.
Q1: Does ‘dying with dignity’ in this context mean accepting the certainty of your death, and not childishly regretting that or trying to fight a hopeless battle?
Don’t be ridiculous. How would that increase the log odds of Earth’s survival?
My utility function isn’t up for grabs, either. If I regret my planet’s death then I regret it, and it’s beneath my dignity to pretend otherwise.
That said, I fought hardest while it looked like we were in the more sloped region of the logistic success curve, when our survival probability seemed more around the 50% range; I borrowed against my future to do that, and burned myself out to some degree. That was a deliberate choice, which I don’t regret now; it was worth trying, I would not have wanted to die having not tried, I would not have wanted Earth to die without anyone having tried. But yeah, I am taking some time partways off, and trying a little less hard, now. I’ve earned a lot of dignity already; and if the world is ending anyways and I can’t stop it, I can afford to be a little kind to myself about that.
When I tried hard and burned myself out some, it was with the understanding, within myself, that I would not keep trying to do that forever. We cannot fight at maximum all the time, and some times are more important than others. (Namely, when the logistic success curve seems relatively more sloped; those times are relatively more important.)
All that said: If you fight marginally longer, you die with marginally more dignity. Just don’t undignifiedly delude yourself about the probable outcome.
Q2: I have a clever scheme for saving the world! I should act as if I believe it will work and save everyone, right, even if there are arguments that it’s almost certainly misguided and doomed? Because if those arguments are correct and my scheme can’t work, we’re all dead anyways, right?
A: No! That’s not dying with dignity! That’s stepping sideways out of a mentally uncomfortable world and finding an escape route from unpleasant thoughts! If you condition your probability models on a false fact, something that isn’t true on the mainline, it means you’ve mentally stepped out of reality and are now living somewhere else instead.
There are more elaborate arguments against the rationality of this strategy, but consider this quick heuristic for arriving at the correct answer: That’s not a dignified way to die. Death with dignity means going on mentally living in the world you think is reality, even if it’s a sad reality, until the end; not abandoning your arts of seeking truth; dying with your commitment to reason intact.
You should try to make things better in the real world, where your efforts aren’t enough and you’re going to die anyways; not inside a fake world you can save more easily.
Q2: But what’s wrong with the argument from expected utility, saying that all of humanity’s expected utility lies within possible worlds where my scheme turns out to be feasible after all?
A: Most fundamentally? That’s not what the surviving worlds look like. The surviving worlds look like people who lived inside their awful reality and tried to shape up their impossible chances; until somehow, somewhere, a miracle appeared—the model broke in a positive direction, for once, as does not usually occur when you are trying to do something very difficult and hard to understand, but might still be so—and they were positioned with the resources and the sanity to take advantage of that positive miracle, because they went on living inside uncomfortable reality. Positive model violations do ever happen, but it’s much less likely that somebody’s specific desired miracle that “we’re all dead anyways if not...” will happen; these people have just walked out of the reality where any actual positive miracles might occur.
Also and in practice? People don’t just pick one comfortable improbability to condition on. They go on encountering unpleasant facts true on the mainline, and each time saying, “Well, if that’s true, I’m doomed, so I may as well assume it’s not true,” and they say more and more things like this. If you do this, it very rapidly drives down the probability mass of the ‘possible’ world you’re mentally inhabiting. Pretty soon you’re living in a place that’s nowhere near reality. If there were an expected utility argument for risking everything on an improbable assumption, you’d get to make exactly one of them, ever. People using this kind of thinking usually aren’t even keeping track of when they say it, let alone counting the occasions.
Also also, in practice? In domains like this one, things that seem to first-order like they “might” work… have essentially no chance of working in real life, to second-order, after taking into account downward adjustments against optimism. AGI is a scientifically unprecedented experiment: a domain with lots of optimization pressures, some of which work against you; unforeseeable, intelligently selected execution pathways; a small target to hit; and all sorts of extreme forces that break things and that you couldn’t fully test before facing them. AGI alignment seems like it’s blatantly going to be an enormously Murphy-cursed domain, like rocket prototyping or computer security but worse.
In a domain like that, if you have a clever scheme for winning anyways that, to first-order theory, totally definitely seems like it should work, even to Eliezer Yudkowsky rather than somebody who just goes around saying that casually, then maybe there’s like a 50% chance of it working in practical real life after all the unexpected disasters and things turning out to be harder than expected.
If to first-order it seems to you like something in a complicated unknown untested domain has a 40% chance of working, it has a 0% chance of working in real life.
Also also also in practice? Harebrained schemes of this kind are usually actively harmful. Because they’re invented by the sort of people who’ll come up with an unworkable scheme, and then try to get rid of counterarguments with some sort of dismissal like “Well if not then we’re all doomed anyways.”
If nothing else, this kind of harebrained desperation drains off resources from those reality-abiding efforts that might try to do something on the subjectively apparent doomed mainline, and so position themselves better to take advantage of unexpected hope, which is what the surviving possible worlds mostly look like.
The surviving worlds don’t look like somebody came up with a harebrained scheme, dismissed all the obvious reasons it wouldn’t work with “But we have to bet on it working,” and then it worked.
That’s the elaborate argument about what’s rational in terms of expected utility, once reasonable second-order commonsense adjustments are taken into account. Note, however, that if you have grasped the intended emotional connotations of “die with dignity”, it’s a heuristic that yields the same answer much faster. It’s not dignified to pretend we’re less doomed than we are, or step out of reality to live somewhere else.
Q3: Should I scream and run around and go through the streets wailing of doom?
A: No, that’s not very dignified. Have a private breakdown in your bedroom, or a breakdown with a trusted friend, if you must.
Q3: Why is that bad from a coldly calculating expected utility perspective, though?
A: Because it associates belief in reality with people who act like idiots and can’t control their emotions, which worsens our strategic position in possible worlds where we get an unexpected hope.
Q4: Should I lie and pretend everything is fine, then? Keep everyone’s spirits up, so they go out with a smile, unknowing?
A: That also does not seem to me to be dignified. If we’re all going to die anyways, I may as well speak plainly before then. If into the dark we must go, let’s go there speaking the truth, to others and to ourselves, until the end.
Q4: Okay, but from a coldly calculating expected utility perspective, why isn’t it good to lie to keep everyone calm? That way, if there’s an unexpected hope, everybody else will be calm and oblivious and not interfering with us out of panic, and my faction will have lots of resources that they got from lying to their supporters about how much hope there was! Didn’t you just say that people screaming and running around while the world was ending would be unhelpful?
A: You should never try to reason using expected utilities again. It is an art not meant for you. Stick to intuitive feelings henceforth.
There are, I think, people whose minds readily look for and find even the slightly-less-than-totally-obvious considerations of expected utility, what some might call “second-order” considerations. Ask them to rob a bank and give the money to the poor, and they’ll think spontaneously and unprompted about insurance costs of banking and the chance of getting caught and reputational repercussions and low-trust societies and what if everybody else did that when they thought it was a good cause; and all of these considerations will be obviously-to-them consequences under consequentialism.
These people are well-suited to being ‘consequentialists’ or ‘utilitarians’, because their mind naturally sees all the consequences and utilities, including those considerations that others might be tempted to call by names like “second-order” or “categorical” and so on.
If you ask them why consequentialism doesn’t say to rob banks, they reply, “Because that actually realistically in real life would not have good consequences. Whatever it is you’re about to tell me as a supposedly non-consequentialist reason why we all mustn’t do that, seems to you like a strong argument, exactly because you recognize implicitly that people robbing banks would not actually lead to happy formerly-poor people and everybody living cheerfully ever after.”
Others, if you suggest to them that they should rob a bank and give the money to the poor, will be able to see the helped poor as a “consequence” and a “utility”, but they will not spontaneously and unprompted see all those other considerations in the formal form of “consequences” and “utilities”.
If you just asked them informally whether it was a good or bad idea, they might ask “What if everyone did that?” or “Isn’t it good that we can live in a society where people can store and transmit money?” or “How would it make effective altruism look, if people went around doing that in the name of effective altruism?” But if you ask them about consequences, they don’t spontaneously, readily, intuitively classify all these other things as “consequences”; they think that their mind is being steered onto a kind of formal track, a defensible track, a track of stating only things that are very direct or blatant or obvious. They think that the rule of consequentialism is, “If you show me a good consequence, I have to do that thing.”
If you present them with bad things that happen if people rob banks, they don’t see those as also being ‘consequences’. They see them as arguments against consequentialism; since, after all, consequentialism says to rob banks, which obviously leads to bad stuff, and so bad things would end up happening if people were consequentialists. They do not do a double-take and say “What?” That consequentialism leads people to do bad things with bad outcomes is just a reasonable conclusion, so far as they can tell.
People like this should not be ‘consequentialists’ or ‘utilitarians’ as they understand those terms. They should back off from this form of reasoning that their mind is not naturally well-suited for processing in a native format, and stick to intuitively informally asking themselves what’s good or bad behavior, without any special focus on what they think are ‘outcomes’.
If they try to be consequentialists, they’ll end up as Hollywood villains describing some grand scheme that violates a lot of ethics and deontology but sure will end up having grandiose benefits, yup, even while everybody in the audience knows perfectly well that it won’t work. You can only safely be a consequentialist if you’re genre-savvy about that class of arguments—if you’re not the blind villain on screen, but the person in the audience watching who sees why that won’t work.
Q4: I know EAs shouldn’t rob banks, so this obviously isn’t directed at me, right?
A: The people of whom I speak will look for and find the reasons not to do it, even if they’re in a social environment that doesn’t have strong established injunctions against bank-robbing specifically. They’ll figure it out even if you present them with a new problem isomorphic to bank-robbing but with the details changed.
Which is basically what you just did, in my opinion.
Q4: But from the standpoint of cold-blooded calculation -
A: Calculations are not cold-blooded. What blood we have in us, warm or cold, is something we can learn to see more clearly with the light of calculation.
If you think calculations are cold-blooded, that they only shed light on cold things or make them cold, then you shouldn’t do them. Stay by the warmth in a mental format where warmth goes on making sense to you.
Q4: Yes yes fine fine but what’s the actual downside from an expected-utility standpoint?
A: If good people were liars, that would render the words of good people meaningless as information-theoretic signals, and destroy the ability for good people to coordinate with others or among themselves.
If the world can be saved, it will be saved by people who didn’t lie to themselves, and went on living inside reality until some unexpected hope appeared there.
If those people went around lying to others and paternalistically deceiving them—well, mostly, I don’t think they’ll have really been the types to live inside reality themselves. But even imagining the contrary, good luck suddenly unwinding all those deceptions and getting other people to live inside reality with you, to coordinate on whatever suddenly needs to be done when hope appears, after you drove them outside reality before that point. Why should they believe anything you say?
Q4: But wouldn’t it be more clever to -
A: Stop. Just stop. This is why I advised you to reframe your emotional stance as dying with dignity.
Maybe there’d be an argument about whether or not to violate your ethics if the world was actually going to be saved at the end. But why break your deontology if it’s not even going to save the world? Even if you have a price, should you be that cheap?
Q4: But we could maybe save the world by lying to everyone about how much hope there was, to gain resources, until -
A: You’re not getting it. Why violate your deontology if it’s not going to really actually save the world in real life, as opposed to a pretend theoretical thought experiment where your actions have only beneficial consequences and none of the obvious second-order detriments?
It’s relatively safe to be around an Eliezer Yudkowsky while the world is ending, because he’s not going to do anything extreme and unethical unless it would really actually save the world in real life, and there are no extreme unethical actions that would really actually save the world the way these things play out in real life, and he knows that. He knows that the next stupid sacrifice-of-ethics proposed won’t work to save the world either, actually in real life. He is a ‘pessimist’ - that is, a realist, a Bayesian who doesn’t update in a predictable direction, a genre-savvy person who knows what the viewer would say if there were a villain on screen making that argument for violating ethics. He will not, like a Hollywood villain onscreen, be deluded into thinking that some clever-sounding deontology-violation is bound to work out great, when everybody in the audience watching knows perfectly well that it won’t.
My ethics aren’t for sale at the price point of failure. So if it looks like everything is going to fail, I’m a relatively safe person to be around.
I’m a genre-savvy person about this genre of arguments and a Bayesian who doesn’t update in a predictable direction. So if you ask, “But Eliezer, what happens when the end of the world is approaching, and in desperation you cling to whatever harebrained scheme has Goodharted past your filters and presented you with a false shred of hope; what then will you do?”—I answer, “Die with dignity.” Where “dignity” in this case means knowing perfectly well that’s what would happen to some less genre-savvy person; and my choosing to do something else which is not that. But “dignity” yields the same correct answer and faster.
Q5: “Relatively” safe?
A: It’d be disingenuous to pretend that it wouldn’t be even safer to hang around somebody who had no clue what was coming, didn’t know any mental motions for taking a worldview seriously, thought it was somebody else’s problem to ever do anything, and would just cheerfully party with you until the end.
Within the class of people who know the world is ending and consider it to be their job to do something about that, Eliezer Yudkowsky is a relatively safe person to be standing next to. At least, before you both die anyways, as is the whole problem there.
Q5: Some of your self-proclaimed fans don’t strike me as relatively safe people to be around, in that scenario?
A: I failed to teach them whatever it is I know. Had I known then what I know now, I would have warned them not to try.
If you insist on putting it into terms of fandom, though, feel free to notice that Eliezer Yudkowsky is much closer to being a typical liberaltarian science-fiction fan, as was his own culture that actually birthed him, than he is a typical member of any subculture that might have grown up later. Liberaltarian science-fiction fans do not usually throw away all their ethics at the first sign of trouble. They grew up reading books where those people were the villains.
Please don’t take this as a promise from me to play nice, as you define niceness; the world is ending, and also people have varying definitions of what is nice. But I presently mostly expect to end up playing nice, because there won’t be any options worth playing otherwise.
It is a matter of some concern to me that all this seems to be an alien logic to some strange people who—this fact is still hard for me to grasp on an emotional level—don’t spontaneously generate all of this reasoning internally, as soon as confronted with the prompt. Alas.
Q5: Then isn’t it unwise to speak plainly of these matters, when fools may be driven to desperation by them? What if people believe you about the hopeless situation, but refuse to accept that conducting themselves with dignity is the appropriate response?
A: I feel like I’ve now tried to live my life that way for a while, by the dictum of not panicking people; and, like everything else I’ve tried, that hasn’t particularly worked? There are no plans left to avoid disrupting, now, with other people’s hypothetical panic.
I think we die with slightly more dignity—come closer to surviving, as we die—if we are allowed to talk about these matters plainly. Even given that people may then do unhelpful things, after being driven mad by overhearing sane conversations. I think we die with more dignity that way, than if we go down silent and frozen and never talking about our impending death for fear of being overheard by people less sane than ourselves.
I think that in the last surviving possible worlds with any significant shred of subjective probability, people survived in part because they talked about it; even if that meant other people, the story’s antagonists, might possibly hypothetically panic.
But still, one should present the story-antagonists with an easy line of retreat. So -
Q6: Hey, this was posted on April 1st. All of this is just an April Fool’s joke, right?
A: Why, of course! Or rather, it’s a preview of what might be needful to say later, if matters really do get that desperate. You don’t want to drop that on people suddenly and with no warning.
Q6: Oh. Really? That would be such a relief!
A: Only you can decide whether to live in one mental world or the other.
Q6: Wait, now I’m confused. How do I decide which mental world to live in?
A: By figuring out what is true, and by allowing no other considerations than that to enter; that’s dignity.
Q6: But that doesn’t directly answer the question of which world I’m supposed to mentally live in! Can’t somebody just tell me that?
A: Well, conditional on you wanting somebody to tell you that, I’d remind you that many EAs hold that it is very epistemically unvirtuous to just believe what one person tells you, and not weight their opinion and mix it with the weighted opinions of others?
Lots of very serious people will tell you that AGI is thirty years away, and that’s plenty of time to turn things around, and nobody really knows anything about this subject matter anyways, and there’s all kinds of plans for alignment that haven’t been solidly refuted so far as they can tell.
I expect the sort of people who are very moved by that argument, to be happier, more productive, and less disruptive, living mentally in that world.
Q6: Thanks for answering my question! But aren’t I supposed to assign some small probability to your worldview being correct?
A: Conditional on you being the sort of person who thinks you’re obligated to do that and that’s the reason you should do it, I’d frankly rather you didn’t. Or rather, seal up that small probability in a safe corner of your mind which only tells you to stay out of the way of those gloomy people, and not get in the way of any hopeless plans they seem to have.
Q6: Got it. Thanks again!
A: You’re welcome! Goodbye and have fun!
Based on occasional conversations with new people, I would not be surprised if a majority of people who got into alignment between April 2022 and April 2023 did so mainly because of this post. Most of them say something like “man, I did not realize how dire the situation looked” or “I thought the MIRI folks were on it or something”.
This seems quite unlikely to be true to me, but might depend on what you consider to be “got into alignment”. (Or, if you are weighting by importance, it might depend on the weighting.)
I don’t know about most, which seems like a high bar, but it’s been a very common thing I’ve heard among newcomers I’ve interfaced with in the last 1-2 years. It’s definitely the single post I’ve seen most frequently cited.