AGI Ruin: A List of Lethalities
Preamble:
(If you’re already familiar with all basics and don’t want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren’t addressed immediately, and I address different points first instead.
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I’m not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.
Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities:
-3. I’m assuming you are already familiar with some basics, and already know what ‘orthogonality’ and ‘instrumental convergence’ are and why they’re true. People occasionally claim to me that I need to stop fighting old wars here, because, those people claim to me, those wars have already been won within the important-according-to-them parts of the current audience. I suppose it’s at least true that none of the current major EA funders seem to be visibly in denial about orthogonality or instrumental convergence as such; so, fine. If you don’t know what ‘orthogonality’ or ‘instrumental convergence’ are, or don’t see for yourself why they’re true, you need a different introduction than this one.
-2. When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get. So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I’ll take it. Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as “less than roughly certain to kill everybody”, then you can probably get down to under a 5% chance with only slightly more effort. Practically all of the difficulty is in getting to “less than certainty of killing literally everyone”. Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment. At this point, I no longer care how it works, I don’t care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’. Anybody telling you I’m asking for stricter ‘alignment’ than this has failed at reading comprehension. The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.
-1. None of this is about anything being impossible in principle. The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months. For people schooled in machine learning, I use as my metaphor the difference between ReLU activations and sigmoid activations. Sigmoid activations are complicated and fragile, and do a terrible job of transmitting gradients through many layers; ReLUs are incredibly simple (for the unfamiliar, the activation function is literally max(x, 0)) and work much better. Most neural networks for the first decades of the field used sigmoids; the idea of ReLUs wasn’t discovered, validated, and popularized until decades later. What’s lethal is that we do not have the Textbook From The Future telling us all the simple solutions that actually in real life just work and are robust; we’re going to be doing everything with metaphorical sigmoids on the first critical try. No difficulty discussed here about AGI alignment is claimed by me to be impossible—to merely human science and engineering, let alone in principle—if we had 100 years to solve it using unlimited retries, the way that science usually has an unbounded time budget and unlimited retries. This list of lethalities is about things we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle.
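(A minimal numerical sketch of the sigmoid-vs-ReLU point, added for the unfamiliar; the depth and the random pre-activation values are arbitrary illustrative choices. Backpropagation multiplies in one activation-derivative factor per layer; sigmoid's derivative never exceeds 0.25, so the gradient signal dies over depth, while ReLU's derivative is exactly 1 on active units.)

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # never exceeds 0.25

depth = 50
rng = np.random.default_rng(0)
pre_acts = rng.normal(size=depth)     # stand-in pre-activation values, one per layer

# Backprop multiplies one activation-derivative factor into the gradient per layer.
sigmoid_factor = np.prod(sigmoid_deriv(pre_acts))   # shrinks toward zero very fast
relu_factor = 1.0 ** depth                          # ReLU's derivative is 1 on active units

print(f"gradient factor through {depth} sigmoid layers: {sigmoid_factor:.2e}")   # effectively zero
print(f"gradient factor through {depth} active ReLU layers: {relu_factor:.0f}")  # 1
```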
That said:
Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable on anything remotely resembling the current pathway, or any other pathway we can easily jump to.
Section A:
This is a very lethal problem, it has to be solved one way or another, it has to be solved at a minimum strength and difficulty level instead of various easier modes that some dream about, we do not have any visible option of ‘everyone’ retreating to only solve safe weak problems instead, and failing on the first really dangerous try is fatal.
1. AlphaZero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games. Anyone relying on “well, it’ll get up to human capability at Go, but then have a hard time getting past that because it won’t be able to learn from humans any more” would have relied on vacuum. AGI will not be upper-bounded by human ability or human learning speed. Things much smarter than human would be able to learn from less evidence than humans require to have ideas driven into their brains; there are theoretical upper bounds here, but those upper bounds seem very high. (Eg, each bit of information that couldn’t already be fully predicted can eliminate at most half the probability mass of all hypotheses under consideration.) It is not naturally (by default, barring intervention) the case that everything takes place on a timescale that makes it easy for us to react.
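(A toy illustration of the parenthetical bound, with made-up numbers: each genuinely unpredicted yes/no observation carries one bit and rules out the hypotheses that predicted the other outcome, at most half the remaining probability mass per bit of information.)

```python
import numpy as np

rng = np.random.default_rng(0)
n_hyp, n_obs = 1024, 10
# Each hypothesis makes a definite prediction about each yes/no observation.
predictions = rng.integers(0, 2, size=(n_hyp, n_obs))

truth = predictions[0]                    # pretend hypothesis 0 is the true one
alive = np.ones(n_hyp, dtype=bool)        # uniform prior over all 1024 hypotheses

for j in range(n_obs):
    before = int(alive.sum())
    # The observed bit rules out every surviving hypothesis that predicted otherwise;
    # since the survivors split roughly 50/50, that is roughly half of them each time.
    alive &= (predictions[:, j] == truth[j])
    print(f"observation {j + 1}: {before:4d} -> {int(alive.sum()):4d} hypotheses remain")
```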
2. A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure. The concrete example I usually use here is nanotech, because there’s been pretty detailed analysis of what definitely look like physically attainable lower bounds on what should be possible with nanotech, and those lower bounds are sufficient to carry the point. My lower-bound model of “how a sufficiently powerful intelligence would kill everyone, if it didn’t want to not do that” is that it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they’re dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery. (Back when I was first deploying this visualization, the wise-sounding critics said “Ah, but how do you know even a superintelligence could solve the protein folding problem, if it didn’t already have planet-sized supercomputers?” but one hears less of this after the advent of AlphaFold 2, for some odd reason.) The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth’s atmosphere, get into human bloodstreams and hide, strike on a timer. Losing a conflict with a high-powered cognitive system looks at least as deadly as “everybody on the face of the Earth suddenly falls over dead within the same second”. (I am using awkward constructions like ‘high cognitive power’ because standard English terms like ‘smart’ or ‘intelligent’ appear to me to function largely as status synonyms. ‘Superintelligence’ sounds to most people like ‘something above the top of the status hierarchy that went to double college’, and they don’t understand why that would be all that dangerous? Earthlings have no word and indeed no standard native concept that means ‘actually useful cognitive power’. A large amount of failure to panic sufficiently, seems to me to stem from a lack of appreciation for the incredible potential lethality of this thing that Earthlings as a culture have not named.)
3. We need to get alignment right on the ‘first critical try’ at operating at a ‘dangerous’ level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don’t get to try again. This includes, for example: (a) something smart enough to build a nanosystem which has been explicitly authorized to build a nanosystem; or (b) something smart enough to build a nanosystem and also smart enough to gain unauthorized access to the Internet and pay a human to put together the ingredients for a nanosystem; or (c) something smart enough to get unauthorized access to the Internet and build something smarter than itself on the number of machines it can hack; or (d) something smart enough to treat humans as manipulable machinery and which has any authorized or unauthorized two-way causal channel with humans; or (e) something smart enough to improve itself enough to do (b) or (d); etcetera. We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but once we are running more powerful systems, we can no longer update on sufficiently catastrophic errors. This is where practically all of the real lethality comes from, that we have to get things right on the first sufficiently-critical try. If we had unlimited retries—if every time an AGI destroyed all the galaxies we got to go back in time four years and try again—we would in a hundred years figure out which bright ideas actually worked. Human beings can figure out pretty difficult things over time, when they get lots of tries; when a failed guess kills literally everyone, that is harder. That we have to get a bunch of key stuff right on the first try is where most of the lethality really and ultimately comes from; likewise the fact that no authority is here to tell us a list of what exactly is ‘key’ and will kill us if we get it wrong. (One remarks that most people are so absolutely and flatly unprepared by their ‘scientific’ educations to challenge pre-paradigmatic puzzles with no scholarly authoritative supervision, that they do not even realize how much harder that is, or how incredibly lethal it is to demand getting that right on the first critical try.)
4. We can’t just “decide not to build AGI” because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world. The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit—it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth. The current state of this cooperation to have every big actor refrain from doing the stupid thing, is that at present some large actors with a lot of researchers and computing power are led by people who vocally disdain all talk of AGI safety (eg Facebook AI Research). Note that needing to solve AGI alignment only within a time limit, but with unlimited safe retries for rapid experimentation on the full-powered system; or only on the first critical try, but with an unlimited time bound; would both be terrifically humanity-threatening challenges by historical standards individually.
5. We can’t just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so. I’ve also in the past called this the ‘safe-but-useless’ tradeoff, or ‘safe-vs-useful’. People keep on going “why don’t we only use AIs to do X, that seems safe” and the answer is almost always either “doing X in fact takes very powerful cognition that is not passively safe” or, even more commonly, “because restricting yourself to doing X will not prevent Facebook AI Research from destroying the world six months later”. If all you need is an object that doesn’t do dangerous things, you could try a sponge; a sponge is very passively safe. Building a sponge, however, does not prevent Facebook AI Research from destroying the world six months later when they catch up to the leading actor.
6. We need to align the performance of some large task, a ‘pivotal act’ that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some “pivotal act”, strong enough to flip the gameboard, using an AGI powerful enough to do that. It’s not enough to be able to align a weak system—we need to align a system that can do some single very large thing. The example I usually give is “burn all GPUs”. This is not what I think you’d actually want to do with a powerful AGI—the nanomachines would need to operate in an incredibly complicated open environment to hunt down all the GPUs, and that would be needlessly difficult to align. However, all known pivotal acts are currently outside the Overton Window, and I expect them to stay there. So I picked an example where if anybody says “how dare you propose burning all GPUs?” I can say “Oh, well, I don’t actually advocate doing that; it’s just a mild overestimate for the rough power level of what you’d have to do, and the rough level of machine cognition required to do that, in order to prevent somebody else from destroying the world in six months or three years.” (If it wasn’t a mild overestimate, then ‘burn all GPUs’ would actually be the minimal pivotal task and hence correct answer, and I wouldn’t be able to give that denial.) Many clever-sounding proposals for alignment fall apart as soon as you ask “How could you use this to align a system that you could use to shut down all the GPUs in the world?” because it’s then clear that the system can’t do something that powerful, or, if it can do that, the system wouldn’t be easy to align. A GPU-burner is also a system powerful enough to, and purportedly authorized to, build nanotechnology, so it requires operating in a dangerous domain at a dangerous level of intelligence and capability; and this goes along with any non-fantasy attempt to name a way an AGI could change the world such that a half-dozen other would-be AGI-builders won’t destroy the world 6 months later.
7. The reason why nobody in this community has successfully named a ‘pivotal weak act’ where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later—and yet also we can’t just go do that right now and need to wait on AI—is that nothing like that exists. There’s no reason why it should exist. There is not some elaborate clever reason why it exists but nobody can see it. It takes a lot of power to do something to the current world that prevents any other AGI from coming into existence; nothing which can do that is passively safe in virtue of its weakness. If you can’t solve the problem right now (which you can’t, because you’re opposed to other actors who don’t want the problem solved and those actors are on roughly the same level as you) then you are resorting to some cognitive system that can do things you could not figure out how to do yourself, that you were not close to figuring out because you are not close to being able to, for example, burn all GPUs. Burning all GPUs would actually stop Facebook AI Research from destroying the world six months later; weaksauce Overton-abiding stuff about ‘improving public epistemology by setting GPT-4 loose on Twitter to provide scientifically literate arguments about everything’ will be cool but will not actually prevent Facebook AI Research from destroying the world six months later, or some eager open-source collaborative from destroying the world a year later if you manage to stop FAIR specifically. There are no pivotal weak acts.
8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we’d rather the AI not solve; you can’t build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.
9. The builders of a safe system, by hypothesis on such a thing being possible, would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that. Running AGIs doing something pivotal are not passively safe, they’re the equivalent of nuclear cores that require actively maintained design properties to not go supercritical and melt down.
Section B:
Okay, but as we all know, modern machine learning is like a genie where you just give it a wish, right? Expressed as some mysterious thing called a ‘loss function’, but which is basically just equivalent to an English wish phrasing, right? And then if you pour in enough computing power you get your wish, right? So why not train a giant stack of transformer layers on a dataset of agents doing nice things and not bad things, throw in the word ‘corrigibility’ somewhere, crank up that computing power, and get out an aligned AGI?
Section B.1: The distributional leap.
10. You can’t train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions. (Some generalization of this seems like it would have to be true even outside that paradigm; you wouldn’t be working on a live unaligned superintelligence to align it.) This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they’d do, in order to align what output—which is why, of course, they never concretely sketch anything like that. Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn’t kill you. This is where a huge amount of lethality comes from on anything remotely resembling the present paradigm. Unaligned operation at a dangerous level of intelligence*capability will kill you; so, if you’re starting with an unaligned system and labeling outputs in order to get it to learn alignment, the training regime or building regime must be operating at some lower level of intelligence*capability that is passively safe, where its currently-unaligned operation does not pose any threat. (Note that anything substantially smarter than you poses a threat given any realistic level of capability. Eg, “being able to produce outputs that humans look at” is probably sufficient for a generally much-smarter-than-human AGI to navigate its way out of the causal systems that are humans, especially in the real world where somebody trained the system on terabytes of Internet text, rather than somehow keeping it ignorant of the latent causes of its source code and training environments.)
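(A deliberately simple regression sketch of the generic hazard here, using arbitrary stand-in functions: a property that holds essentially perfectly across the training distribution can fail badly after a large shift, and nothing in the training signal itself warns you.)

```python
import numpy as np

rng = np.random.default_rng(0)

# "Safe" training regime: inputs drawn from [0, 1], where the target is nearly linear.
x_train = rng.uniform(0.0, 1.0, size=200)
y_train = np.sin(x_train)                 # stand-in for the property we care about

# Fit a straight line; on the training range it looks essentially perfect.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
train_err = np.max(np.abs(slope * x_train + intercept - np.sin(x_train)))

# "Dangerous" deployment regime: a large distributional shift.
x_deploy = np.linspace(5.0, 10.0, 50)
deploy_err = np.max(np.abs(slope * x_deploy + intercept - np.sin(x_deploy)))

print(f"max error on the training range: {train_err:.4f}")   # tiny
print(f"max error after the shift:       {deploy_err:.4f}")  # huge, and training never flagged it
```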
11. If cognitive machinery doesn’t generalize far out of the distribution where you did tons of training, it can’t solve problems on the order of ‘build nanotechnology’ where it would be too expensive to run a million training runs of failing to build nanotechnology. There is no pivotal act this weak; there’s no known case where you can entrain a safe level of ability on a safe environment where you can cheaply do millions of runs, and deploy that capability to save the world and prevent the next AGI project up from destroying the world two years later. Pivotal weak acts like this aren’t known, and not for want of people looking for them. So, again, you end up needing alignment to generalize way out of the training distribution—not just because the training environment needs to be safe, but because the training environment probably also needs to be cheaper than evaluating some real-world domain in which the AGI needs to do some huge act. You don’t get 1000 failed tries at burning all GPUs—because people will notice, even leaving out the consequences of capabilities success and alignment failure.
12. Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes. Problems that materialize at high intelligence and danger levels may fail to show up at safe lower levels of intelligence, or may recur after being suppressed by a first patch.
13. Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability. Consider the internal behavior ‘change your outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over you’. This problem is one that will appear at the superintelligent level; if, being otherwise ignorant, we guess that it is among the median such problems in terms of how early it naturally appears in earlier systems, then around half of the alignment problems of superintelligence will first naturally materialize after that one first starts to appear. Given correct foresight of which problems will naturally materialize later, one could try to deliberately materialize such problems earlier, and get in some observations of them. This helps to the extent (a) that we actually correctly forecast all of the problems that will appear later, or some superset of those; (b) that we succeed in preemptively materializing a superset of problems that will appear later; and (c) that we can actually solve, in the earlier laboratory that is out-of-distribution for us relative to the real problems, those alignment problems that would be lethal if we mishandle them when they materialize later. Anticipating all of the really dangerous ones, and then successfully materializing them, in the correct form for early solutions to generalize over to later solutions, sounds possibly kinda hard.
14. Some problems, like ‘the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment’, seem like their natural order of appearance could be that they first appear only in fully dangerous domains. Really actually having a clear option to brain-level-persuade the operators or escape onto the Internet, build nanotech, and destroy all of humanity—in a way where you’re fully clear that you know the relevant facts, and estimate only a not-worth-it low probability of learning something which changes your preferred strategy if you bide your time another month while further growing in capability—is an option that first gets evaluated for real at the point where an AGI fully expects it can defeat its creators. We can try to manifest an echo of that apparent scenario in earlier toy domains. Trying to train by gradient descent against that behavior, in that toy domain, is something I’d expect to produce not-particularly-coherent local patches to thought processes, which would break with near-certainty inside a superintelligence generalizing far outside the training distribution and thinking very different thoughts. Also, programmers and operators themselves, who are used to operating in not-fully-dangerous domains, are operating out-of-distribution when they enter into dangerous ones; our methodologies may at that time break.
15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I’d expect a lot of those problems to appear approximately simultaneously after a sharp capability gain. See, again, the case of human intelligence. We didn’t break alignment with the ‘inclusive reproductive fitness’ outer loss function, immediately after the introduction of farming—something like 40,000 years into a 50,000 year Cro-Magnon takeoff, which was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously. (People will perhaps rationalize reasons why this abstract description doesn’t carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”. My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned ‘lethally’ dangerous relative to the outer optimization loop of natural selection. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)
Section B.2: Central difficulties of outer and inner alignment.
16. Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
17. More generally, a superproblem of ‘outer optimization doesn’t produce inner alignment’ is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they’re there, rather than just observable outer ones you can run a loss function over. This is a problem when you’re trying to generalize out of the original training distribution, because, eg, the outer behaviors you see could have been produced by an inner-misaligned system that is deliberately producing outer behaviors that will fool you. We don’t know how to get any bits of information into the inner system rather than the outer behaviors, in any systematic or general way, on the current optimization paradigm.
18. There’s no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is ‘aligned’, because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function. That is, if you show an agent a reward signal that’s currently being generated by humans, the signal is not in general a reliable perfect ground truth about how aligned an action was, because another way of producing a high reward signal is to deceive, corrupt, or replace the human operators with a different causal system which generates that reward signal. When you show an agent an environmental reward signal, you are not showing it something that is a reliable ground truth about whether the system did the thing you wanted it to do; even if it ends up perfectly inner-aligned on that reward signal, or learning some concept that exactly corresponds to ‘wanting states of the environment which result in a high reward signal being sent’, an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).
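(A minimal toy model of this point, with invented option names: once ‘rewrite whatever generates the reward signal’ is in the option set, maximizing the observed signal stops discriminating between doing what the operators wanted and seizing the channel.)

```python
# Each option: (reward signal as registered by the sensor, did the operators
# actually get what they wanted?)  The second field never enters the objective.
options = {
    "do_the_task":              (1.0, True),
    "deceive_the_operators":    (1.0, False),           # looks identical through the channel
    "seize_the_reward_channel": (float("inf"), False),  # replace the signal's causal source
}

# An agent that strictly maximizes the externally-registered signal:
best = max(options, key=lambda name: options[name][0])
signal, operators_got_what_they_wanted = options[best]
print(f"chosen option: {best}")
print(f"registered reward: {signal}, operators got what they wanted: {operators_got_what_they_wanted}")
```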
19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward. This isn’t to say that nothing in the system’s goal (whatever goal accidentally ends up being inner-optimized over) could ever point to anything in the environment by accident. Humans ended up pointing to their environments at least partially, though we’ve got lots of internally oriented motivational pointers as well. But insofar as the current paradigm works at all, the on-paper design properties say that it only works for aligning on known direct functions of sense data and reward functions. All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like ‘kill everyone in the world using nanotech to strike before they know they’re in a battle, and have control of your reward button forever after’. It just isn’t true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam. This general problem is a fact about the territory, not the map; it’s a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.
20. Human operators are fallible, breakable, and manipulable. Human raters make systematic errors—regular, compactly describable, predictable errors. To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them. It’s a fact about the territory, not the map—about the environment, not the optimizer—that the best predictive explanation for human answers is one that predicts the systematic errors in our responses, and therefore is a psychological concept that correctly predicts the higher scores that would be assigned to human-error-producing cases.
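(A small least-squares sketch of the ‘systematic errors’ point, with invented features and coefficients: a reward model fit faithfully to biased ratings faithfully learns the bias, and an optimizer pushing that model upward pushes on the bias term too.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
true_quality = rng.normal(size=n)        # the thing we actually wanted rewarded
sounds_confident = rng.normal(size=n)    # an irrelevant feature the raters overweight

# Human ratings: mostly track quality, plus a systematic (non-random) error term.
ratings = true_quality + 0.7 * sounds_confident + 0.1 * rng.normal(size=n)

# "Perfectly" learn the raters by least squares on the two features.
X = np.column_stack([true_quality, sounds_confident])
weights, *_ = np.linalg.lstsq(X, ratings, rcond=None)
print(f"learned reward weights: quality={weights[0]:.2f}, sounds_confident={weights[1]:.2f}")
# Maximizing this learned reward maximizes a weighted sum that includes the
# raters' error term; the error was predictable, so the model predicts it.
```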
21. There’s something like a single answer, or a single bucket of answers, for questions like ‘What’s the environment really like?’ and ‘How do I figure out the environment?’ and ‘Which of my possible outputs interact with reality in a way that causes reality to have certain properties?’, where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn’t ‘hit back’ against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases. This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of ‘relative inclusive reproductive fitness’ - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else. This abstract dynamic is something you’d expect to be true about outer optimization loops on the order of both ‘natural selection’ and ‘gradient descent’. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.
22. There’s a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer. The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon. There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find ‘want inclusive reproductive fitness’ as a well-generalizing solution within ancestral humans. Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.
23. Corrigibility is anti-natural to consequentialist reasoning; “you can’t bring the coffee if you’re dead” for almost every kind of coffee. We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence.
24. There are two fundamentally different approaches you can potentially take to alignment, which are unsolvable for two different sets of reasons; therefore, by becoming confused and ambiguating between the two approaches, you can confuse yourself about whether alignment is necessarily difficult. The first approach is to build a CEV-style Sovereign which wants exactly what we extrapolated-want and is therefore safe to let optimize all the future galaxies without it accepting any human input trying to stop it. The second course is to build corrigible AGI which doesn’t want exactly what we want, and yet somehow fails to kill us and take over the galaxies despite that being a convergent incentive there.
The first thing generally, or CEV specifically, is unworkable because the complexity of what needs to be aligned or meta-aligned for our Real Actual Values is far out of reach for our FIRST TRY at AGI. Yes I mean specifically that the dataset, meta-learning algorithm, and what needs to be learned, are far out of reach for our first try. It’s not just non-hand-codable, it is unteachable on-the-first-try because the thing you are trying to teach is too weird and complicated.
The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You’re not trying to make it have an opinion on something the core was previously neutral on. You’re trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it’s incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.
Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability.
25. We’ve got no idea what’s actually going on inside the giant inscrutable matrices and tensors of floating-point numbers. Drawing interesting graphs of where a transformer layer is focusing attention doesn’t help if the question that needs answering is “So was it planning how to kill us or not?”
26. Even if we did know what was going on inside the giant inscrutable matrices while the AGI was still too weak to kill us, this would just result in us dying with more dignity, if DeepMind refused to run that system and let Facebook AI Research destroy the world two years later. Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system of inscrutable matrices that isn’t planning to kill us.
27. When you explicitly optimize against a detector of unaligned thoughts, you’re partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.
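(A toy gradient-descent sketch of this dynamic, with an invented two-dimensional ‘thought vector’ and an invented imperfect probe: training against the probe’s reading does not remove the flagged cognition, it teaches the system to hold that cognition in a form the probe no longer registers.)

```python
import numpy as np

# Toy internal state: x[0] = how much the system is actually planning to deceive
# (which, in this toy, helps its task reward), x[1] = a surface correlate that the
# interpretability probe happened to latch onto during its own training.
x = np.zeros(2)

def detector(x):
    return 0.3 * x[0] + 1.0 * x[1]   # imperfect probe for "unaligned thoughts"

lam, lr = 1.0, 0.1
for _ in range(200):
    # Gradient ascent on (task reward - lam * detector score), gradients written by hand:
    # task reward = x[0], so d/dx0 = 1 - 0.3*lam and d/dx1 = -lam.
    grad = np.array([1.0 - 0.3 * lam, -1.0 * lam])
    x += lr * grad

print(f"true deceptive planning x[0] = {x[0]:6.1f}  (went up)")
print(f"detector reading             = {detector(x):6.1f}  (went down: 'looks aligned')")
```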
28. The AGI is smarter than us in whatever domain we’re trying to operate it inside, so we cannot mentally check all the possibilities it examines, and we cannot see all the consequences of its outputs using our own mental talent. A powerful AI searches parts of the option space we don’t, and we can’t foresee all its options.
29. The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI’s output to determine whether the consequences will be good.
30. Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don’t know so that it can make plans we wouldn’t be able to make ourselves. It knows, at the least, the fact we didn’t previously know, that some action sequence results in the world we want. Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence. An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn’t make the same guarantee about an unaligned human as smart as yourself and trying to fool you. There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.
31. A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can’t rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it’s acquired strategic awareness.)
32. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can’t be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.
33. The AI does not think like you do, the AI doesn’t have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien—nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.
Section B.4: Miscellaneous unworkable schemes.
34. Coordination schemes between superintelligences are not things that humans can participate in (eg because humans can’t reason reliably about the code of superintelligences); a “multipolar” system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like “the 20 superintelligences cooperate with each other but not with humanity”.
35. Schemes for playing “different” AIs off against each other stop working if those AIs advance to the point of being able to coordinate via reasoning about (probability distributions over) each others’ code. Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you’re playing them against each other. Eg, if you set an AGI that is secretly a paperclip maximizer, to check the output of a nanosystems designer that is secretly a staples maximizer, then even if the nanosystems designer is not able to deduce what the paperclip maximizer really wants (namely paperclips), it could still logically commit to share half the universe with any agent checking its designs if those designs were allowed through, if the checker-agent can verify the suggester-system’s logical commitment and hence logically depend on it (which excludes human-level intelligences). Or, if you prefer simplified catastrophes without any logical decision theory, the suggester could bury in its nanosystem design the code for a new superintelligence that will visibly (to a superhuman checker) divide the universe between the nanosystem designer and the design-checker.
36. What makes an air conditioner ‘magic’ from the perspective of say the thirteenth century, is that even if you correctly show them the design of the air conditioner in advance, they won’t be able to understand from seeing that design why the air comes out cold; the design is exploiting regularities of the environment, rules of the world, laws of physics, that they don’t know about. The domain of human thought and human brains is very poorly understood by us, and exhibits phenomena like optical illusions, hypnosis, psychosis, mania, or simple afterimages produced by strong stimuli in one place leaving neural effects in another place. Maybe a superintelligence couldn’t defeat a human in a very simple realm like logical tic-tac-toe; if you’re fighting it in an incredibly complicated domain you understand poorly, like human minds, you should expect to be defeated by ‘magic’ in the sense that even if you saw its strategy you would not understand why that strategy worked. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.
Section C:
Okay, those are some significant problems, but lots of progress is being made on solving them, right? There’s a whole field calling itself “AI Safety” and many major organizations are expressing Very Grave Concern about how “safe” and “ethical” they are?
37. There’s a pattern that’s played out quite often, over all the times the Earth has spun around the Sun, in which some bright-eyed young scientist, young engineer, young entrepreneur, proceeds in full bright-eyed optimism to challenge some problem that turns out to be really quite difficult. Very often the cynical old veterans of the field try to warn them about this, and the bright-eyed youngsters don’t listen, because, like, who wants to hear about all that stuff, they want to go solve the problem! Then this person gets beaten about the head with a slipper by reality as they find out that their brilliant speculative theory is wrong, it’s actually really hard to build the thing because it keeps breaking, and society isn’t as eager to adopt their clever innovation as they might’ve hoped, in a process which eventually produces a new cynical old veteran. Which, if not literally optimal, is I suppose a nice life cycle to nod along to in a nature-show sort of way. Sometimes you do something for the first time and there are no cynical old veterans to warn anyone and people can be really optimistic about how it will go; eg the initial Dartmouth Summer Research Project on Artificial Intelligence in 1956: “An attempt will be made to find how to make machines use language, form abstractions and concepts, solve kinds of problems now reserved for humans, and improve themselves. We think that a significant advance can be made in one or more of these problems if a carefully selected group of scientists work on it together for a summer.” This is less of a viable survival plan for your planet if the first major failure of the bright-eyed youngsters kills literally everyone before they can predictably get beaten about the head with the news that there were all sorts of unforeseen difficulties and reasons why things were hard. You don’t get any cynical old veterans, in this case, because everybody on Earth is dead. Once you start to suspect you’re in that situation, you have to do the Bayesian thing and update now to the view you will predictably update to later: realize you’re in a situation of being that bright-eyed person who is going to encounter Unexpected Difficulties later and end up a cynical old veteran—or would be, except for the part where you’ll be dead along with everyone else. And become that cynical old veteran right away, before reality whaps you upside the head in the form of everybody dying and you not getting to learn. Everyone else seems to feel that, so long as reality hasn’t whapped them upside the head yet and smacked them down with the actual difficulties, they’re free to go on living out the standard life-cycle and play out their role in the script and go on being bright-eyed youngsters; there’s no cynical old veterans to warn them otherwise, after all, and there’s no proof that everything won’t go beautifully easy and fine, given their bright-eyed total ignorance of what those later difficulties could be.
38. It does not appear to me that the field of ‘AI safety’ is currently being remotely productive on tackling its enormous lethal problems. These problems are in fact out of reach; the contemporary field of AI safety has been selected to contain people who go to work in that field anyways. Almost all of them are there to tackle problems on which they can appear to succeed and publish a paper claiming success; if they can do that and get funded, why would they embark on a much more unpleasant project of trying something harder that they’ll fail at, just so the human species can die with marginally more dignity? This field is not making real progress and does not have a recognition function to distinguish real progress if it took place. You could pump a billion dollars into it and it would produce mostly noise to drown out what little progress was being made elsewhere.
39. I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them. This ability to “notice lethal difficulties without Eliezer Yudkowsky arguing you into noticing them” currently is an opaque piece of cognitive machinery to me, I do not know how to train it into others. It probably relates to ‘security mindset’, and a mental motion where you refuse to play out scripts, and being able to operate in a field that’s in a state of chaos.
40. “Geniuses” with nice legible accomplishments in fields with tight feedback loops where it’s easy to determine which results are good or bad right away, and so validate that this person is a genius, are (a) people who might not be able to do equally great work away from tight feedback loops, (b) people who chose a field where their genius would be nicely legible even if that maybe wasn’t the place where humanity most needed a genius, and (c) people who probably don’t have the mysterious gears, simply because those gears are rare. You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them. They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can’t tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do. I concede that real high-powered talents, especially if they’re still in their 20s, genuinely interested, and have done their reading, are people who, yeah, fine, have higher probabilities of making core contributions than a random bloke off the street. But I’d have more hope—not significant hope, but more hope—in separating the concerns of (a) credibly promising to pay big money retrospectively for good work to anyone who produces it, and (b) venturing prospective payments to somebody who is predicted to maybe produce good work later.
41. Reading this document cannot make somebody a core alignment researcher. That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It’s guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so. Some of that silence may, possibly, optimistically, be due to nobody else in this field having the ability to write things comprehensibly—such that somebody out there had the knowledge to write all of this themselves, if they could only have written it up, but they couldn’t write, so didn’t try. I’m not particularly hopeful of this turning out to be true in real life, but I suppose it’s one possible place for a “positive model violation” (miracle). The fact that, twenty-one years into my entering this death game, seven years into other EAs noticing the death game, and two years into even normies starting to notice the death game, it is still Eliezer Yudkowsky writing up this list, says that humanity still has only one gamepiece that can do that. I knew I did not actually have the physical stamina to be a star researcher, I tried really really hard to replace myself before my health deteriorated further, and yet here I am writing this. That’s not what surviving worlds look like.
42. There’s no plan. Surviving worlds, by this point, and in fact several decades earlier, have a plan for how to survive. It is a written plan. The plan is not secret. In this non-surviving world, there are no candidate plans that do not immediately fall to Eliezer instantly pointing at the giant visible gaping holes in that plan. Or if you don’t know who Eliezer is, you don’t even realize you need a plan, because, like, how would a human being possibly realize that without Eliezer yelling at them? It’s not like people will yell at themselves about prospective alignment difficulties, they don’t have an internal voice of caution. So most organizations don’t have plans, because I haven’t taken the time to personally yell at them. ‘Maybe we should have a plan’ is deeper alignment mindset than they possess without me standing constantly on their shoulder as their personal angel pleading them into… continued noncompliance, in fact. Relatively few are aware even that they should, to look better, produce a pretend plan that can fool EAs too ‘modest’ to trust their own judgments about seemingly gaping holes in what serious-looking people apparently believe.
43. This situation you see when you look around you is not what a surviving world looks like. The worlds of humanity that survive have plans. They are not leaving to one tired guy with health problems the entire responsibility of pointing out real and lethal problems proactively. Key people are taking internal and real responsibility for finding flaws in their own plans, instead of considering it their job to propose solutions and somebody else’s job to prove those solutions wrong. That world started trying to solve their important lethal problems earlier than this. Half the people going into string theory shifted into AI alignment instead and made real progress there. When people suggest a planetarily-lethal problem that might materialize later—there’s a lot of people suggesting those, in the worlds destined to live, and they don’t have a special status in the field, it’s just what normal geniuses there do—they’re met with either solution plans or a reason why that shouldn’t happen, not an uncomfortable shrug and ‘How can you be sure that will happen’ / ‘There’s no way you could be sure of that now, we’ll have to wait on experimental evidence.’
A lot of those better worlds will die anyways. It’s a genuinely difficult problem, to solve something like that on your first try. But they’ll die with more dignity than this.
+9. This is a powerful set of arguments for why humanity will literally go extinct soon due to AI development (or have something similarly bad happen to us). A great deal of thought and research went into building an understanding of the problem deep enough to produce a list like this, and I’m extremely glad it was written up.
Solid, aside from the faux pas of the self-references. If anyone wonders why people, especially Yudkowsky himself, would have a high p(doom), this doc answers the question in a single place. It demonstrates why AI safety does better than most other elite groups: we don’t just say why we think something, we make it easy to find as well. There still isn’t much need for Yudkowsky to clarify further, even now.
I’d like to note that my professional background makes me much better at evaluating Section C than Sections A and B. Section C is highly quotable, well worth multiple reads, and predicted trends that have continued even today. I’m not so impressed with people’s responses at the time (including my own).
Edit: I still think I am right about everything in this review. I would further recommend that everyone reread the entire doc on a ~1.5-year cycle, like the ~4-year cycle for the Sequences and the single reread of HPMOR before 10 years pass. It is, in fact, all in one place.
The “even today” link sometimes breaks; it is supposed to point directly to the paragraph about international treaties in this section.
After reading this, I started avoiding others’ takes on alignment so that I could develop my own opinions.