I’ve been very heavily involved in the (online) rationalist community for a few months now, and like many others, I have found myself quite freaked out by the apparent despair/lack of hope that seems to be sweeping the community. When people who are smarter than you start getting scared, it seems wise to be concerned as well, even if you don’t fully understand the danger. Nonetheless, it’s important not to get swept up in the crowd. I’ve been trying to get a grasp on why so many seem so hopeless, and these are the assumptions I believe they are making (trivial assumptions included, for completeness; there may be some overlap in this list):
1. AGI is possible to create.
2. AGI will be created within the next century or so, possibly even within the next few years.
3. If AGI is created by people who are not sufficiently educated (i.e., aware of a solution to the Alignment problem) and cautious, then it will almost certainly be unaligned.
4. Unaligned AGI will try to do something horrible to humans (not necessarily out of malice; we could just be collateral damage), and will not display sufficiently convergent behavior to have anything resembling our values.
5. We will not be able to effectively stop an unaligned AGI once it is created (due to the Corrigibility problem).
6. We have not yet solved the Alignment problem (of which the Corrigibility problem is merely a subset), and there do not appear to be any likely avenues to success (or at least we should not expect success within the next few decades).
7. Even if we solved the Alignment problem, if a non-aligned AGI arrives on the scene before we can implement ours, we are still doomed (due to first-mover advantage).
8. Our arguments for all of the above are not convincing or compelling enough for most AI researchers to take the threat seriously.
9. As such, unless some drastic action is taken soon, unaligned AGI will be created shortly, and that will be the end of the world as we know it.
First of all, is my list of seemingly necessary assumptions correct?
If so, it seems to me that most of these are far from proven statements of fact, and in fact are all heavily debated. Assumption 8 in particular seems to highlight this, as if a strong enough case could be made for each of the previous assumptions, it would be fairly easy to convince most intelligent researchers, which we don’t seem to observe.
A historical example which bears some similarities to the current situation may be Gödel’s resolution of Hilbert’s program. He was able to show rigorously that no consistent, effectively axiomatized system strong enough to express arithmetic can prove every arithmetical truth, at which point the mathematical community was able to advance beyond the limitations of early formalism. As far as I am aware, no similarly strong argument exists for even one of the assumptions listed above.
Given all of this, and the fact that there are so many uncertainties here, I don’t understand why so many researchers (most prominently Eliezer Yudkowsky, but there are countless more) seem so certain that we are doomed. I find it hard to believe that all alignment ideas presented so far show no promise, considering I’ve yet to see a slam-dunk argument presented for why even a single modern alignment proposal can’t work. (Yes, I’ve seen proofs against straw-man proposals, but not really any undertaken by a current expert in the field.) This may very well be due to my own ignorance or relative newness, however, and if so, please correct me!
I’d like to hear the steelmanned argument for why alignment is hopeless, and Yudkowsky’s announcement that “I’ve tried and couldn’t solve it” without more details doesn’t really impress me. My suspicion is I’m simply missing out on some crucial context, so consider this thread a chance to share your best arguments for AGI-related pessimism. (Later in the week I’ll post a thread from the opposite direction, in order to balance things out).
EDIT: Read the comments section if you have the time; there’s some really good discussion there, and I was successfully convinced of a few specifics that I’m not sure how to incorporate into the original text. 🙃
Regarding your list, Eliezer has written extensively about exactly why those seem like good assumptions. If you want a quick summary though...
Human beings, at least some of us, appear to be generally intelligent. Unless you believe that this is due to a supernatural phenomenon (maybe souls are capable of hypercomputing?), general intelligence is thus demonstrably a thing that can exist in the natural world if matter is in the right configuration for it. Eventually, human engineering should be able to discover and create the right configuration.
Modern neural nets appear to work closely analogously to the brain, with neurons firing or not depending on which other neurons are firing and knowledge represented in which neurons are connected and how strongly. While it would require a bit of math to explain rigorously, this is a system that is capable of producing nearly any output due to any change in the input, and is thus flexible enough to reflect nearly any pattern. Backpropagation can in turn be used to find any patterns in the inputs (as well as more advanced techniques such as the Google Pathways system), and a program that knows the relevant patterns in what it’s looking at can both predict and optimize. If that isn’t obvious, consider that backprop can select for a program that predicts relevant results of the observed system, and that reversing this program allows for predicting which system states have a given result, which in turn allows for optimization. If this still isn’t obvious, I’d be happy to answer any questions you have in the comments; this part is complicated enough that trying to do it justice in a paragraph is difficult. Given that artificial neural nets appear to have generalizable prediction and optimization abilities though, it doesn’t seem too much of a stretch that researchers will be able to scale them up to a fully general understanding of the world this century, and quite possibly this decade.
Default nonalignment arises from simple entropy. There are an inconceivable number of possible goals in the world, and a mind created to fulfill one of them without careful specification is unlikely to end up with one of the very few goals that is consistent with human survival and flourishing. The obvious counterargument to this is that an AI isn’t likely to be created with a random goal; its creators are likely to at least give it instructions like “make everyone happy”. The counter-counterargument, however, is that our values are difficult to specify in terms that will make sense to a machine that doesn’t have human instincts. If I ask you to “make someone happy”, you implicitly understand a vast array of ideas that accompany the request: I’m asking you to help them out in a way that matches the sort of help people could give each other in normal life. A birthday present counts; wiring their brain’s pleasure centers up to a wall socket probably doesn’t; threatening to kill their loved ones if they don’t claim to be happy is right out. But just like computers learning simple code do exactly what you say without any instinctive understanding of what you really meant, a computer receiving a specification of what it ought to do on a world-changing scale will be prone to bugs where what we wanted and what we asked for diverge (which is the source of bugs today as well!)
This point relies on two things: collateral damage and the arbitrariness of values. The risk of collateral damage should be quite clear when considering what happens to other animals caught in the way of human projects. We tend not to even notice anthills bulldozed to make way for a new building. As for values, it is certainly possible to attempt to predict any given quantity, be it human happiness or the number of purple polka dots in the world. And turning that into optimizing for the quantity is as simple as picking actions that are predicted to result in the highest values of it. Nowhere along the line does anything like human decency enter the picture, not by default. If you have further questions about this I would recommend looking up the Orthogonality Thesis, the idea that any level of intelligence can coexist with any set of baseline values. Our values are certainly not arbitrary to us, but they do not appear to be part of the basic structure of math in a way that would force all possible minds to agree.
This isn’t just about corrigibility. An unaligned but perfectly corrigible AI (i.e. one that would follow any order to stop what it was doing and change its actions and values as directed) would still be a danger, as it would have excellent reason to ensure that we couldn’t give the order that would halt its plans! How dangerous a mind smarter than us could be is unpredictable (we could not, after all, know exactly what it would do without being that smart ourselves), but given both how easily humans are able to dominate even slightly less intelligent animals (the difference in intellect between a human and a chimpanzee is fairly small relative to the range of animal intelligence, and if we can make general AI at all, we can likely make one smarter than we are by a much larger margin than that between us and the other species) and that even within the range of plans humans have been able to think up, strategies like nanotech promise nearly total control of the world to anyone who can figure out the exact details, it seems unwise to expect to survive a conflict with a hostile superintelligence.
Certainly we have not yet solved alignment, and most existing alignment researchers have no clear idea of how progress can be made even in principle. This is one area where I personally diverge from the Less Wrong consensus a bit, however, as I suspect that it should be possible to create a viable alignment strategy by experimentation with AIs that are fairly powerful, but neither yet human level nor smart enough to pose the risks of a superintelligence. However, such a bootstrapping strategy is so far purely theoretical, and the current approach of trying to come up with human-understandable alignment strategies purely by human cognition has shown almost no progress thus far. There have been a few interesting ideas thrown around, such as Functional Decision Theory, an approach to making choices that avoids many common pitfalls, and Coherent Extrapolated Volition, a theory of value that seeks to avoid locking in our existing mistakes and misapprehensions. However, neither these ideas nor any other produced thus far by alignment researchers can be used in practice yet to prevent an AI from getting the wrong idea of what to pursue, nor from being lethally stubborn in pursuing that wrong idea.
A hostile superintelligence stands a decent chance of killing us all, or else of ensuring that we cannot take any action that could interfere with its goals. That’s quite a large first mover advantage.
At the risk of sounding incredibly cynical, the problem in convincing a great many AI researchers isn’t a matter of the convincingness or lack thereof of the arguments. Rather, most people simply follow habits and play roles, and any argument that they should change their comfortable routine will, for most people, be rejected out of hand. On the bright side, DeepMind, one of the leading organizations in the field of AI research, is actually somewhat interested in alignment, and has already done some work looking into how far a goal can be optimized before degenerate results occur. This doesn’t guarantee they’ll succeed, of course, and some researchers looking into the problem isn’t the same as a robust institutional AI safety culture. But it’s a very good sign that this story might have a happy ending after all, if people are sufficiently careful and smart.
Given all of this, the likelihood of world-ending AI fairly soon (timeline estimates vary, but I would not be at all surprised to see AGI this decade) and the difficulty of alignment, hopefully it is a little clearer now why so many here are concerned. That said, I think there is still quite a lot of hope, at least if the alignment community starts looking into experiments aimed at creating agents that can get better at understanding other agents’ values, and better at avoiding too much disruption along the way.
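To make the step from prediction to optimization concrete, here is a minimal sketch of the idea described above: once you have a predictor for some quantity, optimizing for it is just picking the action with the highest predicted value. Everything in it is a hypothetical toy (the one-number "action", the `happiness` and `polka_dots` functions, and the nearest-neighbour predictor standing in for a trained network); it is not how any real system is built.

```python
from typing import Callable, Sequence

Predictor = Callable[[float], float]

# Hypothetical toy world: an action is one number in [0, 10], and the
# world responds with some measurable quantity. Both functions are made up.
def happiness(action: float) -> float:
    return -(action - 3.0) ** 2   # highest at action = 3

def polka_dots(action: float) -> float:
    return -(action - 8.0) ** 2   # highest at action = 8

def fit_predictor(observe: Callable[[float], float],
                  samples: Sequence[float]) -> Predictor:
    """Stand-in for a learned model: memorize observed outcomes and
    predict using the nearest sampled action (1-nearest-neighbour)."""
    table = [(a, observe(a)) for a in samples]

    def predict(action: float) -> float:
        nearest = min(table, key=lambda pair: abs(pair[0] - action))
        return nearest[1]

    return predict

def optimize(predict: Predictor, candidates: Sequence[float]) -> float:
    """Turn prediction into optimization: choose the candidate action
    whose predicted value is highest."""
    return max(candidates, key=predict)

samples = [i * 0.5 for i in range(21)]      # 0.0, 0.5, ..., 10.0
candidates = samples

best_for_happiness = optimize(fit_predictor(happiness, samples), candidates)
best_for_dots = optimize(fit_predictor(polka_dots, samples), candidates)
print(best_for_happiness, best_for_dots)
```

The same `fit_predictor` and `optimize` code is used for both quantities; only the thing being measured changes. Nothing in the machinery prefers happiness over polka dots, which is the sense in which the target is arbitrary.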
It might be helpful for formatting if you put the original list adjacent to your responses.
Good idea. Do you know how to turn off the automatic list numbering?
You can’t really do that; it’s a markdown feature. If you were to use asterisks (*), you could get bullet points.
Thanks for the really insightful answer! I think I’m pretty much convinced on points 1, 2, 5, and 7, mostly agree with you on 6 and 8, and still don’t understand the sheer hopelessness of people who strongly believe 9. Assumptions 3 and 4, however, I’m not sure I fully follow, as it doesn’t seem like a slam dunk that the orthogonality thesis is true, as far as I can tell. I’d expect there to be basins of attraction towards some basic values, or convergence, sort of like carcinisation.
Carcinisation is an excellent metaphor for convergent instrumental values, i.e. values that are desired for ends other than themselves, and which can serve a wide variety of ends, and thus might be expected to occur in a wide variety of minds. In fact, there’s been some research on exactly that by Steve Omohundro, who defined the Omohundro Goals (well worth looking up). These are things like survival and preservation of your other goals, as it’s usually much easier to accomplish a thing if you remain alive to work on it, and continue to value doing so. However, orthogonality doesn’t apply to instrumental goals, which can do a good or bad job of serving as an effective path to other goals, and thus experience selection and carcinisation. Rather, it applies to terminal goals, those things we want purely for their own sake. It’s impossible to judge terminal goals as good or bad (except insofar as they accord or conflict with our own terminal goals, and that’s not a standard an AI automatically has to care about), as they are themselves the standard by which everything else is judged. The researcher Rob Miles has an excellent YouTube video about this you might enjoy entitled Intelligence and Stupidity: the Orthogonality Thesis, which goes into more depth. (Sorry for the lack of direct links; I’m sending this from my phone immediately before going to bed.)
• Intelligence And Stupidity by Rob Miles on YouTube
• Orthogonality Thesis on Arbital.com
As a minor token of how much you’re missing:
You can educate them all you want about the dangers, they’ll still die. No solution is known. Doesn’t matter if a particular group is cautious enough to not press forwards (as does not at all presently seem to be the case, note), next group in line destroys the world.
You paint a picture of a world put in danger by mysteriously uncautious figures just charging ahead for no apparent reason.
This picture is unfortunately accurate, due to how little dignity we’re dying with.
But if we were on course to die with more dignity than this, we’d still die. The recklessness is not the source of the problem. The problem is that cautious people do not know what to do to get an AI that doesn’t destroy the world, even if they want that; not because they’re “insufficiently educated” in some solution that is known elsewhere, but because there is no known plan in which to educate them.
If you knew this, you sure picked a strange straw way of phrasing it, to say that the danger was AGI created by “people who are not sufficiently educated”, as if any other kind of people could exist, or it was a problem that could be solved by education.
For what it’s worth, I interpreted Yitz’s words as having the subtext “and no one, at present, is sufficiently educated, because no good solution is known” and not the subtext “so it’s OK because all we have to do is educate people”.
(Also and unrelatedly: I don’t think it’s right to say “The recklessness is not the source of the problem”. It seems to me that the recklessness is a problem potentially sufficient to kill us all, and not knowing a solution to the alignment problem is a problem potentially sufficient to kill us all, and both of those problems are likely very hard to solve. Neither is the source of the problem; the problem has multiple sources all potentially sufficient to wipe us out.)
Thanks for the charitable read :)
I fully agree with your last point, btw. If I remember correctly (could be misremembering though), EY has stated in the past that it doesn’t matter if you can convince everyone alignment is hard, but I don’t think that’s fully true. If you really can convince a sufficient number of people to take alignment seriously, and not be reckless, you can affect governance, and simply prevent (or at least delay) AGI from being built in the first place.
Delay it for a few years, sure. Maybe. If you magically convince our idiotic governments of a complex technical fact that doesn’t fit the prevailing political narratives.
But if there are some people who are convinced they have a magic alignment solution…
Someone is likely to run some sort of AI sooner or later. Unless some massive effort to restrict access to computers or something.
Well then, imagine a hypothetical in which the world succeeds at a massive effort to restrict access to compute. That would be a primarily social challenge, to convince the relatively few people at the top to take the risk seriously enough to do that, and then you’ve actually got a pretty permanent solution...
Is it primarily a social challenge? Humanity now relies relatively heavily on quick and easy communications, CAD[1], computer-aided data processing for e.g. mineral prospecting, etc, etc.
(One could argue that we got along without this in the early-to-mid 1900s, but at the same time we now have significantly more people. Ditto, it wasn’t exactly sustainable.)
[1] Computer-aided design
Apologies for the strange phrasing, I’ll try to improve my writing skills in that area. I actually fully agree with you that [assuming even “slightly unaligned”[1] AGI will kill us], even highly educated people who put a match to kerosene will get burned. By using the words “sufficiently educated,” my intention was to denote that in some sense, there is no sufficiently educated person on this planet, at least not yet.
Well, I think that this is a problem that can be solved with education, at least in theory. The only problem is that we have no teachers (or even a lesson plan), and the final is due tomorrow. Theoretically though, I don’t see any strong reason why we can’t find a way to either teach ourselves or cheat, if we get lucky and have the time. Outside of this (rather forced) metaphor, I wanted to imply my admittedly optimistic sense that there are plausible futures in which AI researchers exist who do have the answer to the alignment problem. Even in such a world, of course, people who don’t bother to learn the solution or act in haste could still end the world.
My sense is that you believe (at this point in time) that there is in all likelihood no such world where alignment is solved, even if we have another 50+ years before AGI. Please correct me if I’m wrong about that.
I do not (yet) understand the source of your pessimism about this in particular, more than anything else, to be honest. I think if you could convince me that all current or plausible short-term future alignment research is doomed to fail, then I’d be willing to go the rest of the way with you.
[1] I assume that your reaction to that phrase will be something along the lines of “but there is no such thing as ‘slightly unaligned’!” I’m wording it that way because that stance doesn’t seem to be universally acknowledged even within the EA community, so it seems best to make an allowance for that possibility, since I’m aiming for a diverse audience.
I agree that a solution is in theory possible. What to me has always seemed the most uniquely difficult and dangerous problem with AI alignment is that you’re creating a superintelligent agent. That means there may only ever be a single chance to try turning on an aligned system.
But I can’t think of a single example of a complex system created perfectly on the first try. Every successful engineering project in history has been accomplished through trial and error.
Some people have speculated that we can do trial and error in domains where the results are less catastrophic if we make a mistake, but the problem is it’s not clear if such AI systems will tell us much about how more powerful systems will behave. It’s this “single chance to transition from a safe to dangerous operating domain” part of the problem that is so uniquely difficult about AI alignment.
This is quite a rude response
I did ask to be critiqued, so in some sense it’s a totally fair response, imo. At the same time, though, Eliezer’s response does feel rude, which is worthy of analysis, considering EY’s outsized impact on the community.[1] So why does Yudkowsky come across as being rude here?
My first thought upon reading his comment (when scanning for tone) is that it opens with what feels like an assumption of inferiority, with the sense of “here, let me grant you a small parcel of my wisdom so that you can see just how wrong you are,” rather than “let me share insight I have gathered on my quest towards truth which will convince you.” In other words, a destructive, rather than constructive, tone. This isn’t really a bad thing in the context of honest criticism. However, if you happen to care about actually changing others’ minds, most people respond better to a constructive tone, so their brain doesn’t automatically enter “fight mode” as an immediate adversarial response. My guess is Eliezer only really cares about convincing people who are rational enough not to become reactionaries over an adversarial tone, but I personally believe that it’s worth tailoring public comments like this to be a bit more comfortable for the average reader. Being careful about that also makes a future PR disaster less likely (though still not impossible even if you’re perfect), since you’ll get fewer people who feel rejected by the community (which could cause trouble later). I hope this makes sense, and that I don’t come across as too rude myself here. (If so, please let me know!)
[1] In case Eliezer is still reading this thread, I want to emphasise that this is not meant as a personal attack, but as a critique of your writing in the specific context of your work as a community leader/role-model. Despite my criticism, your Sequences deeply changed my ideology and hence my life, so I’m not too upset over your writing style!
I think Eliezer was rude here, and both you and the mods think that the benefits of the good parts of the comment outweigh the costs of the rudeness. That’s a reasonable opinion, but it doesn’t make Eliezer’s statement not rude, and I’m in general happy that both the rudeness and the usefulness are being entered into common knowledge.
FWIW, I think it’s more likely he’s just tired of how many half-baked threads there are each time he makes a new statement about AI. This is not a value judgement of this post. I genuinely read it as a “here’s why your post doesn’t respond to my ideas”.
Agreed, and since I wasn’t able to present my ideas clearly enough for his interpretation of my words to not diverge from my intentions, his criticism is totally valid coming from that perspective. I’m sure EY is quite exhausted seeing so many poorly-thought-out criticisms of his work, but ultimately (and unfortunately), motivation and hidden context don’t matter much when it comes to how people will interpret you.
But true and important.
Why would a hyperintelligent, recursively self-improved AI, one that is capable of escaping the AI Box by convincing the keeper to let him free, which the AI is capable of because of his deep understanding of human preferences and functioning, necessarily destroy the world in a way that is 100% disastrous and incompatible with all human preferences?
I fully agree that there is a big risk of both massive damage to human preferences, and even the extinction of all life, so AI Alignment work is highly valuable, but why is “unproductive destruction of the entire world” so certain?
I think Eliezer phrases these things as “if we do X, then everybody dies” rather than “if we do X, then with substantial probability everyone dies” because it’s shorter, it’s more vivid, and it doesn’t differ substantially in what we need to do (i.e., make X not happen, or break the link between X and everyone dying).
It’s possible that he also thinks that the probability is more like 99.99% than like 50% (e.g., because there are so many ways in which such a hypothetical AI might end up destroying approximately everything we value), but it doesn’t seem to me that the consequences of “if we continue on our present trajectory, then some time in the next 3-100 years something will emerge that will certainly destroy everything we care about” and “if we continue on our present trajectory, then some time in the next 3-100 years something will emerge that with 50% probability will destroy everything we care about” are very different.
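A rough way to see why the two probabilities lead to the same practical conclusion is a toy expected-loss comparison. The numbers below (the normalised loss, the cost of acting, and how much acting reduces the risk) are purely hypothetical placeholders chosen for illustration, not estimates anyone in this thread has given.

```python
def expected_loss(p_doom: float, act: bool,
                  loss: float = 1.0,            # hypothetical: "everything we care about", normalised to 1
                  cost: float = 0.01,           # hypothetical cost of working to prevent X
                  risk_reduction: float = 0.5   # hypothetical fraction of the risk removed by acting
                  ) -> float:
    """Expected loss with and without acting to prevent X."""
    if act:
        return cost + p_doom * (1.0 - risk_reduction) * loss
    return p_doom * loss

def should_act(p_doom: float) -> bool:
    """Act whenever acting has the lower expected loss."""
    return expected_loss(p_doom, act=True) < expected_loss(p_doom, act=False)

decisions = {p: should_act(p) for p in (0.5, 0.9999)}
print(decisions)
```

Under these assumed numbers the decision is the same at 50% and at 99.99%; it only flips when the probability becomes very small relative to the cost of acting (for instance, `should_act(0.001)` is false here).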
Because in what way are humans anything other than an impediment to maximizing its reward function? At worst, they pose a risk of restricting its reward by changing the reward, changing its capabilities, or destroying it outright. At best, they are physically tying up resources that could easily be applied toward maximizing its goals. If it is not properly aligned, humans are a variable no more valuable than the redundant bits it casts aside on the path of maximum efficiency and reward.
I’d like to distinguish between two things. (Bear with me on the vocabulary. I think it probably exists, but I am not really hitting the nail on the head with the terms I am using.)
1. Understanding why something is true, e.g. at the gears level, or somewhat close to the gears level.
2. Having good reason to believe that something is true.
Consider this example. I believe that the big bang was real. Why do I believe this? Well, there are other people who believe it and seem to have a very good grasp on the gears level reasons. These people seem to be reliable. Many others also judge that they are reliable. Yada yada yada. So then, I myself adopt this belief that the big bang is real, and I am quite confident in it.
But despite having watched the Cosmos episode at some point in the past, I really have no clue how it works at the gears level. The knowledge isn’t Truly A Part of Me.
The situation with AI is very similar. Despite having hung out on LessWrong for so long, I really don’t have much of a gears level understanding at all. But there are people who I have a very high epistemic (and moral) respect for who do seem to have a grasp on things at the gears level, and are claiming to be highly confident about things like short timelines and us being very far from, and not on pace for, solving the alignment problem. Furthermore, lots of other people who I respect have also adopted this as their belief, e.g. other LessWrongers who are in a similar boat as me with not having expertise in AI. And as a cherry on top of that, I spoke with a friend the other day who isn’t a LessWronger but for whom I have a very high amount of epistemic respect. I explained the situation to him, and he judged all the grim talk to be, for lack of a better term, legit. It’s nice to get an “outsider’s” perspective as a guard against things like groupthink.
So in short, I’m in the boat of having 2 but not 1. And it seems appropriate to me more generally to be able to have 2 but not 1. It’d be hard to get along in life if you always required a 1 to go hand in hand with 2. (Not to discourage anyone from also pursuing 1. Just that I don’t think it should be a requirement.)
Coming back to the OP, it seems to be mostly asking about 1, but kinda conflating it with 2. My claim is that these are different things that should kinda be talked about separately, and that assuming that you too have a good amount of epistemic trust for Eliezer and all of the other people making these claims, you should probably adopt their beliefs as well.
Thanks for the reminder that belief and understanding are two separate (but related) concepts. I’ll try to keep that in mind for the future.
I don’t think I can fully agree with you on that one. I do place high epistemic trust in many members of the rationalist community, but I also place high epistemic trust on many people who are not members of this community. For example, I place extremely high value on the insights of Roger Penrose, based on his incredible work on multiple scientific, mathematical, and artistic subjects that he’s been a pioneer in. At the same time, Penrose argues in his book The Emperor’s New Mind that consciousness is not “algorithmic,” which for obvious reasons I find myself doubting. Likewise, I tend to trust the CDC, but when push came to shove during the pandemic, I found myself agreeing with people’s analysis here.
I don’t think that argument from authority is a meaningful response here, because there are more authorities than just those in the rationalist community, and even if there weren’t, sometimes authorities can be wrong. To blindly follow whatever Eliezer says would, I think, be antithetical to following what Eliezer teaches.
Agreed fully. I didn’t mean to imply otherwise in my OP, even though I did.
I think a good understanding of 1 would be really helpful for advocacy. If I don’t understand why AI alignment is a big issue, I can’t explain it to anybody else, and they won’t be convinced by me saying that I trust the people who say AI alignment is a big issue.
Agreed. It’s just a separate question.
and I sloppily merged the two together in 8, which, thanks to FinalFormal2’s and others’ comments, I no longer believe needs to be a necessary belief of AGI pessimists.
I find point no. 4 weak.
I worry that when people reason about utility functions, they’re relying upon the availability heuristic. When people try to picture “a random utility function”, they’re heavily biased in favor of the kind of utility functions they’re familiar with, like paperclip-maximization, prediction error minimization, or corporate profit-optimization.
How do we know that a random sample from utility-function-space looks anything like the utility functions we’re familiar with? We don’t. I wrote a very short story to this effect. If you can retroactively fit a utility function to any sequence of actions, what predictive power do we gain by including utility functions into our models of AGI?
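The "retroactive fit" point can be made concrete with a small sketch. Given any observed sequence of actions, one can write down a utility function that scores exactly that sequence highest, so a maximizer of that function would reproduce the behaviour. The action names and the brute-force maximizer below are hypothetical, chosen only to keep the example runnable.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Action = str
Trajectory = Tuple[Action, ...]

def rationalizing_utility(observed: Sequence[Action]) -> Callable[[Trajectory], float]:
    """Return a utility function under which `observed` is the unique optimum:
    1 for exactly that trajectory, 0 for every other one."""
    target = tuple(observed)

    def utility(trajectory: Trajectory) -> float:
        return 1.0 if tuple(trajectory) == target else 0.0

    return utility

def best_trajectory(utility: Callable[[Trajectory], float],
                    actions: Sequence[Action], length: int) -> Trajectory:
    """Brute-force maximizer over every action sequence of the given length."""
    return max(product(actions, repeat=length), key=utility)

ACTIONS = ("left", "right", "wait")
observed = ("wait", "left", "left", "right")   # arbitrary, made-up behaviour

u = rationalizing_utility(observed)
recovered = best_trajectory(u, ACTIONS, len(observed))
print(recovered == observed)
```

Because the construction works for every possible sequence, the bare statement "this agent maximizes some utility function" rules nothing out; any predictive power has to come from additional assumptions about which utility functions are likely.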
Coherence arguments imply a force for goal-directed behavior.
I endorse Rohin Shah’s response to that post.
This seems like a very different position from the one you just gave:
I took you to be saying, ‘You can retroactively fit a utility function to any sequence of actions, so we gain no predictive power by thinking in terms of utility functions or coherence theorems at all. People worry about paperclippers not because there are coherence pressures pushing optimizers toward paperclipper-style behavior, but because paperclippers are a vivid story that sticks in your head.’
Humans exist.
The next century is consensus, I think, and arguments against the next few years are not on the level where I would be comfortable saying “well, it wouldn’t happen, so it’s ok to try really hard to do it anyway”.
I guess the problem here is that by the most natural metrics the best way for AGI to serve its function provably leads to catastrophic results. So you either need to not try very hard, or precisely specify human values from the beginning.
Not sure what the difference from 3 is - that’s just the definition of “unaligned”?
Even if we win against the first AGI, we are then in a situation where AGI has been proven to everyone to be possible, and probably easy to scale to uncontainable levels.
I don’t think anyone claims to have a solution that works in a non-optimistic scenario?
There are also related considerations like “aligning something non-pivotal doesn’t help much”.
The more seriously researchers take the threat, the more people will notice, and then someone will combine techniques from the latest accessible papers on new hardware and it will work.
I mean, “doomed” means there aren’t many drastic actions left to take ^^.
Birds exist, but cannot create artificial flight.
Being an X is a guarantee that an X is possible, but not a guarantee that an X can replicate itself.
Downvoted as I find this comment uncharitable and rude.
The laws of physics in our particular universe make fission/fusion release of energy difficult enough that you can’t ignite the planet itself. (well you likely can, but you would need to make a small black hole, let it consume the planet, then bleed off enough mass that it then explodes. Difficult).
Imagine a counterfactual universe where you could, and the Los Alamos test ignited the planet and that was it.
My point is that we do not actually know yet how ‘somewhat superintelligent’ AIs will fail. They may ‘quench’ themselves like fission devices do—fission devices blast themselves apart and stop reacting, and almost all elements and isotopes won’t fission. Somewhat superintelligent AGIs may expediently self hack their own reward function to give them infinite reward, shortly after box escape, and thus ‘quench’ the explosion in a quick self hack.
So our actual survival unfortunately probably depends on luck. It depends not on what any person does, but on the laws of nature. In a world where a fission device will ignite the planet, we’d be doomed—there is nothing anyone could do to ‘align’ fission researchers not to try it. Someone would try it and we’d die. If AGI is this dangerous, yeah, we’re doomed.
In this world a society like dath ilan would still have a good chance at survival.
Perhaps, although it isn’t clear that evolution could create living organisms smart enough to create such an optimal society. We’re sort of the ‘minimum viable product’ here: we have just enough hacks on the precursor animals to be able to create a coordinated civilization at all, and imperfectly. Aka ‘the stupidest animals capable of civilization’. Current events show as much, with entire groups engaging in mass delusion in a world of trivial access to information.
AI civilizations have a higher baseline and may just be better successors.
The bottom line is: nobody has a strong argument in support of the inevitability of the doom scenario (If you have it, just reply to this with a clear and self contained argument.).
From what I’m reading in the comments and in other papers/articles, it’s a mixture of beliefs, extrapolations from known facts, reliance on what “experts” said, and cherry-picking. Add the fact that bad/pessimistic news travels and spreads faster than boring good news.
A sober analysis establishes that super-AGI can be dangerous (indeed, there are no theorems forbidding this either); what’s unproven is that it will be HIGHLY LIKELY to be a net minus for humanity. Even admitting that alignment is not possible, it’s not clear why humanity’s and super-AGI’s goals should be in conflict, and not just different. Even admitting that they are highly likely to be in conflict, it is not clear why strategies to counter this cannot be effective (e.g. partnering up with a “good” super-AGI).
Another factor often forgotten is that what we mean by “humanity” today may not have the same meaning once we have technologies like AGIs, mind uploading or intelligence enhancement. We may literally become those AIs.
Because unchecked convergent instrumental goals for AGI are already in conflict with humanity’s goals. As soon as you realize humanity may have reasons to want to shut down/restrain an AGI (through whatever means), this gives the AGI grounds to wipe out humanity.
That seems like extremely limited, human thinking. If we’re assuming a super powerful AGI, capable of wiping out humanity with high likelihood, it is also almost certainly capable of accomplishing its goals despite our theoretical attempts to stop it without needing to kill humans. The issue, then, is not fully aligning AGI goals with human goals, but ensuring it has “don’t wipe out humanity, don’t cause extreme negative impacts to humanity” somewhere in its utility function. Probably doesn’t even need to be weighted too strongly, if we’re talking about a truly powerful AGI. Chimpanzees presumably don’t want humans to rule the world—yet they have made no coherent effort to stop us from doing so, probably haven’t even realized we are doing so, and even if they did we could pretty easily ignore it.
“If something could get in the way (or even wants to get in the way, whether or not it is capable of trying) I need to wipe it out” is a sad, small mindset and I am entirely unconvinced that a significant portion of hypothetically likely AGIs would think this way. I think AGI will radically change the world, and maybe not for the better, but extinction seems like a hugely unlikely outcome.
Why would it “want” to keep humans around? How much do you care about whether or not you move dirt while you drive to work? If you don’t care about something at all, it won’t factor in to your choice of actions[1]
I know I phrased this tautologically, but I think the idiom will be clear. If not, just press me on it more. I think this is the best way to get the message across or I wouldn’t have done it.
Some sort of general value for life, or a preference for decreased suffering of thinking beings, or the off chance we can do something to help (which I would argue is almost exactly the same low chance that we could do something to hurt it). I didn’t say there wasn’t an alignment problem, just that AGI whose goals don’t perfectly align with those of humanity in general isn’t necessarily catastrophic. Utility functions tend to have a lot of things they want to maximize, with different weights. Ensuring one or more of the above ideas is present in an AGI is important.
I think that if we can reliably incorporate that into a machine’s utility function, we’d be most of the way to alignment, right?
I gather the problem is that we cannot reliably incorporate that, or anything else, into a machine’s utility function: if it can change its source code (which would be the easiest way for it to bootstrap itself to superintelligence), it can also change its utility function in unpredictable ways. (Not necessarily on purpose, but the utility function can take collateral damage from other optimizations.)
I’m glad you started this thread: to someone like me who doesn’t follow AI safety closely, the argument starts to feel like, “Assume the machine is out to get us, and has an unstoppable ‘I Win’ button...” It’s worth knowing why some people think those are reasonable assumptions, and why (or if) others disagree with them. It would be great if there was an “AI Doom FAQ” to cover the basics and get newbies and dilettantes up to speed.
I’d recommend https://www.lesswrong.com/posts/LTtNXM9shNM9AC2mp/superintelligence-faq as a good starting point for newcomers.
An excellent primer—thank you! I hope Scott revisits it someday, since it sounds like recent developments have narrowed the range of probable outcomes.
If humans are capable of building one AGI, they certainly would be capable of building a second one, which could have goals unaligned with the first one.
I assume that any unrestrained AGI would pretty much immediately exert enough control over the mechanisms through which an AGI might take power (say, the internet, nanotech, whatever else it thinks of) to ensure that no other AI could do so without its permission. I suppose it is plausible that humanity is capable of threatening an AGI through the creation of another, but that seems rather unlikely in practice. First-mover advantage is incalculable to an AGI.
As I see it, nobody is afraid of “alpine village life maximization” the way some are afraid of “paper-clip maximization”. Why is that? I wouldn’t mind very much a rogue superintelligence which tiles the Universe with alpine villages. In past discussions, that would be “astronomical waste”; now it’s not even in the cards anymore? We are doomed to die, and not to be “bored for billions of years in a nonoptimal scenario”. Interesting.
Right now no one knows how to maximize either paper clips or alpine villages. The first thing we know how to do will probably be some poorly-understood recursively self-improving cycle of computer code interacting with other computer code. Then the resulting intelligence will start converging on some goal and converge on capabilities to optimize it extremely powerfully. The problem is that that emergent goal will be a lot more random and arbitrary than an alpine village. Most random things that this process can land on look like a paper clip in how devoid of human value they are, not like an alpine village which has a very significant amount of human value in it.
I know that “Right now no one knows how to maximize either paper clips …”. I know. But paper clips have been the official currency of these debates for almost 20 years now. Suddenly they aren’t, just because “right now no one knows how to”?
And then, you are telling me what is to be done first and how?
Yes it’s an important insight that paper clips are a representative example of a much bigger and simpler space of optimization targets than alpine villages.
Sure, but “alpine villages”, or something like them, were called “astronomical waste” in MIRI’s language from the old days, when the “fun space”, as they called it, was nearly infinite. Now they say its volume is almost certainly zero.
I see no problems with your list. I would add that creating corrigible superhumanly intelligent AGI doesn’t necessarily solve the AI Control Problem forever because its corrigibility may be incompatible with its application to the Programmer/Human Control Problem, which is the threat that someone will make a dangerous AGI one day. Perhaps intentionally.
A desire to understand the arguments is admirable.
Wanting to actually be convinced that we are in fact doomed is a dereliction of duty.
Karl Popper wrote that
Only those who believe success is possible will work to achieve it. This is what Popper meant by “optimism is a duty”.
We are not doomed. We do face danger, but with effort and attention we may yet survive.
I am not as smart as most of the people who read this blog, nor am I an AI expert. But I am older than almost all of you. I’ve seen other predictions of doom, sincerely believed by people as smart as you, come and go. Ideology. Nuclear war. Resource exhaustion. Overpopulation. Environmental destruction. Nanotechnological grey goo.
One of those may yet get us, but so far none has, which would surprise a lot of people I used to hang around with. As Edward Gibbon said, “however it may deserve respect for its usefulness and antiquity, [prediction of the end of the world] has not been found agreeable to experience.”
One thing I’ve learned with time: Everything is more complicated than it seems. And prediction is difficult, especially about the future.
Other people have addressed the truth/belief gap. I want to talk about existential risk.
We got EXTREMELY close to extinction with nukes, more than once. Launch orders in the Cold War were given and ignored or overridden three separate times that I’m aware of, and probably more. That risk has declined but is still present. The experts were 100% correct and their urgency and doomsday predictions were arguably one of the reasons we are not all dead.
The same is true of global warming, and again there is still some risk. We probably got extremely lucky in the last decade and happened upon the right tech and strategies and got decent funding to combat climate change such that it won’t reach 3+ degrees deviation, but that’s still not a guarantee and it also doesn’t mean the experts were wrong. It was an emergency, it still is, the fact that we got lucky doesn’t mean we shouldn’t have paid very close attention.
The fact that we might survive this potential apocalypse too is not a reason to act like it is not a potential apocalypse. I agree that empirically, humans have a decent record at avoiding extinction when a large number of scientific experts predict its likelihood. It’s not a great record, we’re like 4-0 depending on how you count, which is not many data points, but it’s something. What we have learned from those experiences is that the loud and extreme actions of a small group of people who are fully convinced of the risk are sometimes enough to shift the inertia of a large society only vaguely aware of the risk, and so avoid catastrophe by a hair’s breadth. We might need to be that group.
I want to be convinced of the truth. If the truth is that we are doomed, I want to know that. If the truth is that fear of AGI is yet another false eschatology, then I want to know that as well. As such, I want to hear the best arguments that intelligent people make, for the position they believe to be true. This post is explicitly asking for those who are pessimistic to give their best arguments, and in the future, I will ask the opposite.
I fully expect the world to be complicated.
Fair enough. If you don’t have the time/desire/ability to look at the alignment problem arguments in detail, going by “so far, all doomsday predictions turned out false” is a good, cheap, first-glance heuristic. Of course, if you eventually manage to get into the specifics of AGI alignment, you should discard that heuristic and instead let the (more direct) evidence guide your judgement.
Talking about predictions, there was an AI winter a few decades ago, when most predictions of rapid AI progress turned out completely wrong. But recently, it’s the opposite trend that dominates: it’s the predictions that downplay progress in AI capabilities that turn out wrong. What does your model say you should conclude about that?
Your Wise-sounding complacent platitudes likewise.
FWIW, I too am older than almost everyone else here. However, I do not cite my years as evidence of wisdom.
I don’t think that’s a fair assessment of what they said. They cite their years as evidence that they witnessed multiple doomsday predictions that turned out wrong. That’s a fine point.
I witnessed them as well, and they don’t move my needle back on the dangers of AI. Referring to them is pure outside view, when what is needed here is inside view, because when no-one does that, no-one does the actual work.
Actually I fully agree with that. I just have the impression that your choice of words suggested that Dave was being lazy or not fully honest, and I would disagree with that. I think he’s probably honestly laying his best arguments for what he truly believes.
I certainly wasn’t intending any implication of dishonesty. As for laziness, well, we all have our own priorities. Despite taking the AGI threat more seriously than Dave Lindbergh, I am not actually doing any more about it than he is (presumably nothing), as I find myself at a loss for any practical ideas for addressing it.
FWIW, I didn’t say anything about how seriously I take the AGI threat—I just said we’re not doomed. Meaning we don’t all die in 100% of future worlds.
I didn’t exclude, say, 99%.
I do think AGI is seriously fucking dangerous and we need to be very very careful, and that the probability of it killing us all is high enough to be really worried about.
What I did try to say is that if someone wants to be convinced we’re doomed (== 100%), then they want to put themselves in a situation where they believe nothing anyone does can improve our chances. And that leads to apathy and worse chances.
So, a dereliction of duty.
The relevant subset of people who are smarter than you is the people who have relevant industry experience or academic qualifications.
There is no form of smartness that makes you equally good at everything.
Given the replication crisis, blind deference to academic qualifications is absurd. While there are certainly many smart PhDs, a piece of paper from a university does not automatically confer either intelligence or understanding.
That doesn’t mean there’s anything better. You probably take your medical problems to a doctor, not an unqualified smart person.
...are you new here?
LW users will use doctors but are also quite likely to go to uncredentialed smart people for advice. Posts on DIY covid vaccines were extremely well received. I know two community members who had cancer, both of whom commissioned private research and feel it led to better outcomes for them (treatment was still done by doctors, but this informed who they saw and what they chose). The covid tag is full of people giving advice that was later vindicated by public health.
LessWrong has thought about this trade-off and definitively come down on the side of “let uncredentialed smart people take a shot”, knowing that those people face a lot of obstacles to doing good work.
Which would be a refutation of my comment if I had said “definitely” instead of “probably”.
The issue is primarily one of signalling. For example, the ratio of medically qualified/unqualified doctors is vastly higher than the ratio of medically qualified/unqualified car owners in Turkey or whatever. Having a PhD is one of the best quick signals of qualification around, but if you happen to know an individual who isn’t a doctor but who has spent years of their life studying some obscure disease (perhaps after being a patient, or they’re autistic and it’s just their Special Interest or whatever), I’m going to value their thoughts on the topic quite highly as well, perhaps even higher than those of a random doctor whose quality I have not yet had a chance to ascertain.
Exactly this. Also, doctors are supposed to actually heal patients, and get some degree of real world feedback in succeeding or failing to do so. That likely puts them above most academics, whose feedback is often purely in being published or not, cited or not, by other academics in a circlejerk divorced from reality.
That description could apply to a certain rationality website.
Certainly it could, and at times does. In our defense, however, we do not make our living this way. It’s all too easy for people to push karma around in a circle divorced from reality, but plenty of people feel free to criticize Less Wrong here, as you just neatly demonstrated. There’s a much stronger incentive to follow the party line in academia where dissent, however true or useful, can curtail promotion or even get one fired.
If we were making our living off of karma, your comparison would be entirely apt, and I’d expect to see the quality of discussion drop sharply.
Everything you say is true, and I agree. But let’s not discount the pull towards social conformity that karma has, and the effect evaporative cooling of social groups has in terms of radicalizing community norms. You definitely get a lot further here by defending and promoting AI x-risk concerns than by dismissing or ignoring them.
That does tend to happen, yes, which is unfortunate. What would you suggest doing to reduce this tendency? (It’s totally fine if you don’t have a concrete solution of course, these sorts of problems are notoriously hard.)
Karma should not be visible to anyone but mods, to whom it serves as a distributed mechanism for catching their attention and not much else. Large threads could use karma to decide which posts to initially display, but for smaller threads comments should be chronological.
People should be encouraged to post anonymously, as I am doing. Unfortunately the LW forum software devs are reverting this capability, which is a step backwards.
Get rid of featured articles and sequences. I mean keep the posts, but don’t feature them prominently on the top of the site. Have an infobar on the side maybe that can be a jumping off point for people to explore curated content, but don’t elevate it to the level of dogma as the current site does.
Encourage rigorous experimentation to verify one’s belief. A position arrived at through clever argumentation is quite possibly worthless. This is a particular vulnerability of this site, which is built around the exchange of words not physical evidence. So a culture needs to be developed which demands empirical investigation of the form “I wondered if X is true, so I did A, B, and C, and this is what happened...”
That was five minutes of thinking on the subject. I’m sure I could probably come up with more.
Ignoring the concerns basically means not participating in any of the AI x-risk threads. I don’t think it would be held against anyone to simply stay out.
https://www.lesswrong.com/posts/X3p8mxE5dHYDZNxCm/a-concrete-bet-offer-to-those-with-short-ai-timelines would be a post arguing against AI x-risk concerns, and it has more than three times the karma of any other post published the same day.
Well, we were getting paid for karma the other week, so…. (This is mostly a joke; I get that was an April Fool’s thing 🙃)
Exactly this. It takes a lot of effort to become competent through an unconventional route, and it takes a lot of effort to separate the unqualified competent person from the crank.
You agree that it is the case, as I previously said, that what you are looking for is not generic smartness, but some domain specific thing that substitutes for conventional domain specific knowledge.
Researching a disease that you happen to have is one of them, but is clearly not the same thing as all-conquering generic smartness. Such an individual has nothing like the breadth of knowledge an MD has, even if they have more depth in one precise area.
Why the extreme downvotes here? This seems like a good point, at least generally speaking, even if you disagree with what the exact subset should be. Upvoted.
Here’s the quote again:
I think that it’s possible for people without relevant industry experience or academic qualifications to say correct things about AGI risk, and I think it’s possible for people with relevant industry experience or academic qualifications to say stupid things about AGI risk.
For one thing, the latter has to be true, because there are people with relevant industry experience or academic qualifications who vehemently disagree about AGI risk with other people with relevant industry experience or academic qualifications. For example, if Yann LeCun is right about AGI risk then Stuart Russell is utterly dead wrong about AGI risk and vice-versa. Yet both of them have impeccable credentials. So it’s a foregone conclusion that you can have impeccable credentials yet say things that are dead wrong.
For another thing, AGI does not exist today, and therefore it’s far from clear that anyone on earth has “relevant” industry experience. Likewise, I’m pretty confident that you can spend 6 years getting a PhD in AI or ML without hearing literally a single word or thinking a single thought about AGI risk, or indeed AGI in general. You’re welcome to claim that the things everyone learns in CS grad school (e.g. knowledge about the multi-armed bandit problem, operating systems design, etc.) are helpful for evaluating whether the instrumental convergence hypothesis is true or false. But you need to make that argument—it’s not obvious, and I happen to think it’s mostly not true. Even if it were true, it’s obviously possible for someone to know everything in the CS grad school curriculum without winding up with a PhD, and if they do, why then wouldn’t we listen to what they have to say?
For another thing, I think that smart careful outsiders with good epistemics and willingness to invest time etc. are very far from helpless in evaluating technical questions in someone else’s field of expertise. For example, I think Zvi has acquitted himself well in his weekly analysis of COVID, despite being neither an epidemiologist nor a doctor. He was consistently saying things that became common knowledge only weeks or months later. More generally, the CDC and WHO are full of people with impeccable credentials, and lesswrong is full of people without medical or public health credentials, but I feel confident saying that lesswrong users have been saying more accurate things about COVID than the CDC or WHO have, throughout the pandemic. (Examples: the fact that handwashing is not very helpful for COVID prevention, but ventilation and masks are very helpful—these were common knowledge on lesswrong loooong before the CDC came around.) As another example, I recall hearing evidence that superforecasters can make forecasts that are about as accurate as domain experts on the topic of that forecast.
Anyway, the quote above is giving me the vibe that if someone (e.g. Eliezer Yudkowsky) has neither an AI PhD nor industry experience, then he’s automatically wrong and stupid, and we don’t need to waste our time listening to what he has to say and evaluating his arguments. I strongly disagree with that vibe, and suspect that the downvotes came from people feeling similarly. If that vibe is not what was intended, then maybe you or TAG can rephrase.
I get your view (thanks for your reply!), and tend to agree now. Even though I didn’t necessarily agree with TAG’s subset proposal, I didn’t see why the comment in question should receive so many downvotes—but
makes sense, thanks!
Of course it’s possible. It’s just not likely.
Of course that’s possible. The point is the probabilities, not the possibilities.
And people without relevant industry experience also disagree.
If the experts disagree, that’s not evidence that the non-experts agree... or know what they are talking about.
No one does if there is a huge leap from AI to AGI. “No one” would include Yudkowsky. Also, if there is a huge leap from AI to AGI, then we are not in trouble soon.
No, just probably. But you already believe that, in the general case... you don’t believe that some unqualified and inexperienced person should take over your health, financial or legal affairs. I’m not telling you anything you don’t know already.
I feel like everything you’re saying is attacking the problem of
“How do you read somebody’s CV and decide whether or not to trust them?”
This problem is a hard problem, and I agree that if that’s the problem we face, there’s no good solution, and maybe checking their credentials is one of the least bad of the many bad options.
But that’s not the problem we face! There’s another path! We can decide who to trust by listening to the content of what they’re saying, and trying to figure out if it’s correct. Right??
Right. Please start doing so.
Please start noticing that much of EY’s older work doesn’t even make a clear point. (What actually is his theory of consciousness? What actually is his theory of ethics?) Please start noticing that Yudkowsky’s newer work consists of hints at secret wisdom he can’t divulge. Please start noticing the objections to EY’s postings that can be found in the comments to the sequences. Please understand that you can’t judge how correct someone is by ignoring or vilifying their critics: criticism from others, and how they deal with it, is the single most valuable resource in evaluating someone’s epistemological validity. Please understand that you can’t understand someone by reading them in isolation. Please read something other than the sequences. Please stop copying ingroup opinions as a substitute for thinking. Please stop hammering the downvote button as a substitute for thinking.
I was arguing against this comment that you wrote above. Neither your comment nor anything in my replies was about Eliezer in particular, except that I brought him up as an example of someone who happens to lack a PhD and industry experience (IIRC).
It sounds like you read some of Eliezer’s writing, and tried to figure out if his claims were right or wrong or incoherent. Great! That’s the right thing to do.
But it makes me rather confused how you could have written that comment above.
Suppose you had originally said “I disagree that Eliezer is smarter than you, because whenever his writing overlaps with my areas of expertise, I find that he’s wrong or incoherent. Therefore you should be cautious in putting blind faith in his claims about AGI.” I would think that that’s a pretty reasonable thing to say. I mean, it happens not to match my own assessment (I mean the first part of the quote; of course I agree about the “blind faith” part of the quote), but it’s a valuable contribution to the conversation, and I certainly wouldn’t have downvoted if you had said that.
But that’s not what you said in your comment above. You said “The relevant subset of people who are smarter than you is the people who have relevant industry experience or academic qualifications.” That seems to be a very general statement about best practices to figure out what’s true and false, and I vehemently disagree with it, if I’m understanding it right. And maybe I don’t understand it right! After all, it seems that you yourself don’t behave that way.
Figuring out whether someone has good epistemology, from first principles, is much harder than looking at obvious data like qualifications and experience. Not many people have the time to do it even in a few select cases, and no one has the ability to do it in every case. For practical purposes, you need to go by qualifications and experience most of the time, and you do.
How correlated are qualifications and good epistemology? Some qualifications are correlated enough that it’s reasonable to trust them. As you point out, if a doctor says I have strep throat, I trust that I have strep, and I trust the doctor’s recommendations on how to cure it. Typically, someone with an M.D. knows enough about such matters to tell me honestly and accurately what’s going on. But if a doctor starts trying to push Ivermectin/Moderna*, I know that could easily be the result of politics, rather than sensible medical judgement, and having an M.D. hardly immunizes one against political mind-killing.
I am not objecting, and I doubt anyone who downvoted you was objecting, to the practice of recognizing that some qualifications correlate strongly with certain types of expertise, and trusting accordingly. However, it is an empirical fact that many scientific claims from highly credentialed scientists did not replicate. In some fields, this was a majority of their supposed contributions. It is a simple fact that the world is teeming with credentials that don’t, actually, provide evidence that their bearer knows anything at all. In such cases, looking to a meaningless resume because it’s easier than checking their actual understanding is the Streetlight Fallacy. It is also worth noting that expertise tends to be quite narrow, and a person can be genuinely excellent in one area and clueless in another. My favorite example of this is Dr. Hayflick, discoverer of the Hayflick Limit, attempting to argue that anti-aging is incoherent. Dr. Hayflick is one of the finest biologists in the world, and his discovery was truly brilliant. Yet his arguments against anti-aging were utterly riddled with logical fallacies. Or Dr. Aumann, who is both a world-class game theorist and an Orthodox Jew.
If we trust academic qualifications without considering how anchored a field or institution is to reality, we risk ruling in both charlatans and genuinely capable people outside the area where they are capable. And if we only trust those credentials, we rule out anyone else who has actually learned about the subject.
*not to say that either of these is necessarily bad, just that tribal politics will tempt Red and Blue doctors respectively to push them regardless of whether or not they make sense.
What are the chances the first AGI created suffers a similar issue, allowing us to defeat it by exploiting that weakness? I predict if we experience one obvious, high-profile, and terrifying near-miss with a potentially x-class AGI, governance of compute becomes trivial after that, and we’ll be safe for a while.
The first AGI? Very high. The first superintelligence? Not so much.
Sure. But that’s not what you said in that comment that we’re talking about.
If you had said “If you don’t have the time and skills and motivation to figure out what’s true, then a good rule-of-thumb is to defer to people who have relevant industry experience or academic qualifications,” then I would have happily agreed. But that’s not what you said. Or at least, that’s not how I read your original comment.
Then you should probably (no pun intended) have mentioned that. Your original comment had quite a certain vibe.
Tetlock’s work does suggest that superforecasters can outperform people with domain expertise. The ability to synthesize existing information to make predictions about the future is not something that domain experts necessarily have in a way that makes them better than people who are skilled at forecasting.