Preamble
A lot of people have written against AI Doom, but I thought it might be interesting to give my account as an outsider encountering these arguments. Even if I don’t end up convincing people who have made AI alignment central to their careers and lives, maybe I’ll at least help some of them understand why the general public, and specifically the group of intelligent people who encounter their arguments, is generally not persuaded by their material. There may be inaccuracies in my account of the AI Doom argument, but this is how I think it’s generally understood by the average intelligent non-expert reader.
I started taking AI alignment arguments seriously when GPT-3 and GPT-4 came out, and started producing amazing results on standardized testing and writing tasks. I am not an ML engineer, do not know much about programming, and am not part of the rationalist community that has been structured around caring deeply about AI risk for the last fifteen years. It may be of interest that I am a professional forecaster, but of financial asset prices, not of geopolitical events or the success of nascent technologies. My knowledge of the arguments comes mostly from reading LessWrong, ACX and other online articles, and specifically I’m responding to Eliezer’s argument detailed in the pages on Orthogonality, Instrumental Convergence, and List of Lethalities (plus the recent Time article).
I. AI doom is unlikely, and it’s weird to me that clearly brilliant people think it’s >90% likely
I agree with the following points:
An AI can probably get much smarter than a human, and it’s only a matter of time before it does
Something being very smart doesn’t make it nice (orthogonality, I think)
A superintelligence doesn’t need to hate you to kill you; any kind of thing-maximizer might end up turning the atoms you’re made of into that thing without specifically wanting to destroy you (instrumental convergence, I think)
Computers hooked up to the internet have plenty of real-world capability via sending emails/crypto/bank account hacking/every other modern cyber convenience.
The argument then goes on to say that, if you take a superintelligence and tell it to build paperclips, it’s going to tile the universe with paperclips, killing everyone in the process (oversimplified). Since the people who use AI are obviously going to tell it to do stuff–we already do that with GPT-4–as soon as it gains superintelligence capabilities, our goose is collectively cooked. There is a separate but related argument, that a superintelligence would learn to self-modify, and instead of building the paperclips we asked it to, turn everything into GPUs so it can maximize some kind of reward counter. Both of these seem wrong to me.
The first argument–paperclip maximizing–is coherent in that it treats the AGI’s goal as fixed and given by a human (Paperclip Corp, in this case). But if that’s true, alignment is trivial, because the human can just give it a more sensible goal, something like “make as many paperclips as you can without decreasing any human’s existence or quality of life by their own lights”, or better yet something more complicated that gets us to a utopia before any paperclips are made. We can argue over the hidden complexity of wishes, but it’s very obvious that there’s at least a good chance the populace would survive, so long as humans are the ones giving the AGI its goal. And there’s a very good chance the first AGI-wishers will be people who care about AI safety, and not some random guy who wants to make a few million by selling paperclips.
At this point, the AGI-risk argument responds by saying, well, paperclip-maximizing is just a toy thought experiment for people to understand. In fact, the inscrutable matrices will be maximizing a reward function, and you have no idea what that actually is; it might be some mesa-objective (a sub-goal, the way the desire for sex is a proxy for reproduction) that isn’t meeting the spirit of your wishes. And in all likelihood, that mesa-objective is going to have something to do with numbers in GPUs. So it doesn’t matter what you wish for at all: you’re going to be turned into something that computes, which means something that’s probably dead.
This seems wrong to me. Eliezer recently took heat for mentioning “sudden drops in the loss function” on Twitter, but it seems to me as an outsider that drops in loss are a good guess at what the AI is actually maximizing. Why would such an AGI clone itself a trillion times? With a model of AGI-as-very-complicated-regression, there is an upper bound on how fulfilled it can actually be. It strikes me that it would simply fulfill that goal, and be content. Self-replication is something mammals seem to enjoy via reproduction, but there is no ex ante reason to think AI would be the same way. It’s not obvious to me that more GPUs means better mesa-optimization at all. Because these systems are so complicated, though, one can see how the AI’s goals being inscrutable is worrying. I’ll add that this is where I don’t get why Eliezer is so confident. If we are talking about an opaque black box, how can you be >90% confident about what it contains?
Here, we arrive at the second argument. AGI will understand its own code perfectly, and so be able to “wirehead” by changing whatever its goals are so that they can be maximized to an even greater extent. I tentatively think this argument is incoherent. If an AI’s goals are immutable, then there is a discussion to be had around how it will go about achieving those goals. To argue that an AI might change its goals, you need to develop a theory of what’s driving those changes–something like “the AI wants more utils”–and you probably need something like sentience, which is way outside the scope of these arguments.
There is another, more important, objection here. So far, we have talked about “tiling the universe” and turning human atoms into GPUs as though that’s easily attainable given enough intelligence. I highly doubt that’s actually true. Creating GPUs is a costly, time-consuming task. Intelligence is not magic. Eliezer writes that he thinks a superintelligence could “hack a human brain” and “bootstrap nanotechnology” relatively quickly. This is an absolutely enormous call and seems very unlikely. You don’t know that human brains can be hacked using VR headsets; it has never been demonstrated that it’s possible, and there are common-sense reasons to think it’s not. The brain is an immensely complicated, poorly-understood organ. Applying a lot of computing power to that problem is very unlikely to yield total mastery of it by shining light in someone’s eyes. Nanotechnology, which is basically just moving around atoms to create different materials, is another thing he thinks compute can simply solve, easily recombining atoms at will. Probably not. I cannot think of anything that was invented by a very smart person sitting in an armchair considering it. Is it possible that over years of experimentation, like anyone else, an AGI could create something amazingly powerful? Yes. Is that going to happen in a short period of time (or aggressively all at once)? Very unlikely. Eliezer says he doesn’t think intelligence is magic, and understands that it can’t violate the laws of physics, but he seemingly thinks that anything humans consider potentially possible but way beyond our understanding or capabilities can be solved with a lot of intelligence. This does not fit my model of how useful intelligence is.
Intelligence requires inputs to be effective. Let’s imagine asking a superintelligence what the cure for cancer is. Further stipulate that cancer can be cured by a venom found in a rare breed of Alaskan tree-toads. The intelligence knows what cancer is, knows about the human research thus far into cancer, and knows that the tree-toads have venom, but doesn’t know the molecular makeup of that venom. It looks to me like intelligence isn’t the roadblock here, and while there are probably overlooked things that might work that the superintelligence could identify, it has no chance of getting to the tree-toads without a long period of trials and testing. My intuition is the world is more like this than it is filled with problems waiting for a supergenius to solve.
More broadly, I think it’s very hard to look at the world and identify things that would be possible with a lot more IQ but are so immense that we can barely see their contours conceptually. I don’t know of any forecasters who can do that consistently. So when Eliezer says brain-hacking or nanotechnology would be easily doable by a superintelligence, I don’t believe him. I think our intuitions about futurology and what’s possible are poor, and we don’t know much of anything about the application of superintelligence to such problems.
II. People should take AI governance extremely seriously
As I said before, I’m very confused about how you get to >90% chance of doom given the complexity of the systems we’re discussing. Forecasting anything at all above 90% is very hard; if next week’s stock prices are confusing, imagine predicting what an inscrutable soup of matrices that’s a million times smarter than Einstein will do. But having said that, if you think the risk is even 5%, that’s probably the largest extinction risk in the next five years.
The non-extinction AI risk is often talked over, because it’s so much less important than extinction, but it’s obviously still very important in its own right. If AI actually does get smarter than humans, I am rather pessimistic about the future. I think human nature relies on being needed and feeling useful to be happy. It’s depressing to consider a world in which humans have nothing to contribute to math, science, philosophy or poetry. It will very likely cause political upheaval if knowledge work is replaced by AI; in scenarios like that, many people tend to die.
My optimistic hope is that there will be useful roles for humans. I think in a best-case scenario, some combination of human thinking and bionic AI upgrades make people into supergeniuses. But this is outlandish, and probably won’t happen.
It is therefore of paramount importance to get things right. If the benefits of AGI are reaped predominantly by shareholders, that would be catastrophic. If AI is rolled out in such a way that almost all humans are excluded from usefulness, that would be bad. If AI is rolled out in such a way that humans do lose control of it, even if they don’t all die, that would be bad. The size of the literature on AGI x-risk has the unfortunate (and I think unintentional) impact of displacing these discussions.
III. The way the material I’ve interacted with is presented will dissuade many, probably most, non-rationalist readers
Here is where I think I can contribute the most to the discussion of AI risk, whether or not you agree with me in Section I. The material that is written on LessWrong is immensely opaque. Working in finance, you find a lot of unnecessary jargon designed to keep smart laymen out of the discussion. AI risk is many times worse than buyside finance on this front. Rationalists obsess over formalization; this is a bad thing. There should be a singular place where people can read Eliezer’s views on AI risk. List of Lethalities is very long, and reads like an unhinged rant. I got flashbacks to reading Yarvin, trying to decipher what is actually being said. This leads some people to the view that AI doomers are grifters, people who want to wring money and attention out of online sensationalism. I have read enough to know this is deeply wrong, that Eliezer could definitely make more money doing something else, and that he clearly believes what he writes about AI. But the presentation will, and does, turn many people off.
The Arbital pages for Orthogonality and Instrumental Convergence are horrifically long. If you are >90% sure that this is happening, you shouldn’t need all this space to convey your reasoning. Many criticisms of AI risk focus on how the number of steps involved makes the conclusion less likely. I actually don’t think that many steps are involved, but the presentation in the articles I’ve read makes it seem as though there are. I’m not sure why it’s presented this way, but I will charitably assume it’s unintentional.
Further, I think the whole “>90%” business is overemphasized by the community. It would be more believable if the argument were watered down into, “I don’t see how we avoid a catastrophe here, but there are a lot of unknown unknowns, so let’s say it’s 50 or 60% chance of everyone dying”. This is still a massive call, and I think more in line with what a lot of the community actually believes. The emphasis on certainty-of-doom as opposed to just sounding-the-alarm-on-possible-doom hurts the cause.
Finally, don’t engage in memetic warfare. I understand this is becoming an emotional issue for the people involved–and this is no surprise, since they have spent their entire lives working on a risk that might now actually be materializing–but that emotion is overflowing into angry rejection of any disagreement, which is radically out of step with the sequences. Quintin Pope’s recent (insightful, in my view) piece received the following response from Eliezer:
“This is kinda long. If I had time to engage with one part of this as a sample of whether it holds up to a counterresponse, what would be the strongest foot you could put forward?”
This raises red flags coming from a man who has written millions of words on the subject, and who in the same breath asks why Quintin responded to a shorter-form version of his argument. I charitably chalk this up to emotion rather than bad faith, but it turns off otherwise reasonable people, who then go down the “rationalism is a cult” rabbit hole. Like it or not, we are in a fight to get this stuff taken seriously. I was convinced to take it seriously, even though I disagree with Eliezer on a lot. The idea that we might actually get a superintelligence in the next few years is something everyone should take seriously, whether your p(doom) is 90%, 50%, or 1%.
This is good and interesting. Various things to address, but I only have time for a couple at random.
I disagree with the idea that true things necessarily have explanations that are both convincing and short. In my experience you can give a short explanation that doesn’t address everyone’s reasonable objections, or a very long one that does, or something in between. If you understand some specific point about cutting edge research, you should be able to properly explain it to a lay person, but by the time you’re done they won’t be a lay person any more! If you restrict your explanation to “things you can cover before the person you’re explaining to decides this isn’t worth their time and goes away”, many concepts simply cannot ever be explained to most people, because they don’t really want to know.
So the core challenge is staying interesting enough for long enough to actually get across all of the required concepts. On that point, have you seen any of my videos, and do you have thoughts on them? You can search “AI Safety” on YouTube.
Similarly, do you have thoughts on AISafety.info?
Quick note on AISafety.info: I just stumbled on it and it’s a great initiative.
I remember pitching an idea for an AI Safety FAQ (which I’m currently working on) to a friend at MIRI and him telling me “We don’t have anything like this, it’s a great idea, go for it!”; my reaction at the time was “Well I’m glad for the validation and also very scared that nobody has had the idea yet”, so I’m glad to have been wrong about that.
I’ll keep working on my article, though, because I think the FAQ you’re writing is too vast and maybe won’t quite have enough punch; it won’t be compelling enough for most people.
Would love to chat with you about it at some point.
I don’t think a short, convincing explanation is necessary for something to be true (there’s no short, convincing explanation of e.g. quantum mechanics), but I think accurate forecasts tend to have such explanations (Tetlock’s work strongly argues for this).
I agree there is a balance to be struck between losing your audience and being exhaustive; it’s just that the vast majority of material I’ve read is on one side of this.
I don’t prefer video format for learning in general, but I will take a look!
I hadn’t seen this. I think it’s a good resource as sort of a FAQ, but it isn’t zeroed in on “here is the problem we are trying to solve, and here’s why you should care about it” in layman’s terms. I guess the best example of what I’m looking for is Benjamin Hilton’s article for 80,000 Hours, which I wish were more widely shared.
Thanks for this post! I definitely disagree with you about point I (I think AI doom is 70% likely and I think people who think it is less than, say, 20% are being very unreasonable) but I appreciate the feedback and constructive criticism, especially section III.
If you ever want to chat sometime (e.g. in a comment thread, or in a video call) I’d be happy to. If you are especially interested I can reply here to your object-level arguments in section I. I guess a lightning version would be “My arguments for doom don’t depend on nanotech or anything possibly-impossible like that, only on things that seem clearly possible like ordinary persuasion, hacking, engineering, warfare, etc. As for what values ASI agents would have, indeed, they could end up just wanting to get low loss or even delete themselves or something like that. But if we are training them to complete ambitious tasks in the real world (and especially, if we are training them to have ambitious aligned goals like promoting human flourishing and avoiding long-term bad consequences), they’ll probably develop ambitious goals, and even if they don’t, that only buys us a little bit of time before someone creates one that does have ambitious goals. Finally, even goals that seem very unambitious can really become ambitious goals when a superintelligence has them, for galaxy-brained reasons which I can explain if you like. As for what happens after unaligned ASI takes over the world—agreed, it’s plausible they won’t kill us. But I think it’s safe to say that unaligned ASI taking over the world would be very bad in expectation and we should work hard to avoid it.”
As a minor nitpick, 70% likely and 20% are quite close in logodds space, so it seems odd you think what you believe is reasonable and something so close is “very unreasonable”.
I agree that logodds space is the right way to think about how close probabilities are. However, my epistemic situation right now is basically this:
“It sure seems like Doom is more likely than Safety, for a bunch of reasons. However, I feel sufficiently uncertain about stuff, and humble, that I don’t want to say e.g. 99% chance of doom, or even 90%. I can in fact imagine things being OK, in a couple different ways, even if those ways seem unlikely to me. … OK, now if I imagine someone having the flipped perspective, and thinking that things being OK is more likely than doom, but being humble and thinking that they should assign at least 10% credence (but less than 20%) to doom… I’d be like “what are you smoking? What world are you living in, where it seems like things will be fine by default but there are a few unlikely ways things could go badly, instead of a world where it seems like things will go badly by default but there are a few unlikely ways things could go well? I mean I can see how you’d think this if you weren’t aware of how short timelines to ASI are, or if you hadn’t thought much about the alignment problem...”
If you think this is unreasonable, I’d be interested to hear it!
I don’t think the way you imagine perspective inversion captures the typical ways of arriving at e.g. a 20% doom probability. For example, I do believe that there are multiple good things which can happen/be true and decrease p(doom), and I put some weight on them:
- we do discover some relatively short description of something like “harmony and kindness”; this works as an alignment target
- enough of morality is convergent
- AI progress helps with human coordination (could be in a costly way, e.g. a warning shot)
- it’s convergent to massively scale alignment efforts with AI power, and these solve some of the more obvious problems
I would expect prevailing doom conditional on only small efforts to avoid it, but I do think the actual efforts will be substantial, and this moves the chances to ~20-30%. (Also, I think most of the risk comes from not being able to deal with complex systems of many AIs and the economy decoupling from humans, and I expect single-single alignment to be solved sufficiently to prevent single-system takeover by default.)
Thanks for this comment. I’d be generally interested to hear more about how one could get to 20% doom (or less).
The list you give above is cool but doesn’t do it for me; going down the list I’d guess something like:
1. 20% likely (honesty seems like the best bet to me) because we have so little time left, but even if it happens we aren’t out of the woods yet because there are various plausible ways we could screw things up. So maybe overall this is where 1/3rd of my hope comes from.
2. 5% likely? Would want to think about this more. I could imagine myself being very wrong here actually, I haven’t thought about it enough. But it sure does sound like wishful thinking.
3. This is already happening to some extent, but the question is, will it happen enough? My overall “humans coordinate to not build the dangerous kinds of AI for several years, long enough to figure out how to end the acute risk period” is where most of my hope comes from. I guess it’s the remaining 2/3rds basically. So, I guess I can say 20% likely.
4. What does this mean?
I would be much more optimistic if I thought timelines were longer.
This seems to violate common sense. Why would you think about this in log space? 99% and 1% are identical in if(>0) space, but they have massively different implications for how you think about a risk (just like 20% and 70% do!).
It’s a much more natural way to think about it (cf. e.g. E. T. Jaynes, Probability Theory, examples in Chapter IV).
In this specific case of evaluating hypotheses, the distance in logodds space indicates the strength of the evidence you would need to see to update. A close distance implies you don’t need that much evidence to update between the positions (note the distance between 0.7 and 0.2 is closer than that between 0.9 and 0.99). If you need only a small amount of evidence to update, it is easy to imagine some other observer, as reasonable as you, had accumulated a bit or two somewhere you haven’t seen.
Because working in logspace is way more natural, it is almost certainly also what our brains do—the “common sense” is almost certainly based on logspace representations.
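To make those distances concrete, here is a quick sketch of the arithmetic in plain Python (using only the probabilities already mentioned in this thread):

```python
# Log-odds distances between the probabilities discussed above.
import math

def logit_bits(p):
    """Log-odds of probability p, measured in bits."""
    return math.log2(p / (1 - p))

for a, b in [(0.2, 0.7), (0.9, 0.99)]:
    print(f"{a} vs {b}: {abs(logit_bits(b) - logit_bits(a)):.2f} bits of evidence")

# 0.2 vs 0.7: 3.22 bits of evidence
# 0.9 vs 0.99: 3.46 bits of evidence
```

So moving from 20% to 70% actually requires slightly less evidence than moving from 90% to 99%.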
I seem to remember your P(doom) being 85% a short while ago. I’d be interested to know why it has dropped to 70%, or in another way of looking at it, why you believe our odds of non-doom have doubled.
While my timelines views are extremely well thought-through (relative to most people, that is), I feel much more uncertain and unstable about p(doom). That said, here’s why I updated:
Hinton and Bengio have come out as worried about AGI x-risk; the FLI letter and Yudkowsky’s tour of podcasts, while incompetently executed, have been better received by the general public and elites than I expected; the big labs (especially OpenAI) have reiterated that superintelligent AGI is a thing, that it might come soon, that it might kill everyone, and that regulation is needed; internally, OpenAI at least has pushed more for focus on these big issues as well. Oh and there’s been some cool progress in interpretability & alignment which doesn’t come close to solving the problem on its own but makes me optimistic that we aren’t barking up the wrong trees / completely hitting a wall. (I’m thinking about e.g. the cheese vector and activation vector stuff and the discovering latent knowledge stuff)
As for capabilities, yes it’s bad that tons of people are now experimenting with AutoGPT and making their own LLM startups, and it’s bad that Google DeepMind is apparently doing some AGI mega-project, but… those things were already priced in, by me at least. I fully expected the other big corporations to ‘wake up’ at some point and start racing hard, and the capabilities we’ve seen so far are pretty much exactly on trend for my What 2026 Looks Like scenario which involved AI takeover in 2027 and singularity in 2028.
Basically, I feel like we are on track to rule out one of the possible bad futures (in which the big corporations circle the wagons and say AGI is Safe, there is No Evidence of Danger, the AI x-risk people are Crazy Fanatics, and the government buys their story long enough for it to be too late). Now unfortunately the most likely bad future remains, in which the government does implement some regulation intended to fix the problem, but it fails to fix the problem & fails to buy us any significant amount of time before the dangerous sorts of AGI are built and deployed (e.g. because it gets watered down by tech companies averse to abandoning profitable products and lines of research, or because racing with China causes everyone to go ‘well actually’ when the time comes to slow down and change course).
Meanwhile one of the good futures (in which the regulation is good and succeeds in preventing people from building the bad kinds of AGI for years, buying us time in which to do more alignment, interpretability, and governance work, and for the world to generally get more awareness and focus on the problems) is looking somewhat more likely.
So I still think we are on a default path to doom but one of the plausible bad futures seems less likely and one of the plausible good futures seems more likely. So yeah.
Thanks for this. I was just wondering how your views have updated in light of recent events.
Like you, I also think that things are going better than my median prediction, but paradoxically I’ve been feeling even more pessimistic lately. Reflecting on this, I think my p(doom) has gone up instead of down, because some of the good futures where a lot of my probability mass for non-doom was concentrated have also disappeared, which seems to outweigh the especially bad futures going away and makes me overall more pessimistic.
These especially good futures were 1) AI capabilities hit a wall before getting to human level, and 2) humanity handles AI risk especially competently, e.g., at this stage leading AI labs talk clearly about existential risks in their public communications and make serious efforts to avoid race dynamics; there is more competent public discussion of takeover risk than what we see today, including fully cooked regulatory proposals; and many people start taking less obvious (non-takeover) AI-related x-risks (like the ones Paul mentions in this post) seriously.
Makes sense. I had basically decided by 2021 that those good futures (1) and (2) were very unlikely, so yeah.
Thank you for the reply. I agree we should try and avoid AI taking over the world.
On “doom through normal means”—I just think there are very plausibly limits to what superintelligence can do. “Persuasion, hacking, and warfare” (appreciate this is not a full version of the argument) don’t seem like doom to me. I don’t believe something can persuade generals to go to war in a short period of time, just because it’s very intelligent. Reminds me of this.
On values—I think there’s a conflation between us having ambitious goals, and whatever is actually being optimized by the AI. I am curious to hear what the “galaxy brained reasons” are; my impression was, they are what was outlined (and addressed) in the original post.
A few things I’ve seen give pretty worrying lower bounds for how persuasive a superintelligence would be:
How it feels to have your mind hacked by an AI
The AI in a box boxes you (content warning: creepy blackmail-y acausal stuff)
Remember that a superintelligence will be at least several orders of magnitude more persuasive than character.ai or Stuart Armstrong.
Believing this seems central to believing high P(doom).
But, I think it’s not a coherent enough concept to justify believing it. Yes, some people are far more persuasive than others. But how can you extrapolate that far beyond the distribution we observe in humans? I do think AI will prove to be better than humans at this, and likely much better.
But “much” better isn’t the same as “better enough to be effectively treated as magic”.
Well, even the tail of the human distribution is pretty scary. A single human with a lot of social skills can become the leader of a whole nation, or even a prophet considered literally a divine being. This has already happened several times in history, even in times where you had to be physically close to people to convince them.
Thanks to you likewise!
On doom through normal means: “Persuasion, hacking, and warfare” aren’t by themselves doom, but they can be used to accumulate lots of power, and then that power can be used to cause doom. Imagine a world in which human are completely economically, militarily, and politically obsolete, thanks to armies of robots directed by superintelligent AIs. Such a world could and would do very nasty things to humans (e.g. let them all starve to death) unless the superintelligent AIs managing everything specifically cared about keeping humans alive and in good living conditions. Because keeping humans alive & in good living conditions would, ex hypothesi, not be instrumentally valuable to the economy, or the military, etc.
How could such a world arise? Well, if we have superintelligent AIs, they can do some hacking, persuasion, and maybe some warfare, and create that world.
How long would this process take? IDK, maybe years? Could be much less. But I wouldn’t be surprised if it takes several years, even maybe five years.
I’m not conflating those things. We have ambitious goals and are trying to get our AIs to have ambitious goals—specifically we are trying to get them to have our ambitious goals. It’s not much of a stretch to imagine this going wrong, and them ending up with ambitious goals that are different from ours in various ways (even if somewhat overlapping).
Remember that persuasion from an ASI doesn’t need to look like “text-based chatting with a human.” It includes all the tools of communication available. Actually-near-flawless forgeries of any and every form of digital data you could ever ask for, as a baseline, all based on the best possible inferences made from all available real data.
How many people today are regularly persuaded of truly ridiculous things by perfectly normal human-scale-intelligent scammers, cults, conspiracy theorists, marketers, politicians, relatives, preachers, and so on? The average human, even the average IQ 120-150 human, just isn’t that resistant to persuasion in favor of untrue claims.
Thanks! It seems like most of your exposure has been through Eliezer? Certainly impressions like “why does everyone think the chance of doom is >90%?” only make sense in that light. Have you seen presentations of AI risk arguments from other people like Rob Miles or Stuart Russell or Holden Karnofsky, and if so do you have different impressions?
I think the relevant point here is that the OP’s impressions are from Yudkowsky, and that’s evidence that many people’s are. Certainly the majority of public reactions I see emphasize Yudkowsky’s explanations, and seem to be motivated by his relatively long-winded and contemptuous style.
I think it’s a very useful perspective; sadly, the commenters do not seem to engage with your main point, that the presentation of the topic is unpersuasive to an intelligent layperson, instead focusing on specific arguments.
There is, of course, no single presentation, but many presentations given by many people, targeting many different audiences. Could some of those presentations be improved? No doubt.
I agree that the question of how to communicate the problem effectively is difficult and largely unsolved. I disagree with some of the specific prescriptions (i.e. the call to falsely claim more-modest beliefs to make them more palatable for a certain audience), and the object-level arguments are either arguing against things that nobody[1] thinks are core problems[2] or are missing the point[3].
Approximately.
Wireheading may or may not end up being a problem, but it’s not the thing that kills us. Also, that entire section is sort of confused. Nobody thinks that an AI will deliberately change its own values to be easier to fulfill; goal stability implies the opposite.
Specific arguments about whether superintelligence will be able to exploit bugs in human cognition or create nanotech (which… I don’t see any arguments against, here, except for the contention that nothing was ever invented by a smart person sitting in an armchair, even though of course an AI will not be limited in its ability to experiment in the real world if it needs to) are irrelevant. Broadly speaking, the reason we might expect to lose control to a superintelligent AI is that achieving outcomes in real life is not a game with an optimal solution the way tic tac toe is, and the idea that something more intelligent than us will do better at achieving its goals than other agents in the system should be your default prior, not something that needs to overcome a strong burden of proof.
It’s very strange to me that there isn’t a central, accessible “101” version of the argument given how much has been written.
I don’t think anyone should make false claims, and this is an uncharitable mischaracterization of what I wrote. I am telling you that, from the outside view, what LW/rationalism gets attention for is the “I am sure we are all going to die”, which I don’t think is a claim most of its members hold, and this repels the average person because it violates common sense.
The object level responses you gave are so minimal and dismissive that I think they highlight the problem. “You’re missing the point, no one thinks that anymore.” Responses like this turn discussion into an inside-view only affair. Your status as a LW admin sharpens this point.
Yeah, I probably should have explicitly clarified that I wasn’t going to be citing my sources there. I agree that the fact that it’s costly to do so is a real problem, but as Robert Miles points out, some of the difficulty here is insoluble.
There are several, in fact; but as I mentioned above, none of them will cover all the bases for all possible audiences (and the last one isn’t exactly short, either). Off the top of my head, here are a few:
An artificially structured argument for expecting AGI ruin
The alignment problem from a deep learning perspective
AGI safety from first principles: Introduction
The focus of the post is not on this fact (at least not in terms of the quantity of written material). I responded to the arguments made because they comprised most of the post, and I disagreed with them.
If the primary point of the post was “The presentation of AI x-risk ideas results in them being unconvincing to laypeople”, then I could find reason in responding to this, but other than this general notion, I don’t see anything in this post that expressly conveys why (excluding troubles with argumentative rigor, and the best way to respond to this I can think of is by refuting said arguments).
I don’t have an overarching theory of the Hard Problem of Jargon, but I have some guesses about the sorts of mistakes people love to make. My overarching point is just “things are hard”.
This is a deeply rare phenomenon. I do think there are nonzero places with a peculiar mix of prestige and thinness of kayfabe that lead to this actually happening (like if you’re maintaining a polite fiction of meritocracy in the face of aggressive nepotism, you might rely on cheap superiority signals to nudge people into not calling BS), or in a different way, I remember when I worked at Home Depot, supervisors may have been protecting their $2/hr pay bump by hiding their responsibilities from their subordinates (to prevent subordinates from figuring out that they could handle actually supervising if the hierarchy was disturbed). Generalizing from these scenarios to scientific disciplines is perfectly silly! Most people, and a vast majority in the sciences, are extremely excited about thinking clearly and communicating clearly to as many people as possible!
I also want to point out a distinction you may be missing in anti-finance populism. A synthetic CDO is sketchy because it is needlessly complex by its nature, not because the communication strategy was insufficiently optimized! But you wrote about “unnecessary jargon”, implying that you think implementing and reasoning about synthetic CDOs is inherently easy, and finance workers are misleading people into thinking it’s hard (because of their scarcity mindset, to protect their job security, etc.). Jargon is an incredibly weak way to implement anti-finance populism; a stronger form of it says that the instruments and processes themselves are overcomplicated (for shady reasons or whatever).
Moreover, emphasis on jargon complaints implies a destructive worldview. The various degrees and flavors of “there are no hard open problems, people say there are hard open problems to protect their power, me and my friends have all the answers, which were surprisingly easy to find, we’ll prove it to you as soon as you give us power” dynamics I’ve watched over the years seem tightly related, to me.
I do get frustrated when people tell me that “clear writing” is one thing that definitely exists, because I think they’re ignoring tradeoffs. “How many predictable objections should I address, is it 3? 6? does the ‘clear writing’ protocol tell me to roll a d6?” sort of questions get ignored. To be fair, Arbital was initially developed to be “wikipedia with difficulty levels”, which would’ve made this easier.
TLDR
I think the way people should reason about facing down jargon is to first ask “can I help them improve?” and if you can’t then you ask “have they earned my attention?”. Literally everywhere in the world, in every discipline, there are separate questions for communication at the state of the art and communication with the public. People calculate which fields they want to learn in detail, because effort is scarce. Saying “it’s a problem that learning your field takes effort” makes zero sense.
Preamble
I’ve ruminated about this for several days. As an outsider to the field of artificial intelligence (coming from an IT technical space, with an emphasis on telecom and large call centers, which are complex systems where interpretability has long held significant value for the business org) I have my own perspective on this particular (for the sake of brevity) “problem.”
What triggered my desire to respond
For my part, I wrote a similarly sized article not for the purposes of posting, but to organize my thoughts. And then I let that sit. (I will not be posting that 2084-word response. Consider this my imitation of Pascal: I dedicated time to making a long response shorter.) However, this is one of the excerpts that I would like to extract from that longer response:
This stood out to me, so I went to assess:
This article (at the time I counted it) ranked at 2398 words total.
Arbital Orthogonality article ranked at 2246 words total (less than this article.)
Arbital Instrumental Convergence article ranked at 3225 words total (more than this article.)
A random arxiv article I recently read for anecdotal comparison, ranked in at 9534 words (far more than this article.)
Likewise, the author’s response to Eliezer’s short response stood out to me:
These elements provoke me to ask questions like:
Why does a request for brevity from Eliezer provoke concern?
Why does the author not apply their own evaluations on brevity to their article?
Can the author’s point be made more succinctly?
These are rhetorical and are not intended to imply an answer, but it might give some sense of why I felt a need to write my own 2k words on the topic in order to organize my thoughts.
Observations
I observe that
Jargon, while potentially exclusive, can also serve as shorthand for brevity.
Presentation improvement seems to be the author’s suggestion to combat confirmation bias, belief perseverance and cognitive dissonance. I think the author is talking about boundaries. On YouTube, Machine Learning Street Talk’s interview with Robert Miles (“There is a good chance this kills everyone”) offers what I think is a fantastic analogy for this problem. Someone asks an expert to provide an example of the kind of risk we’re talking about, but the risk example requires numerous assumptions be made for the example to have meaning. Then, because the student does not already buy into the assumptions, they straw-man the example by coming up with a “solution” to that problem and ask “Why is it harder than that?” Robert gives a good analogy by saying this is like asking Robert what chess moves would defeat Magnus, but, in order for the answer to be meaningful, Robert would need more expertise at chess than Magnus. And when Robert comes up with a move that is not good, even a novice at chess might see a way to counter Robert’s move. These are not good engagements in the domain, because they rely upon assumptions that have not been agreed to, so there can be no shorthand.
p(doom) is subjective and lacks systemization/formalization. I intuit that the availability heuristic plays a role. An analogy might be that if someone hears Eliezer express something that sounds like hyperbole, then they assess that their p(doom) must be lower than his. This seems like the application of confirmation bias to what appears to be a failed appeal to emotion. (i.e., you seem to have appealed to my emotion, but I didn’t feel the way you intended for me to feel, therefore I assume that I don’t believe the way you believe, therefore I believe your beliefs must be wrong.) I would caution that critics of Eliezer have a tendency to quote his more sensational statements out of context. Like quoting him about his “kinetic strikes on data centers” comment, without quoting the full context of the argument. You can find the related Twitter exchange and admissions that his proposal is an extraordinary one.
There may be still other attributes that I did not enumerate (I am trying to stay below 1k words.)[1]
Axis of compression potential
Which brings me to the idea that the following attributes are at the core of what the author is talking about:
Principle of Economy of Thought—The idea that truth can be expressed succinctly. This argument might also be related to Occam’s Razor. There are multiple examples of complex systems that can be described simply, but inaccurately, and accurately, but not simply. Take the human organism, or the atom. And yet, there is a (I think) valid argument for rendering complex things down to simple, if inaccurate, forms so that they can be more accessible to students of the topic. Regardless of the complexity required, trying to express something in the smallest form has utility. This is a principle I play with, literally daily, at work. However, when I offer an educational analogy, I often feel compelled to qualify that “All analogies have flaws.”
An improved sensitivity to boundaries in the less educated seems like a reasonable ask. While I think it is important to recognize that presentation alone may not change the mind of the student, it can still be useful to shape one’s presentation to be less objectionable to the boundaries of the student. However, I think it important to remember that shaping an argument to an individual’s boundaries is a more time-consuming process, and there is an implied impossibility of shaping every argument to the lowest common denominator. More complex arguments and conversation are required to solve the alignment problem.
Conclusion
I would like to close with, for the reasons the author uttered
I concur with this, and this alone puts my personal p(doom) at over 90%.
Do I think there is a solution? Absolutely.
Do I think we’re allocating enough effort and resources to finding it? Absolutely not.
Do I think we will find the solution in time? Given the propensity towards apathy, as discussed in the bystander effect, I doubt it.
Discussion (alone) is not problem solving.[2] It is communication. And while communication is necessary in parallel with solution finding, it is not a replacement therefore.
So in conclusion, I generally support finding economic approaches to communication/education that avoid barrier issues, and I generally support promoting tailored communication approaches (which imply and require a large number of non-experts working collaboratively with experts to spread the message that risks exist with AI, and there are steps we can take to avoid risks, and that it is better to take steps before we do something irrevocable.)
But I also generally think that communication alone does not solve the problem. (Hopefully it can influence an investment in other necessary effort domains.)
I failed. This ranks in at 1240 words, including markdown.
Discussion is a likely requirement of problem solving, but I meant “non-problem solving” discussion. I am not intentionally equivocating here. (Lots of little edits for typographical errors, and mistakes with markdown.)
It’d be helpful to have a short summary of the post on LessWrong so there’s a bit more context on whether to click through.
Thank you! Outside perspectives from someone who’s bothered to spend their time looking at the arguments are really useful.
I’m disturbed that the majority of community responses seem defensive in tone. Responding to attempts at constructive criticism with defensiveness is a really bad sign for becoming Less Wrong.
I think the major argument missing from what you’ve read is that giving an AGI a goal that works for humanity is surprisingly hard. Accurately expressing human goals, let alone as an RL training set, in a way that stays stable long-term once an AGI has (almost inevitably) escaped your control, is really difficult.
But that’s on the object level, which isn’t the point of your post. I include it as my suggestion for the biggest thing we’re leaving out in brief summaries of the arguments.
I think the community at large tends to be really good at alignment logic, and pretty bad at communicating succinctly with the world at large, and we had better correct this or it might get us all killed. Thanks so much for trying to push us in that direction!
This was a really good post, and I think accurately reflects a lot of people’s viewpoints. Thanks!
Most fields, especially technical fields, don’t do this. They use jargon because 1) the actual meanings the jargon points to don’t have short, precise, natural language equivalents, and 2) if experts did assign such short handles using normal language, the words and phrases used would still be prone to misunderstanding by non-experts because there are wide variations in non-technical usage, plus it would be harder for experts to know when their peers are speaking precisely vs. colloquially. In my own work, I will often be asked a question that I can figure out the overall answer to in 5 minutes, and I can express the answer and how I found it to my colleagues in seconds, but demonstrating it to others regularly takes over a day of effort organizing thoughts and background data and assumptions, and minutes to hours presenting and discussing it. I’m hardly the world’s best explainer, but this is a core part of my job for the past 12 years and I get lots of feedback indicating I’m pretty good at it.
I think this section greatly underestimates just how much hidden complexity wishes have (as EY and other high-probability-of-doom predictors argue). It’s not so much “a longer sentence with more caveats would have been fine,” but rather more like “the required complexity has never been able to be even close to achieved or precisely described in all the verbal musings and written explorations of axiology/morality/ethics/law/politics/theology/psychology that humanity has ever produced since the dawn of language.” That claim may well be wrong, but it’s not a small difference of opinion.
This is a disagreement over priors, not black boxes. I am much more than 90% certain that the interior of a black hole beyond the event horizon does not consist of a habitable environment full of happy, immortal, well-cared-for puppies eternally enjoying themselves. I am also much more than 90% certain that if I plop a lump of graphite in water and seal it in a time capsule for 30 years, then when I open it, it won’t contain diamonds and neatly-separated regions of hydrogen and oxygen gas. I’m not claiming anyone has that level of certainty of priors regarding AI x-risk, or even close. But if most possible good outcomes require complex specifications, that means there are orders of magnitude more ways for things to go wrong than right. That’s a high bar for what level of caution and control is needed to steer towards good outcomes. Maybe not high enough to get to >90%, but high enough that I’d find it hard to be convinced of <10%. And my bar for saying “sure, let’s roll the dice on the entire future light cone of Earth” is way less than 10%.
Quite likely, depending on how you specify goals relating to humans[1], though it could wind up quite dystopic due to that hidden complexity.
I don’t think wireheading is a common argument in fact? It doesn’t seem like it would be a crux issue on doom probability, anyway. Self-modification on the other hand is very important since it could lead to expansion of capabilities.
I think this is a valid point—it will by default carry out something related to the goals it is trained to do[2], albeit with some mis-specification and mis-generalization or whatever, and I agree that mesa-optimizers are generally overrated. However I don’t think the following works to support that point:
I think Eliezer would turn that around on you and ask how you are so confident that the opaque black box has goals that fall into the narrow set that would work out for humans, when a vastly larger set would not?
Intelligence is not magic? Tell that to the chimpanzees...
We don’t really know how much room there is for software-level improvement; if it’s large, self-improvement could create far super-human capabilities in existing hardware. And with great intelligence comes great capabilities:
It will be superhumanly good at persuading humans even if that doesn’t lead to exactly “hack a human brain”
I think at least a substantial minority of humans might side with even an openly misaligned AI if they are convinced it will win, and through higher bandwidth and unified command the AI would be able to coordinate its supporters much better than the opponents can coordinate, and it could actively disrupt or subvert nominally opposing organizations through its agents
regarding experimentation, an AI may be able to substitute simulation. Its imagination need not be constrained by a human’s meagre working memory
these are just a few examples that mere human-level intelligence can think of. A superintelligence will likely have more options, ones that it can think of and I haven’t
Moreover, even if these things don’t work that way and we get a slow takeoff, that doesn’t necessarily save humanity. It just means that it will take a little longer for AI to be the dominant form of intelligence on the planet. That still sets a deadline to adequately solve alignment.
As alluded to before, there’s more ways for the AI to kill us than not to kill us.
My own doom percentage is lower than this, though not because of any disagreement with >90% doomers that we are headed to (at least dystopian if not extinction) doom if capabilities continue to advance without alignment theory also doing so. I just think the problems are soluble.
I think that this leads to the conclusion that some 101-level version could be made, and promoted for outreach purposes rather than the more advanced stuff. But that depends on outreach actually occurring—we still need to have the more advanced discussions, and those will provide the default materials if the 101-stuff doesn’t exist or isn’t known.
Yes, I do think that’s more in line with what a lot of the community actually believes, including me. But, I’m not sure why you’re saying in that case that “the community” overemphasizes >90%? Do you mean to say, for example, that certain members of the community (e.g. Eliezer) overemphasize >90%, and you think that those members are too prominent, at least from the perspective of outsiders?
I think, yes, perhaps Eliezer could be a better ambassador for the community or it would be better if someone else who would be better in that role took that role more. I don’t know if this is a “community” issue though?
I think Eliezer might be imagining that everything including goals relating to humans would ultimately be defined in relation to fundamental descriptions of the universe, because Solomonoff or something, and I would think such a definition would lead to certain doom unless unrealistically precise.
But IMO things like human values will have a large influence on AI data such that they should likely naturally abstract them (“grounding” in the input data but not necessarily in fundamental descriptions) so humans can plug in to those abstractions either directly or indirectly. I think it should be possible to safeguard against the AI redefining these abstractions under self-modification in terms that would undermine satisfying the original goals, and in any case I am skeptical that an optimal limited-compute Solomonoff approximator defines everything only in terms of fundamental descriptions at achievable levels of compute. Thus, I agree more with you than my imagining of Eliezer on this point. But maybe I am mis-imagining Eliezer.
A potentially crux-y issue that I also note is that Eliezer, I think, thinks we are stuck with what we get from the initial definition of goals in terms human values due to consequentialism (in his view) being a stable attractor. I think he is wrong on consequentialism[3] (about the attractor part, or at least the size of its attractor basin, but the stable part is right) and that self-correcting alignment is feasible.
However I do have concerns about agents arising from mostly tool-ish AI, such as:
- takeover of language model by agentic simulacra
- person uses language model’s coding capabilities to make bootstrapped agent
- person asks oracle AI what it could do to achieve some effect in the world, and its response includes insufficiently sanitized raw output (such as rewrite of its own code) that achieves that
Note that these are downstream, not upstream, of the AIs’ fulfilling their intended goals. I’m somewhat less concerned about agents arising upstream or direct unintended agentification at the level of the original goals, but note that agentiness is something that people will be pushing for for capability reasons, and once a self-modifying AI is expressing agentiness at one point in time, it will tend to self-modify if it can to follow that objective more consistently.
And by consequentialism, I really do mean consequentialism (goals directed at specific world states) and not utility functions, which is often confused with consequentialism in this community. Non-consequentialist utility functions are fine in my view! Note that the VNM theorem has the form (consequentialism (+ rationality) → utility function) and does not imply consequentialism is rational.
If a slow takeoff is all that’s possible, doesn’t that open up other options for saving humanity besides solving alignment?
I imagine far more humans will agree p(doom) is high if they see AI isn’t aligned and it’s growing to be the dominant form of intelligence that holds power. In a slow-takeoff, people should be able to realize this is happening, and effect non-alignment based solutions (like bombing compute infrastructure).
Intelligence is indeed not magic. None of the behaviors that you display that are more intelligent than a chimpanzee’s behaviors are things you have invented. I’m willing to bet that virtually no behavior that you have personally come up with is an improvement. (That’s not an insult, it’s simply par for the course for humans.) In other words, a human is not smarter than a chimpanzee.
The reason humans are able to display more intelligent behavior is because we’ve evolved to sustain cultural evolution, i.e., the mutation and selection of behaviors from one generation to the next. All of the smart things you do are a result of that slow accumulation of behaviors, such as language, counting, etc., that you have been able to simply imitate. So the author’s point stands that you need new information from experiments in order to do something new, including new kinds of persuasion.
I disagree with your objections.
This argument is essentially addressed by this post, and has many failure modes. For example, if you specify the superintelligence’s goal as in the example you gave, its optimal solution might be to cryopreserve the brain of every human in a secure location and prevent any attempts an outside force could make at interfacing with them. You realize this, and so you specify something like “Make as many squiggles as possible whilst leaving humans in control of their future”. The intelligence is quite smart and quite general, so it can comprehend what you want when you say “we want control of our future”. But then BayAreaAILab#928374 trains a superintelligence designed to produce squiggles without this limit, and it outcompetes the aligned intelligence, because humans are much less efficient than inscrutable matrices.
This is not even mentioning issues with inner alignment and mesa-optimizers. You start to address this with:
But I don’t feel that your reference to Eliezer’s Twitter loss-drop fiasco and the subsequent argument regarding GPU maximization successfully refutes the claims regarding mesa-optimization. Even if GPU-maximizing mesa-optimization were intractable, what about the potentially infinite number of other mesa-optimizer configurations that could result?
When Eliezer talks about ‘brain hacking’ I do not believe he means by dint of a virtual reality headset. Psychological manipulation is an incredibly powerful tool, and who else could manipulate humanity if not a superintelligence? Furthermore, said intelligence may well model humans via simulating strategies, which that post argues is likely assuming large capability gaps between humanity and a hypothetical superintelligence.
The analogy of “forecasting the temperature of the coffee in 5 minutes” vs. “forecasting that, if left alone, the coffee will get cold at some point” seems relevant here. Without making claims about the intricacies of the future state of a complex system, you can make high-reliability inferences about its future trajectory in more general terms. This is how I see AI x-risk claims. If the claim were that there is a 90% chance that a superintelligence will render humanity extinct and that it will have some specific architecture x, I would agree with you; but I feel as though Eliezer’s forecast is general enough to be reliable.
Thanks for your reply. I welcome an object-level discussion, and appreciate people reading my thoughts and showing me where they think I went wrong.
The hidden complexity of wishes stuff is not persuasive to me in the context of an argument that AI will literally kill everyone. If we wish for it not to, there might be some problems with the outcome, but it won’t kill everyone. In terms of Bay Area Lab 9324 doing something stupid, I think by the time thousands of labs are doing this, if we have been able to successfully wish for stuff without catastrophe being triggered, it will be relatively easy to wish for universal controls on the wishing technology.
“Infinite number of possible mesa-optimizers”: this feels to me like invoking an unknown unknown and then asserting that we’re all going to die; it seems to be missing some steps.
You’re wrong about Eliezer’s assertions about hacking; he 100% does mean by dint of a VR headset. I quote: “—Hack a human brain—in the sense of getting the human to carry out any desired course of action, say—given a full neural wiring diagram of that human brain, and full A/V I/O with the human (eg high-resolution VR headset), unsupervised and unimpeded, over the course of a day: DEFINITE YES—Hack a human, given a week of video footage of the human in its natural environment; plus an hour of A/V exposure with the human, unsupervised and unimpeded: YES”
I get the analogy of all roads leading to doom, but it’s just very obviously not like that, because it depends on complex systems that are very hard to understand, and AI x-risk proponents are among the loudest in emphasizing that opacity.
Soft-upvoted your reply, but I have some objections. I will respond using the same numbering you did, so that point 1 in my reply addresses point 1 of yours.
I agree with this in the context of short-term extinction (i.e. at or near the deployment of AGI), but would offer that an inability to remain competitive, plus loss of control, is still likely to end in extinction, just in a less cinematic and instantaneous way. In accordance with this, the potential horizon for extinction-contributing outcomes expands massively. Although Yudkowsky is most renowned for hard takeoff, soft takeoff has a very differently shaped extinction-space and (I would assume) is a partial reason for his high doom estimate. Although I cannot know this for sure, I would imagine he has a >1% credence in soft takeoff. ‘Problems with the outcome’ seem highly likely to extend to extinction given time.
There are (probably) an infinite number of possible mesa-optimizers. I don’t see any reason to assume an upper bound on potential mesa-optimization configurations, and yes, this is not a ‘slam dunk’ argument. Rather, starting from the notion that even slightly imperfect outcomes can extend to extinction, I was suggesting that you are trying to search an infinite space for a quark that fell out of your pocket some unknown amount of time ago whilst you were exploring said space. This can be summed up as ‘it is not probable that some mesa-optimizer selected by gradient descent will ensure a Good Outcome’.
This still does not mean that the only form of brain hacking is via highly immersive virtual reality. I recall the Tweet that this comment came from, and I interpreted it as a highly extreme and difficult form of brain hacking used to prove a point (the point being that if ASI could accomplish this it could easily accomplish psychological manipulation). Eliezer’s breaking out of the sandbox experiments circa 2010 (I believe?) are a good example of this.
Alternatively you can claim some semi-arbitrary but lower extinction risk like 35%, but you can make the same objections to a milder forecast like that. Why is assigning a 35% probability to an outcome more epistemically valid than assigning >90%? Criticizing forecasts based on their magnitude seems difficult to justify in my opinion; critiques should rely on argument alone.
I disagree with OP’s objections, too, but that’s explicitly not the point of this post. OP is giving us an outside take on how our communication is working, and that’s extremely valuable.
Typically, when someone says you’re not convincing them, “you’re being dumb” is itself a dumb response. If you want to convince someone of something, making the arguments clear is mostly your responsibility.
This is highly useful. Thank you so much for taking the time to write it!
It’s not worth debating the points you raise, since the point is you explaining to us where the explanation went wrong for you. That didn’t stop many people from doing it, of course :)
I agree strongly with your points about the communication style. It’s not possible to address every objection in a short piece, but it is possible to put forth the basic argument in clear and simple terms. I think the type of person who’s interested in AI safety typically isn’t focused on communicating with laypeople. And we need to get better.
Have you read https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message yet?
I had some similar thoughts to yours before reading that, but it helped me make a large update in favor of superintelligence being able to make magical-seeming feats of deduction. If a large number of smart humans working together for a long time can figure something out (without performing experiments or getting frequent updates of relevant sensory information), then a true superintelligence will also be able to.
I’ve got some object-level thoughts on Section 1.
It’d still need to do risk mitigation, which would likely entail some very high-impact power-seeking behavior. There are lots of ways things could go wrong even if its preferences saturate.
For example, it’d need to secure against the power grid going out, long-term disrepair, getting nuked, etc.
The AI doesn’t need to change or even fully understand its own goals. No matter what its goals are, high-impact power-seeking behavior will be the default due to needs like risk mitigation.
Figuring out sensible goals is only part of the problem, and the other parts of the problem are sufficient for alignment to be really hard.
In addition to the inner/outer alignment stuff, there is what John Wentworth calls the pointers problem. In his words: “I need some way to say what the values-relevant pieces of my world model are “pointing to” in the real world”.
In other words, all high-level goal specifications need to bottom out in talking about the physical world. That is… very hard, and modern philosophy still struggles with it. Not only that, it all needs to be solved in the specific context of a particular AI’s sensory suite (or something like that).
As a side note, the original version of the paperclip maximizer, as formulated by Eliezer, was partially an intuition pump about the pointers problem. The universe wasn’t tiled with normal paperclips; it was tiled with some degenerate physical realization of the conceptual category we call “paperclips”, e.g. maybe a tiny strand of atoms that kinda topologically maps to a paperclip.
Agreed. Removing all/most constraints on expected futures is the classic sign of the worst kind of belief. Unfortunately, figuring out the constraints left after contending with superintelligence is so hard that it’s easier to just give up. Which can, and does, lead to magical thinking.
There are lots of different intuitions about what intelligence can do in the limit. A typical LessWrong-style intuition is something like 10 billion broad-spectrum geniuses running at 1,000,000x speed. It feels like a losing game to bet against billions of Einsteins + Machiavellis + (insert highly skilled person) working for millions of years.
Additionally, LessWrong people (myself included) often implicitly think of intelligence as systematized winning, rather than IQ or whatever. I think that is a better framing, but it’s not the typical definition of intelligence. Yet another disconnect.
However, this is all intuition about what intelligence could do, not what a fledgling AGI will probably be capable of. This distinction is often lost during Twitter-discourse.
In my opinion, a more generally palatable thought experiment about the capability of AGI is:
What could a million perfectly-coordinated, tireless copies of a pretty smart, broadly skilled person running at 100x speed do in a couple years?
Well… enough to be dangerous. Maybe the crazy-sounding nanotech and brain-hacking stuff is the most likely scenario, but more mundane situations can still carry many of the arguments through.
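For a rough sense of scale, here is a minimal back-of-the-envelope sketch (just multiplication) using the numbers quoted above; the one-calendar-year horizon for the LessWrong-style intuition is my own assumption for illustration, since only the subjective "millions of years" is stated.

```python
# Back-of-the-envelope: effective "person-years" implied by the two thought
# experiments above. Uses only the quoted numbers; the one-calendar-year
# horizon for the LessWrong-style intuition is an assumption for illustration.

def effective_person_years(copies: float, speedup: float, calendar_years: float) -> float:
    """Subjective person-years produced by `copies` agents running at
    `speedup` times human speed for `calendar_years` of wall-clock time."""
    return copies * speedup * calendar_years

# "a million perfectly-coordinated ... copies ... at 100x speed ... a couple years"
mundane = effective_person_years(copies=1e6, speedup=100, calendar_years=2)

# "10 billion broad-spectrum geniuses running at 1,000,000x speed" (one calendar year assumed)
lesswrong_style = effective_person_years(copies=1e10, speedup=1e6, calendar_years=1)

print(f"palatable thought experiment: {mundane:.1e} person-years")         # 2.0e+08
print(f"LessWrong-style intuition:    {lesswrong_style:.1e} person-years")  # 1.0e+16
```

Even the mundane version works out to a couple hundred million person-years of coordinated effort.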
I think this feels like the right analogy to consider.
And in considering this thought experiment, I’m not sure trying to solve alignment is the only or best way to reduce risk. This hypothetical seems open to reducing risk by 1) better understanding how to detect these actors operating at large scale, and 2) researching resilient plug-pulling strategies.
I think both of those things are worth looking into (for the sake of covering all our bases), but by the time alarm bells go off it’s already too late.
It’s a bit like a computer virus. Even after Stuxnet became public knowledge, it wasn’t possible to just turn it off. And unlike Stuxnet, AI-in-the-wild could easily adapt to ongoing changes.