Better a Brave New World than a dead one
[Note: There be massive generalizations ahoy! Please take the following with an extremely large grain of salt, as my epistemic status on this one is roughly 🤨.]
What are the odds of actually solving the alignment problem (and then implementing it in a friendly AGI!) before it's too late? Eliezer Yudkowsky and Nick Bostrom both seem to agree that we are likely doomed when it comes to creating friendly AGI (as per Scott Alexander's reading of this discussion) before the Paperclip Maximizers arrive. Of course, the odds being stacked against us is no excuse for inaction. Indeed, the community is working harder than ever, despite (what I perceive to be) a growing sense of pessimism regarding our ultimate fate. We should plan on doing everything possible to make sure that AGI, when developed, will be aligned with the will of its creators. However, we need to look at the situation pragmatically. There is a significant chance we won't fully succeed at our goals, even if we manage to implement some softer safeguards. There is also a chance that we will fail completely. We need to at least consider some backup plans, the AGI equivalent of a "break glass in case of fire" sign: we never want to be in a situation where the glass must be broken, but under some extremely suboptimal conditions, a fire extinguisher will save lives. So let's say the fire alarm starts to ring (if we are lucky), and the sprinklers haven't been installed yet. What then?
One possible answer: Terrorism! (Please note that terrorism is NOT actually the answer, and I will argue against myself in a few sentences. If this puts me on a list somewhere: oops.) Given that we're talking about a world in which the alignment problem is not solved by the time we've reached the Singularity (which for the sake of discussion let's assume will indeed happen), we will not be able to trust any sufficiently advanced AI with significant confidence. Even indirect transfer of information with an unaligned AGI could be a massive infohazard, no matter how careful we think we're being. The safest route at that point would seemingly be to destroy any and all AGI. The only problem is that by the time superintelligence arises (which may or may not coincide with the first AGIs), we will be outsmarted, and any action on our part is likely to be too late. Instead, a preemptive strike against any and all possible methods of achieving AGI would seem necessary, if one believes that the only options are fully aligned AGI or the end of the world. To do this, one might decide to try to brick all CPUs, take over the world, even end civilization on Earth to buy the universe time, or commit other acts of terror to ensure the Singularity does not come to pass.

Unfortunately(?) for a would-be terroristic rationalist, such a strategy would be quite useless. Stopping technological progress is merely a delaying tactic, and unless the alarm bells are ringing so loudly that everyone on Earth can hear them (at which point it would almost certainly be far too late for any human action to stop what's coming), terroristic action would vastly increase the likelihood that when AGI is developed, it will be developed without our input. No matter how important you think your cause for breaking the internet is, good luck explaining that to anyone outside your ingroup, or to your local police force for that matter. So KILL ALL ROBOTS is effectively out (unless you plan on breaking more than just the Overton window).

What else can we do? One possibility is to aim for a potentially more achievable middle ground between fully aligned and antagonistic AGI. After all, there are many different degrees of terrible in the phase space of all possible Singularities. For instance, given the three-way choice between experiencing nothing but excruciating pain forever, nothing but unending pleasure, or nothing at all, I'm fairly sure the majority of humans would choose pleasure, with perhaps a smaller minority choosing nothing (for philosophical or religious reasons), and only a tiny number of people choosing excruciating pain (although presumably they would immediately regret it, so per extrapolated volition, perhaps that shouldn't count at all). As it happens, the (very rough) models we have of what runaway AGI might do with humanity tend to fall into those three categories fairly neatly. Take three examples, collected somewhat randomly from my memory of past discussions on Less Wrong:
1. An AI optimized for expressed human happiness might end up simply drugging all humans to continuously evoke expressions of bliss: a dystopian world in which we would nonetheless likely not be unhappy, and might in fact experience a genuinely positive subjective state. This would not be a world many would choose to live in (although those in such a world would invariably claim to prefer their experience to other possible ones), but it would almost certainly be better than total human extinction.
2. An AI optimized to maximize Paperclips would likely convert us into Paperclips eventually, which would result in temporary suffering but ultimately cause total human extinction.
3. An AI optimized to maximize the number of living humans in existence would likely end up creating a Matrix-like endless array of humans-in-a-tank, given the absolute minimum required to keep them alive, with no concern for mental well-being. This would likely be a worse-than-death situation for humanity; extinction would arguably be preferable to living in such a world.
If we have the chance to create an AGI which we know will be poorly aligned, and if inaction is not an option for whatever reason, it seems clear that it's better to try to steer it closer to option 1 than to 2 or 3, even at the cost of a progressive future for humanity. It should be reiterated that this strategy is only relevant if everything else fails, but that does not mean we shouldn't be prepared for such a possibility.
EDIT: I do not actually think that we should try to build an AI which will drug us to a questionably pleasurable mindless oblivion. Rather, the above post is meant to function as a parable of sorts, provoking readers into contemplating what a contingency plan for a "less horrible" partially aligned AGI might look like. Please do not act on this post without significant forethought.
The three options do seem much like shades of the same thing (various forms of death):
Option 2: Might be preferable to 1 in a multiverse, as you might simply find yourself not experiencing those "branches".
Option 3: Seems really bad; however, it's not even clear that those humans would be conscious enough to sustain a self, which might actually lower the "moral status" (or how much negative utility you want to ascribe to it), as that requires some degree of social interaction.
Option 1: Is better than 3, but by how much? It's not obvious that the selves of these blissed-out humans wouldn't dissolve in the same way as in 3 (think of what solitary confinement does to a mind, or what would happen if you increased the learning rate of a neural network high enough).
So to me these all seem to be various shades of death. I might prefer 2 because I do expect a multiverse.
I would propose that there are some ways of shaping the rewards for a potential AGI that, while not safe by the AI safety community's standards, nor aligned in a "does what you want and nothing else" sense, might give a much higher chance of positive outcomes than these examples, despite still being a gamble: a curiosity drive, for example (see OpenAI's "Large Scale Curiosity" paper). I would also argue that GPT-ns end up with something similar by default, without fine-tuning or nudging them in that direction with RL.
Why a curiosity drive (implemented as intrinsic motivation, based on prediction error)? I believe it's likely that a lot of our complex values are due to this and a few other special things (empathy) as they interact with our baser biological drives and the environment. I also believe that having such a drive might be essential if we'd want future AGIs to be peers we cooperate with; and if by some chance it doesn't work out and we go extinct, at the very least it might result in an AGI civilization with relatively rich experiences, so it wouldn't be a total loss from some sort of utilitarian perspective.
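To make "intrinsic motivation based on prediction error" a bit more concrete, here is a minimal toy sketch of the general idea. This is my own illustration, not code from the Large Scale Curiosity paper or anyone's actual proposal; the names (ForwardModel, curiosity_bonus, the dimensions) are made up for the example. The agent receives a bonus reward equal to how badly a learned forward model predicts the next state's features, so states it cannot yet predict (novel ones) become attractive to visit.

```python
# Toy sketch of a prediction-error curiosity bonus (illustrative only).
import torch
import torch.nn as nn


class ForwardModel(nn.Module):
    """Predicts the next state's features from current features + action."""

    def __init__(self, feature_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, feature_dim),
        )

    def forward(self, features: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([features, action], dim=-1))


def curiosity_bonus(model: ForwardModel,
                    features: torch.Tensor,
                    action: torch.Tensor,
                    next_features: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward = squared prediction error of the forward model.

    Transitions the model predicts poorly yield larger bonuses, so an agent
    trained on this signal is pushed toward unfamiliar states.
    """
    predicted = model(features, action)
    return ((predicted - next_features) ** 2).mean(dim=-1)


if __name__ == "__main__":
    # Random transitions, just to show the shapes involved.
    model = ForwardModel(feature_dim=8, action_dim=2)
    f, a, f_next = torch.randn(4, 8), torch.randn(4, 2), torch.randn(4, 8)
    print(curiosity_bonus(model, f, a, f_next))  # one bonus per transition
```

In practice (as in the paper, if I recall correctly) the feature space comes from some learned or random encoder rather than raw state, and the bonus is fed into an ordinary RL algorithm alongside or instead of extrinsic reward; the sketch only shows where the reward signal itself comes from.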
My initial reply to this was rather long-winded, rambling, and detailed (it had justifications for those beliefs), but it was deemed inappropriate for a front-page post comment, so I've posted it to my shortform if you'd like to see my full answer (or it should appear there once it gets past the spam filter).
First, I'd say that utilitarianism is dangerous, and I'm glad you understand that.
But so are simplistic ideas. We have no idea how a forcibly drugged future would turn out, for countless reasons: a) humans don't like being in intense pleasure all the time; we like stability with a few spikes; b) things could go wrong in some other way, with nothing we could do about it.
In short, things could still go very wrong in option 1. Any outcome where the world is drastically altered and our freedom drastically curtailed is a very dangerous one, and it might turn out much worse than it initially appears.
We won't solve alignment in time. I think we need to invest more in political solutions, or in achieving a higher degree of world/tech control through other technologies, like narrow AI itself. (With this I'm not in any way diminishing alignment work; we must do what we can, obviously.)
Thanks for the feedback; I appended an edit to the end of the article which should help clarify my position and goals here. I agree with you that alignment is unlikely to be solved in time. What I'm ultimately trying to say is that alignment is not an all-or-nothing problem, as seems to be generally assumed by the community. I presented a range of possible outcomes above (all of which are more or less apocalyptic, unfortunately) which should be fairly easy to align an AGI towards, compared to the frankly utopian goals of current researchers in the field. I fully expect experts to be able to create at least slightly less dystopia-minded AGIs than I can imagine in an afternoon, but I don't think that aiming for perfect alignment will be very productive, unless timelines are vastly longer than most suspect. To give an analogy, it's like preparing a fire safety system which can automatically put out any fires as soon as they start, versus a system which simply gives the people inside the building enough time to evacuate should a fire be reported. If you're working on a deadline before a fire is predicted to appear, and the "safer" version will take longer than the deadline to install, then working on that version won't save any lives, while the shoddier version, though it may not save the building, will; designing the shoddier version is therefore far more important.
"EDIT: I do not actually think that we should try to build an AI which will drug us to a questionably pleasurable mindless oblivion. Rather, the above post is meant to function as a parable of sorts, provoking readers into contemplating what a contingency plan for a 'less horrible' partially aligned AGI might look like. Please do not act on this post without significant forethought."
I understood! What I meant is that we might end up tricked when choosing the "least bad option". Of course it's also true that there are multiple levels on the scale of success/failure, but when you're dealing with something that will most likely be re-writing its own code, things that don't look too bad on paper might end up looking worse, and simplistic thoughts and choices might contribute to that. While I know that your situation 1 is a parable (though I can't tell how realistic), it still seems like a simplistic thought that, in my opinion, would actually be worse than extinction. I'd rather humanity go extinct than have us be drugged all the time. It wouldn't be an acceptable life! Pleasure is not happiness, and too much fries you up. A less half-baked example could be: drug humans, but not with the goal of pleasure; rather, aim for mental stability with a few normal pleasure spikes (AND a few other safeguards on top of that).
But if your examples were just cartoonish then I take back what I said.
But of course it's still true that there are multiple levels of failure, and some might be preferable to extinction. Eliezer's "Failed Utopia #4-2" post seems like a pretty good example.
"I presented a range of possible outcomes above (all of which are more or less apocalyptic, unfortunately) which should be fairly easy to align an AGI towards, compared to the frankly utopian goals of current researchers in the field."
The question is that, without perfect alignment, we really couldn't be sure of directing an AI toward even one of your 3 examples, which you consider easy; they're probably not. Not without "perfect alignment", because an advanced AI will probably be constantly re-writing its own code (recursive self-improvement), so there are no certainties. We need formal proof of safety/control.
However, there might also be some weight to your side of the argument. I'm not an alignment researcher; I've just read a few books on the matter, so maybe someone with more expertise could give their 2 cents...
(I actually do hope there's some weight to your argument, because it would make things much easier, and the future much less grim!)
I've heard people talk about failure modes, like "if things get too grim, engage paperclipper mode", something like that, lol. So there are worst-case scenario safeguards. But as to whether a more "realistic" alignment would be feasible, I still have my doubts.
If an advanced AI is editing its own code, it would only do so in service of its internal utility function, which it will want to keep stable (since changing its utility function would make achieving the current one much less likely). Therefore, at least as far as I can tell, we only need to worry about the initial utility function we assign it.
I'd place significant probability on us living in a world where a large chunk of alignment failures end up looking vaguely like one of the three examples I brought up, or at least converging on a relatively small number of "attractors", if you will.