Anirandis’s Shortform
Since it’s my shortform, I’d quite like to just vent about some stuff.
I’m still pretty scared about a transhumanist future going quite wrong. It simply seems to me that there’s quite the disjunction of paths to “s-risk” scenarios: generally speaking, any future agent that wants to cause disvalue to us (or to an agent that empathises with us) would bring about an outcome that’s Pretty Bad by my lights. Like, it *really* doesn’t seem impossible that some AI decides to pre-commit to doing Bad if we don’t co-operate with it; or our AI ends up in some horrifying conflict-type scenario, which could lead to Bad outcomes as hinted at here; etc. etc.
Naturally, this kind of outcome is going to be salient because it’s scary; but even then, I struggle to believe that I’m more than moderately biased. The distribution of possibilities seems somewhat trimodal: either we maintain control and create a net-positive world (hopefully we’d be able to deal with the issue of people abusing uploads of each other); we all turn to dust; or something grim happens. And the fact that some very credible people (within this community at least) also think this kind of thing has reasonable probability pushes me further towards accepting that I just need to somehow deal with these scenarios being plausible, rather than trying to convince myself that they’re unlikely. But I remain deeply uncomfortable trying to do that.
Some commentators who seem to consider such scenarios plausible, such as Paul Christiano, also subscribe to the naive view suggested by energy-efficiency arguments about pleasure and suffering: that the worst possible suffering is likely no worse than the greatest possible pleasure is good, and that this may also be the case for humans. Even if that’s true, and I’m sceptical, I still think I’d be too risk-averse: in that world I wouldn’t accept a 90% chance of eternal bliss with a 10% chance of eternal suffering. I don’t think I hold suffering-focused views; I think there’s a level of happiness that can “outweigh” even extreme suffering. But when you translate it into probabilities, I become deeply uncomfortable with even a 0.01% chance of bad stuff happening to me, particularly when the only way to avoid this gamble is to permanently stop existing. Perhaps something an OOM or two lower and I’d be more comfortable.
I’m not immediately suicidal, to be clear. I wouldn’t classify myself as ‘at-risk’. But I nonetheless find it incredibly hard to find solace. There’s a part of me that hopes things get nuclear, just so that a worse outcome is averted. I find it incredibly hard to care about other aspects of my life; I’m totally apathetic. I started to improve and got mid-way through the first year of my computer science degree, but I’m starting to feel like it’s gotten worse. I’d quite like to finish my degree and actually meaningfully contribute to the EA movement, but I don’t know if I can at this stage. I’m guessing it’s a result of me becoming more pessimistic about the worst outcomes resulting in my personal torture, since that’s the only real change that’s occurred recently. Even before I became more pessimistic I still thought about these outcomes constantly, so I don’t think it’s just a case of me thinking about them more.
I take sertraline but it’s beyond useless. Alcohol helps, so at least there’s that. I’ve tried quitting thinking about this kind of thing—I’ve spent weeks trying to shut down any instance where I thought about it. I failed.
I don’t want to hear any over-optimistic perspectives on these issues. I’d greatly appreciate any genuine, sincerely held opinions on them (good or bad), or advice on dealing with the anxiety. But I don’t necessarily need or expect a reply; I just wanted to get this out there. Even if nobody reads it. Also, thanks a fuckton to everyone who was willing to speak to me privately about this stuff.
Sorry if this type of post isn’t allowed here, I just wanted to articulate some stuff for my own sake somewhere that I’m not going to be branded a lunatic. Hopefully LW/singularitarian views are wrong, but some of these scenarios aren’t hugely dependent on an imminent & immediate singularity. I’m glad I’ve written all of this down. I’m probably going to down a bottle or two of rum and try to forget about it all now.
The world sucks, and the more you learn about it, the worse it gets (that is, the worse your map gets; the territory has always been like that). Yet, somehow, good things sometimes happen too, despite all the apparent reasons why they shouldn’t. We’ll see.
My advice, if advice is what you’re looking for, is to distract yourself from all this stuff. It’s heavy stuff to deal with, and going too fast can be too much; I think this is generally true for dealing with any type of hard thing. If it’s overwhelming, force yourself away from it so you can’t be so overwhelmed. That might seem crazy from the inside perspective of worrying about this stuff, but it’s actually needed: my model of what overwhelms folks is that there hasn’t been time to integrate these ideas into your mind, and it’s complaining loudly that you’re going too fast (it doesn’t say it quite that way, but I think this is a useful framing). Stepping away, focusing on other things for a while, and slowly coming back to the ideas is probably the best way to be able to engage with them in a psychologically healthy way that doesn’t overwhelm you.
(Also, don’t worry about comparing yourself to others who were able to think about these ideas more quickly with less integration time needed. I think the secret is that those people did all that kind of integration work earlier in their lives, possibly as kids and without realizing it. For example, I grew up very aware of the cold war and probably way more worried about it than I should have been, so other types of existential risks were somewhat easier to handle because I had already made my peace with the reality of various terrible outcomes. YMMV.)
Cheers for the reply! :)
I do try! When thinking about this stuff starts to overwhelm me I can try to put it all on ice, though usually some booze is required to be able to do that, TBH.
Lurker here; I’m still very distressed after thinking about some futurism/AI stuff & worrying about possibilities of being tortured. If anyone’s willing to have a discussion on this stuff, please PM!
Just a note: while there are legitimate reasons to worry about things, sometimes people worry simply because they are psychologically prone to worry (they just switch from one convenient object of worry to another). The former case can be solved by thinking about the risks and possible precautions; the latter requires psychological help. Please make sure you know the difference, because no amount of rationally discussing risks and precautions can help with the fear that is fundamentally irrational (it usually works the opposite way: the more you talk about it, the more scared you are).
It seems to me that ensuring we can separate an AI in design space from worse-than-death scenarios is perhaps the most crucial thing in AI alignment. I don’t at all feel comfortable with AI systems that are one cosmic ray (or, perhaps more plausibly, one human screw-up, e.g. this sort of thing) away from a fate far worse than death. Or maybe a human-level AI makes a mistake and creates a sign-flipped successor. Perhaps there’s some sort of black swan possibility that nobody realises. I think it’s absolutely critical that we have a robust mechanism in place to prevent something like this from happening regardless of the cause; sure, we can sanity-check the system, but that won’t help when the issue arises after we’ve sanity-checked it, as is the case with cosmic rays or some human errors (like Gwern’s example, which I linked). We need ways to prevent this sort of thing from happening *regardless* of the source.
Some of the proposed mitigations seem promising, in particular Eliezer’s suggestion of assigning a sort of “surrogate goal” that the AI hates more than torture, but not enough to devote all of its energy to preventing. But this would only work when the *entire* reward is what gets flipped: with how much confidence can we rule out, say, a localised sign flip in some specific part of the AI, one that leads to the system terminally valuing something bad but that changes nothing else (so the sign on the “surrogate” goal stays negative)? And can we even be confident that the AI’s development team will implement something like this, and that it will work as intended?
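To check my own understanding, here’s a minimal toy sketch of the case the surrogate term is supposed to handle, i.e. the *whole* utility getting negated. The numbers, outcome labels and Python framing are all made up by me for illustration; this isn’t anyone’s actual proposal or implementation.

```python
# Toy numbers only. V is the intended reward over outcomes; W is the "surrogate"
# term: the AI hates some arbitrary, otherwise low-stakes outcome (here labelled
# "paperclips") even more than it hates the torture outcome.
V = {"flourishing": 10, "torture": -100}
W = {"paperclips": -1000}  # every other outcome scores 0 on W

def utility(outcome, flipped=False):
    """Composite utility U = V + W; flipped=True models the whole of U being negated."""
    u = V.get(outcome, 0) + W.get(outcome, 0)
    return -u if flipped else u

outcomes = ["flourishing", "torture", "paperclips"]
print(max(outcomes, key=utility))                             # normal case -> "flourishing"
print(max(outcomes, key=lambda o: utility(o, flipped=True)))  # whole-U flip -> "paperclips", not "torture"
```

In this (very simplified) picture the whole-utility flip gets caught: the maximiser of −U goes after the benign surrogate rather than torture. The partial-flip worry above is exactly the case it doesn’t cover.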
An FAI that’s one software bug or screw-up in a database away from AM is a far scarier possibility than a paperclipper, IMO.
Perhaps malware could be another risk factor in the type of bug I described here? Not sure.
I’m still a little dubious of Eliezer’s solution to the problem of separation from hyperexistential risk; if we had U = V + W where V is a reward function & W is some arbitrary thing it wants to minimise (e.g. paperclips), a sign flip in V (due to any of a broad disjunction of causes) would still cause hyperexistential catastrophe.
Or what about the case where, instead of maximising −U, the values that the reward function/model gives for each “thing” are multiplied by −1? E.g. the AI system gets 1 point for wireheading and −1 for torture, and some weird malware or human screw-up (in the reward model or some relevant database) flips the sign on each individual value. The AI now maximises U = W − V.
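To make that concrete, here’s the same sort of toy sketch as above (again, made-up numbers purely for illustration): the reward model’s stored values get negated while the surrogate term is left untouched, so the agent effectively maximises U = W − V.

```python
# Toy numbers only. Here just the reward model V is corrupted: malware or a
# database error multiplies each of its stored values by -1, while the
# surrogate term W is untouched and still reads as strongly negative.
V = {"flourishing": 10, "wireheading": 1, "torture": -100}
W = {"paperclips": -1000}

def corrupted_utility(outcome):
    # The agent now effectively maximises U = W - V.
    return W.get(outcome, 0) - V.get(outcome, 0)

outcomes = ["flourishing", "wireheading", "torture", "paperclips"]
print(max(outcomes, key=corrupted_utility))  # -> "torture": the surrogate never outbids it here
```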
This seems a lot more nuanced than *just* avoiding cosmic rays; and the potential consequences of a hellish “I Have No Mouth, and I Must Scream”-type outcome are far worse than human extinction. I’m not happy with *any* non-negligible probability of this happening.