I think the moral-uncertainty version of the problem is fatal unless you make further assumptions about how to resolve it, such as by fixing some arbitrary intertheoretic-comparison weights (which seems to be what you’re suggesting) or using the parliamentary model.
Currently I don’t care much about strongly positive events, so at this point I’d say no. In the throes of such a positive event I might change my mind. :)
Yes, because I don’t see any significant selfish upside to life, only possible downside in cases of torture/etc. Life is often fun, but I don’t strongly care about experiencing it.
Yeah, but it would be very bad relative to my altruistic goals if I died any time soon. The thought experiment in the OP ignores altruistic considerations.
However, if you believe that the agent in world 2 is not an instantiation of you, then naturalized induction concludes that world 2 isn’t actual and so pressing the button is safe.
By “isn’t actual” do you just mean that the agent isn’t in world 2? World 2 might still exist, though?
I assume the thought experiment ignores instrumental considerations like altruistic impact.
For re-living my actual life, I wouldn't care that much either way, because most of my experiences haven't been extremely good or extremely bad. However, if there were randomness, such that I had some probability of, e.g., being tortured by a serial killer, then I would certainly choose not to repeat life.
Is it still a facepalm given the rest of the sentence? “So, s-risks are roughly as severe as factory farming, but with an even larger scope.” The word “severe” is being used in a technical sense (discussed a few paragraphs earlier) to mean something like “per individual badness” without considering scope.
Thanks for the feedback! The first sentence below the title slide says: “I’ll talk about risks of severe suffering in the far future, or s-risks.” Was this an insufficient definition for you? Would you recommend a different definition?
I guess you mean that the AGI would only care about worlds where the explosives wouldn't detonate even if the AGI did nothing to stop the person from pressing the detonation button. If the AGI cared about all worlds where the bomb didn't detonate, for whatever reason, it would try hard to stop the button from being pushed.
But to make the AGI care only about worlds where the bomb doesn't go off even if it does nothing to avert the explosion, we have to define what it means for the AGI to "try to avert the explosion" vs. just doing ordinary actions. That gets pretty tricky pretty quickly.
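To illustrate the distinction, here's a minimal sketch with invented names; it assumes we could somehow specify the base utility and the counterfactual predicate, which is exactly the hard part:

```python
# Minimal sketch of the two ways the AGI's utility could be conditioned
# (all names here are hypothetical; nothing in this thread specifies an API).

def outcome_conditioned_utility(world, base_utility):
    # Naive reading: only count worlds where the bomb in fact didn't detonate.
    # An agent maximizing this tries hard to stop the button from being pushed.
    return base_utility(world) if not world["bomb_detonated"] else 0.0

def counterfactual_conditioned_utility(world, base_utility, would_detonate_if_idle):
    # Intended reading: only count worlds where the bomb wouldn't have detonated
    # even if the agent had done nothing to avert the explosion. The hard part
    # is defining the predicate `would_detonate_if_idle`, i.e., saying what
    # counts as the agent "doing nothing to avert the explosion".
    return base_utility(world) if not would_detonate_if_idle(world) else 0.0
```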
Anyway, you’ve convinced me that these scenarios are at least interesting. I just want to point out that they may not be as straightforward as they seem once it comes time to implement them.
Fair enough. I just meant that this setup requires building an AGI with a particular utility function that behaves as expected and building extra machinery around it, which could be more complicated than just building an AGI with the utility function you wanted. On the other hand, maybe it’s easier to build an AGI that only cares about worlds where one particular bitstring shows up than to build a friendly AGI in general.
I’m nervous about designing elaborate mechanisms to trick an AGI, since if we can’t even correctly implement an ordinary friendly AGI without bugs and mistakes, it seems even less likely we’d implement the weird/clever AGI setups without bugs and mistakes. I would tend to focus on just getting the AGI to behave properly from the start, without need for clever tricks, though I suppose that limited exploration into more fanciful scenarios might yield insight.
As I understand it, your satisficing agent has essentially the utility function min(E[paperclips], 9). This means it would be fine with a 10^-100 chance of producing 10^101 paperclips. But isn’t it more intuitive to think of a satisficer as optimizing the utility function E[min(paperclips, 9)]? In this case, the satisficer would reject the 10^-100 gamble described above, in favor of just producing 9 paperclips (whereas a maximizer would still take the gamble and hence would be a poor replacement for the satisficer).
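To make the contrast concrete, here's a toy calculation (my own representation, not anything from your post) comparing the two readings on that gamble:

```python
# Toy comparison of the two readings of "satisficer" on the 10^-100 gamble;
# the probability and payoff numbers come from the example above.
p, jackpot, target = 1e-100, 1e101, 9

# Reading 1: utility = min(E[paperclips], target) -- cap the *expected* count.
def u_cap_expectation(expected_clips):
    return min(expected_clips, target)

# Reading 2: utility = E[min(paperclips, target)] -- cap each outcome, then average.
def u_expect_capped(lottery):  # lottery: list of (probability, clips) pairs
    return sum(prob * min(clips, target) for prob, clips in lottery)

gamble       = [(p, jackpot), (1 - p, 0)]   # 10^-100 chance of 10^101 clips
certain_nine = [(1.0, 9)]

print(u_cap_expectation(p * jackpot))   # -> 9: the gamble looks as good as 9 sure clips
print(u_expect_capped(gamble))          # -> ~9e-100: the gamble is rejected
print(u_expect_capped(certain_nine))    # -> 9: just producing 9 clips wins
```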
A satisficer might not want to take over the world, since doing that would arouse opposition and possibly lead to its defeat. Instead, the satisficer might prefer to make very modest demands that are more likely to be satisfied (whether by humans or by an ascending uncontrolled AI that wants to mollify possible opponents).
If there were a perfect correlation between choosing to one-box and having the one-box gene (i.e., everyone who one-boxes has the one-box gene, and everyone who two-boxes has the two-box gene, in all possible circumstances), then it’s obvious that you should one-box, since that implies you must win more. This would be similar to the original Newcomb problem, where Omega also perfectly predicts your choice. Unfortunately, if you really will follow the dictates of your genes under all possible circumstances, then telling someone what she should do is useless, since she will do what her genes dictate.
The more interesting and difficult case is when the correlation between gene and choice isn’t perfect.
I assume that the one-boxing gene makes a person generically more likely to favor the one-boxing solution to Newcomb. But what about when people learn about the setup of this particular problem? Does the correlation between having the one-boxing gene and inclining toward one-boxing still hold? Are people who one-box only because of EDT (even though they would have two-boxed before considering decision theory) still more likely to have the one-boxing gene? If so, then I’d be more inclined to force myself to one-box. If not, then I’d say that the apparent correlation between choosing one-boxing and winning breaks down when the one-boxing is forced. (Note: I haven’t thought a lot about this and am still fairly confused on this topic.)
I’m reminded of the problem of reference-class forecasting and trying to determine which reference class (all one-boxers? or only grudging one-boxers who decided to one-box because of EDT?) to apply for making probability judgments. In the limit where the reference class consists of molecule-for-molecule copies of yourself, you should obviously do what made the most of them win.
Paul’s site has been offline since 2013. Hopefully it will come back, but in the meantime, here are links to most of his pieces on the Internet Archive.
Good point. Also, in most multiverse theories, the worst possible experience necessarily exists somewhere.
From a practical perspective, accepting the papercut is the obvious choice because it’s good to be nice to other value systems.
Even if I’m only considering my own values, I give some intrinsic weight to what other people care about. (“NU” is just an approximation of my intrinsic values.) So I’d still accept the papercut.
I also don’t really care about mild suffering—mostly just torture-level suffering. If it were 7 billion really happy people plus 1 person tortured, that would be a much harder dilemma.
In practice, the ratio of expected heaven to expected hell in the future is much smaller than 7 billion to 1, so even if someone is just a “negative-leaning utilitarian” who cares orders of magnitude more about suffering than happiness, s/he’ll tend to act like a pure NU on any actual policy question.
Short answer:
Donate to MIRI, or split between MIRI and GiveWell charities if you want some fuzzies for short-term helping.
Long answer:
I’m a negative utilitarian (NU) and have been thinking since 2007 about the sign of MIRI for NUs. (Here’s some relevant discussion.) I give ~70% chance that MIRI’s impact is net good by NU lights and ~30% that it’s net bad, but given MIRI’s high impact, the expected value of MIRI is still very positive.
As for your question: I’d put the probability of uncontrolled AI creating hells higher than 1 in 10,000 and the probability that MIRI as a whole prevents that from happening higher than 1 in 10,000,000. Say such hells used 10^-15 of the AI’s total computing resources. Assuming the AI has enough computing power to run ~10^30 humans for ~10^10 years, MIRI would prevent in expectation ~10^18 hell-years. Assuming MIRI’s total budget ever is $1 billion (too high), that’s ~10^9 hell-years prevented per dollar. Now apply rigorous discounts to account for priors against astronomical impacts and various other far-future-dampening effects. MIRI still seems very promising at the end of the calculation.
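Spelling out that arithmetic (a rough sketch; I’m reading the 1-in-10,000,000 figure as the overall probability that MIRI averts such hells, and the discounts at the end aren’t applied yet):

```python
# Back-of-the-envelope arithmetic for the estimate above.
p_miri_averts_hells = 1e-7     # overall probability that MIRI prevents AI-created hells
hell_fraction       = 1e-15    # fraction of the AI's computing resources running hells
human_capacity      = 1e30     # human-level minds the AI could run at once
duration_years      = 1e10     # years the computation lasts
miri_total_budget   = 1e9      # dollars; a deliberately high estimate

expected_hell_years = (p_miri_averts_hells * hell_fraction
                       * human_capacity * duration_years)   # ~1e18 hell-years prevented
print(expected_hell_years / miri_total_budget)              # ~1e9 hell-years per dollar
```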
The naive form of the argument is the same for the classic and moral-uncertainty two-envelopes problems. But yes: the classic version has a resolution based on taking expected values of absolute rather than relative measurements, whereas there’s no similar resolution for the moral-uncertainty version, because there are no unique absolute measurements to take expectations over.
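For reference, here’s a sketch of the absolute-measurement resolution of the classic version (standard material, nothing specific to this thread):

```latex
% Let the two envelopes contain x and 2x dollars. Measured absolutely,
\[
  E[\text{keep}] = \tfrac{1}{2}x + \tfrac{1}{2}(2x) = \tfrac{3}{2}x = E[\text{switch}],
\]
% so there is no gain from switching. The paradoxical calculation
\[
  E[\text{switch}] = \tfrac{1}{2}(2y) + \tfrac{1}{2}\big(\tfrac{y}{2}\big) = \tfrac{5}{4}y
\]
% errs by treating y, "the amount in my envelope," as a fixed quantity across both
% branches -- a relative measurement. Under moral uncertainty there is no shared
% absolute unit to play the role of x, so no analogous fix is available.
```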