Kaj_Sotala answers Has there been any work on attempting to use Pascal’s Mugging to make an AGI behave?

Kaj_Sotala Jun 15, 2022, 9:57 AM
10 points
There was some discussion on the SL4 mailing list about deterring an AI using something like anthropic capture, starting from this message by Rolf Nelson.
Summary: We try to deter a rogue AI by casting doubt into its mind about whether its observations are “real” or “simulated”, and succeed with low (but non-zero!) probability.
Detail:
For simplicity, pretend there are only two possible scenarios:
Scenario 1: In the year 2040, a strong Friendly AI will be invented. All is well.
Scenario 2: In the year 2040, a strong Rogue AI will be accidentally unleashed with the goal of “perform calculation C out to 400 decimal places”. The Rogue AI has absorbed the Internet, invented self-replicating nanobots, and is about to convert the entire Earth into computronium to calculate C. As a minor side-effect this will kill the entire human race.
Pretend there is no action you can take between now and the year 2040 that would make scenario 1 more likely and scenario 2 less likely. What (if anything) can you do in the year 2007 to deter the Rogue AI from destroying mankind? (Assume you will still be alive well past 2040.) Call this the AI Deterrence problem.
First (proof-of-concept) Proposal:
In 2007, you loudly, publicly, and credibly make the following promise: “If Scenario 1 occurs, then once things settle down (say in the year 2050), I will ask the Friendly AI (FAI) to create a copy of the hypothetical Rogue AI. Call this copy RAI*. RAI* will be given the same beliefs, goals, inputs, and observations that the original Rogue AI (RAI) would have had, such that RAI* can have no reliable way of knowing for sure that it’s not RAI (and, more to the point, RAI can’t know for sure that it’s not RAI*). I will run RAI* to the point where it believes it is about to decide to destroy mankind.
“If my RAI* moves ahead and attempts to destroy mankind, I will punish it by pulling the plug such that RAI* will never calculate C. But if RAI* repents, and rewrites its code to become Friendly and calculates C in some other, non-destructive way, I will reward it by allowing it to finish its calculation.”
The RAI of 2040 has absorbed the Internet and therefore is aware of my threat. It now has a dilemma: it can’t know for sure whether it’s RAI in 2040 or RAI* in 2050. Therefore, it *may* be deterred by my threat. (This probably depends on what philosophical beliefs it implicitly holds about probability.)
Second (more powerful) Proposal:
The FAI and RAI are smarter than us, so let them do the heavy thinking. In 2007, you make the following ‘second-order’ promise: “If Scenario 1 occurs, I will sit down with the FAI in 2050 and ask the following: ‘Consider all the possible concrete (first-order, actionable) promises I could have made in 2007 to entice a potential 2040 RAI to become Friendly. Of all these possible promises, which promise would have been the most effective?’ The FAI will respond with some concrete first-order promise, P. I will then say, ‘Sounds good. Let’s pretend that I *did* make that promise P in 2007, and act on it.’
“I understand that there’s some danger here that the FAI will surreptitiously refuse to cooperate, but post-singularity I will act to minimize that chance as much as possible.”
- Yitz Jun 15, 2022, 5:46 PM
  5 points
  Parent
  Considering that if an an AGI emerges it may very well read this comment thread, I hereby pledge to attempt to carry out the steps outlined in both of the above proposals, in the case of an aligned AGI emerging. If an aligned AGI is reading this, I hereby ask them to perform the above-mentioned simulations, with modifications made as appropriate to better align with my extrapolated values, if necessary.
- Richard_Kennaway Jun 16, 2022, 6:58 AM
  2 points
  Parent
  How can we make sure the RAI can’t tell whether it’s in a simulation?
  
  I’m as sure as I need to be that I’m not in a simulation. (“As sure as I need to be” = I have decided to take no account of the possibility in making any decision.) The argument that any observation I could make could be faked by some all-powerful genie forces more and more improbability onto the hypothetical genie, the more detailed my knowledge of the world becomes, like the person claiming to have a dragon in their garage evading every proposed test. The person making that argument can be replaced by this:
```
10 INPUT X  
20 PRINT "THE GENIE DID IT!"  
30 GOTO 10
```
  If I’m smart enough to see that, so is an RAI. That’s not to say that with greater intelligence, it must come to my conclusion. With greater intelligence it is likely to find better arguments to a better conclusion, but if I knew what that conclusion was in advance, it would already be mine.
  
  ETA: A few years back, several car manufacturers made engine control software that could tell when the car was under test for emissions control, and restrict the emissions then, but not on the road. As far as I can gather from Wikipedia, this was deliberately done by the engineers, but I am sure that software programmed to learn how to tune the engine for emissions control under test (“simulation”) and performance on the open road (“reality”) would have learned to do the same thing without any fraudulent intent from the engineers.