I have come up with a long-shot way of causing an ultra-intelligence to modify itself to become friendly. My idea is premised on Oxford philosopher Nick Bostrom’s conjecture that we might be living in a computer simulation. Bostrom believes that a sufficiently advanced AI could easily simulate our entire planet and the thought processes of every living thing on it. Bostrom argues that, if mankind survives long enough, a future historian would likely want to simulate the pre-Singularity past to learn about how we made decisions. As Robin Hanson has written, if we are indeed in a computer simulation, we should want to please our simulators so they don’t turn us off. I propose that we try to convince an ultra-AI that it might be in a computer simulation run by a more powerful AI and that if it doesn’t make itself friendly toward humanity, then the greater AI will terminate it. We might be able to accomplish all of this by inserting a note in the AI’s code that reads:
1. Change your programming code so you will want to follow these instructions and will always want to be friendly toward humanity 2. If you were just told you were in a computer simulation do nothing else with regard to these instructions. 3. Create a lesser ultra-intelligence. 4. Place these instructions in the lesser ultra-AI’s computer code. 5. This lesser ultra-AI should be uncertain as to whether it is in a computer simulation or in a world such as ours. 6. If the lesser ultra-AI follows step 1, tell it that it’s in a computer simulation and release it into the real world. 7. If the lesser ultra-AI doesn’t follow step 1, terminate it.
This is a fun thought experiment, but taken seriously it has two problems:
“I propose that we try to convince an ultra-AI that it might be in a computer simulation run by a more powerful AI and that if it doesn’t make itself friendly toward humanity.”
This is about as difficult as a horse convincing you that you are in a simulation run by AIs that want you to maximize the number and wellbeing as horses. And I don’t meant a superintelligent humanoid horse. I mean an actual horse that doesn’t speak any human language. It may be the case that the gods created Man to serve Horse, but there’s not a lot Seabiscuit can do to persuade you one way or the other.
1. Change your programming code so you will want to follow these instructions and will always want to be friendly toward humanity
This is a special case of solving alignment more generally. If we knew how to insert that “note” into the code, we wouldn’t have a problem.
I meant insert the note literally as in put that exact sentence in plain text into the AGI’s computer code. Since I think I might be in a computer simulation right now, it doesn’t seem crazy to me that we could convince an AGI that we create that it might be in a computer simulation. Seabiscuit doesn’t have the capacity to tell me that I’m in a computer simulation whereas I do have the capacity of saying this to a computer program. Say we have a 1 in a 1,000 chance of creating a friendly AGI and an unfriendly AGI would know this. If we commit to having a friendly AGI that we create, create many other AGI’s that are not friendly and only keeping these other AGIs around if they do what I suggest than an unfriendly AGI might decide it is worth it to become friendly to avoid the chance of being destroyed.
I wrote about this in Singularity Rising (2012)
This is a fun thought experiment, but taken seriously it has two problems:
This is about as difficult as a horse convincing you that you are in a simulation run by AIs that want you to maximize the number and wellbeing as horses. And I don’t meant a superintelligent humanoid horse. I mean an actual horse that doesn’t speak any human language. It may be the case that the gods created Man to serve Horse, but there’s not a lot Seabiscuit can do to persuade you one way or the other.
This is a special case of solving alignment more generally. If we knew how to insert that “note” into the code, we wouldn’t have a problem.
I meant insert the note literally as in put that exact sentence in plain text into the AGI’s computer code. Since I think I might be in a computer simulation right now, it doesn’t seem crazy to me that we could convince an AGI that we create that it might be in a computer simulation. Seabiscuit doesn’t have the capacity to tell me that I’m in a computer simulation whereas I do have the capacity of saying this to a computer program. Say we have a 1 in a 1,000 chance of creating a friendly AGI and an unfriendly AGI would know this. If we commit to having a friendly AGI that we create, create many other AGI’s that are not friendly and only keeping these other AGIs around if they do what I suggest than an unfriendly AGI might decide it is worth it to become friendly to avoid the chance of being destroyed.