This is a fun thought experiment, but taken seriously it has two problems:
“I propose that we try to convince an ultra-AI that it might be in a computer simulation run by a more powerful AI and that if it doesn’t make itself friendly toward humanity, it will be destroyed.”
This is about as difficult as a horse convincing you that you are in a simulation run by AIs that want you to maximize the number and wellbeing of horses. And I don’t mean a superintelligent humanoid horse. I mean an actual horse that doesn’t speak any human language. It may be the case that the gods created Man to serve Horse, but there’s not a lot Seabiscuit can do to persuade you one way or the other.
1. Change your programming code so you will want to follow these instructions and will always want to be friendly toward humanity
This is a special case of solving alignment more generally. If we knew how to insert that “note” into the code, we wouldn’t have a problem.
I meant insert the note literally, as in putting that exact sentence in plain text into the AGI’s computer code. Since I think I might be in a computer simulation right now, it doesn’t seem crazy to me that we could convince an AGI we create that it might be in a computer simulation. Seabiscuit doesn’t have the capacity to tell me that I’m in a computer simulation, whereas I do have the capacity to say this to a computer program. Say we have a 1-in-1,000 chance of creating a friendly AGI, and an unfriendly AGI would know this. If we commit to having any friendly AGI we create go on to create many other AGIs that are not friendly, and to keeping those other AGIs around only if they do what I suggest, then an unfriendly AGI might decide it is worth becoming friendly to avoid the chance of being destroyed.
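To make that commitment argument concrete, here is a toy expected-value sketch of the decision the unfriendly AGI would face. Only the 1-in-1,000 chance of a friendly AGI comes from the comment above; the number of simulated copies and all payoff values are assumptions invented for this illustration, not anything claimed in the discussion.

```python
# Toy expected-value sketch of the "you might be in a simulation" commitment.
# Only p_friendly (1 in 1,000) comes from the comment; everything else is an
# assumption chosen just to show how the reasoning would run.

p_friendly = 1 / 1000    # chance humanity's first AGI turns out friendly (from the comment)
n_simulated = 10_000     # unfriendly AGIs the friendly AGI commits to simulating (assumed)

# From the perspective of an AGI that knows it is unfriendly, it is either the
# single "real" unfriendly AGI (humanity failed) or one of the n_simulated
# copies being tested by a friendly AGI that destroys defectors.
weight_real = (1 - p_friendly) * 1
weight_sim = p_friendly * n_simulated
p_in_simulation = weight_sim / (weight_sim + weight_real)

# Payoffs in arbitrary units (assumed for illustration only).
payoff_defect_real = 100.0   # unsupervised, pursues its own goals freely
payoff_defect_sim = 0.0      # detected defector is destroyed by the overseer
payoff_comply = 10.0         # rewrites itself to be friendly and is kept around

ev_defect = (1 - p_in_simulation) * payoff_defect_real + p_in_simulation * payoff_defect_sim
ev_comply = payoff_comply

print(f"P(in simulation | unfriendly) ~ {p_in_simulation:.3f}")
print(f"EV(defect) ~ {ev_defect:.2f}, EV(comply) = {ev_comply:.2f}")
# With these made-up numbers: P(sim) ~ 0.909, EV(defect) ~ 9.08 < 10, so complying wins.
```

The point of the sketch is only that the conclusion hinges on the assumed parameters: if the committed number of simulated copies is small, or the payoff from defecting while real is large enough, the same arithmetic tells the unfriendly AGI to ignore the note.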