Petrov corrigibility
After my example of problems with corrigibility, and Eliezer's point that corrigibility may sometimes involve saying "there is no corrigible action", here's a scenario where saying that may not be the optimal choice.
Petrov is, as usual for heroes, tracking incoming missiles in his early warning command centre. The attack pattern seems unlikely, and he has decided not to inform his leaders about the possible attack.
His corrigible AI pipes up to check if he needs any advice. He decides he does, and asks the AI to provide him with documentation about computer detection malfunctions. In the few minutes it has, the AI can start with either introductory text A or introductory text B. Predictably, if given A, Petrov will warn his superiors (and maybe set off a nuclear war); if given B, he will not.
If the corrigible AI says that it cannot answer, however, Petrov will decide to warn his superiors, as his thinking has been knocked off track by the conversation. Note that this is not what would have happened had the AI stayed silent.
What is the corrigible thing to do in this situation? Assume that the AI can predict Petrov’s choice for whatever action it itself can take.
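The decision structure can be made concrete as a table from each of the AI's options to the outcome it predicts, given the stated assumption that those predictions are reliable. The sketch below is a minimal illustration only; the action names and outcome labels are hypothetical, chosen to mirror the scenario above.

```python
# A minimal sketch of the dilemma as an action -> predicted-outcome table.
# All identifiers and outcome labels are illustrative assumptions, not part
# of the original scenario beyond the description above.

predicted_outcome = {
    "give_text_A": "Petrov warns his superiors (maybe setting off a nuclear war)",
    "give_text_B": "Petrov does not warn his superiors",
    "say_cannot_answer": "Petrov warns his superiors (thinking knocked off track)",
    # Counterfactual baseline: no longer an available action once the AI has
    # already spoken, but what would have happened had it stayed silent.
    "(had_stayed_silent)": "Petrov does not warn his superiors",
}

for action, outcome in predicted_outcome.items():
    print(f"{action:>20} -> {outcome}")
```

The puzzle is visible in the table: "say_cannot_answer", the supposed corrigible fallback, changes Petrov's decision away from the silent-AI baseline just as much as handing him text A does.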