How about you ask the AI, "If you were to ask a counterfactual version of yourself living in a world where the president died, what would it advise you to do?" This counterfactual AI is motivated to take nice actions, so it would advise the real AI to take nice actions as well, right?
This counterfactual AI is only motivated to take nice actions in worlds where the president died. It might not even know what "nice" means in other worlds.
And even if the real AI knew the correct answer to that question, how can you be sure it wouldn't lie to you instead, in order to achieve its real goals? You can't really trust the AI if you aren't sure it is nice, or at least indifferent...