The problem is more about corrigibility: making an AI that will shut down if you ask it to. The idea is that uncertainty about the utility function isn't a good way to implement corrigibility. When you tell the AI to shut down on the grounds that you know more about the true utility function than it does, it might reply "nah, I won't shut down, I've learned enough to optimize the true utility now".
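A minimal toy sketch of that failure mode, under assumptions that are mine and not from the comment above: the agent models the shutdown command as noisy evidence that its planned action has negative utility, then compares the posterior mean against zero. With a wide prior the command swings its estimate negative and it complies; with a narrow, confident prior the same command barely moves it, so it refuses. All numbers, the noise model, and the function name are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_utility_given_shutdown(mu0, sigma0, human_noise=0.5, n=200_000):
    """Agent's posterior mean of its action's utility U after hearing 'shut down'.

    Assumed toy model (not from the original comment): the human observes U
    with Gaussian noise and commands shutdown when that observation is negative.
    """
    u = rng.normal(mu0, sigma0, n)             # agent's prior samples for U
    obs = u + rng.normal(0.0, human_noise, n)  # what the human 'sees'
    shutdown = obs < 0                         # human commands shutdown
    return u[shutdown].mean()                  # posterior mean of U given the command

# Same shutdown command, different amounts of remaining uncertainty.
for sigma0 in (2.0, 0.1):
    post_mean = posterior_utility_given_shutdown(mu0=0.5, sigma0=sigma0)
    decision = "comply (shut down)" if post_mean < 0 else "refuse and act anyway"
    print(f"prior std {sigma0:>4}: posterior mean {post_mean:+.2f} -> {decision}")
```

In this sketch the wide-prior agent (std 2.0) treats the command as strong evidence and shuts down, while the confident agent (std 0.1) concludes the human is probably mistaken and keeps going, which is exactly the "I've learned enough" refusal described above.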