Fwiw I don’t find the problem of fully updated deference very compelling. My real rejection of utility uncertainty in the superintelligent-god-AI scenario is:
- It seems hard to create a distribution over utility functions that is guaranteed to include the truth (or at least to assign it non-trivial probability). It’s been a while since I read it, but I think this is the point of Incorrigibility in the CIRL Framework.
- It seems hard to correctly interpret your observations as evidence about utility functions. In other words, the likelihood p(obs∣utility fn) is arbitrary and not a fact about the world, so there’s no way to ensure you get it right. This is pointed at somewhat by your first link.
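Spelled out (my notation, not taken from either linked post), the update both bullets are about is

$$p(U \mid o_{1:t}) \;\propto\; p(U)\,\prod_{i=1}^{t} p(o_i \mid U),$$

where the prior $p(U)$ is what the first bullet worries about and the likelihood $p(o_i \mid U)$ is what the second worries about: nothing in the data can correct a bad choice of either.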
If we somehow magically made these problems vanish, maximizing expected utility under that distribution seems fine, even though the resulting AI system would prevent us from shutting it down. It would be aligned but not corrigible.
I could imagine an efficient algorithm that could be said to be approximating a Bayesian agent with a prior including the truth, but I don’t say that with much confidence.
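To make the moving parts concrete, here is a toy sketch of the kind of agent I have in mind (entirely my own construction: the action set, the candidate utility functions, and the softmax observation model are all made up for illustration). Both worries show up directly in the code: the true utility function may simply not be in HYPOTHESES, and likelihood() is an arbitrary modelling choice.

```python
from math import exp

# Toy sketch: an agent that keeps a posterior over a finite set of candidate
# utility functions, updates it on observed human requests via a
# designer-chosen likelihood model, and acts by maximizing
# posterior-expected utility.

ACTIONS = ["clean_kitchen", "clean_bedroom", "do_nothing"]

# Candidate utility functions (action -> utility). Whether the *true* utility
# function is in this set at all is the first worry above.
HYPOTHESES = {
    "tidy_kitchen": {"clean_kitchen": 1.0, "clean_bedroom": 0.3, "do_nothing": 0.0},
    "tidy_bedroom": {"clean_kitchen": 0.3, "clean_bedroom": 1.0, "do_nothing": 0.0},
    "leave_me_alone": {"clean_kitchen": -0.5, "clean_bedroom": -0.5, "do_nothing": 1.0},
}

def likelihood(request: str, utility: dict, beta: float = 2.0) -> float:
    """Assumed observation model: the human requests an action with probability
    proportional to exp(beta * utility). This is a design choice, not a fact
    about the world (the second worry above)."""
    weights = {a: exp(beta * u) for a, u in utility.items()}
    return weights[request] / sum(weights.values())

def update(posterior: dict, request: str) -> dict:
    """Bayes rule over the finite hypothesis set."""
    unnorm = {h: posterior[h] * likelihood(request, u) for h, u in HYPOTHESES.items()}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

def act(posterior: dict) -> str:
    """Maximize posterior-expected utility. Nothing here makes the agent accept
    shutdown unless the hypotheses themselves value it."""
    def expected_utility(a):
        return sum(posterior[h] * HYPOTHESES[h][a] for h in HYPOTHESES)
    return max(ACTIONS, key=expected_utility)

posterior = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}  # uniform prior
for request in ["clean_kitchen", "clean_kitchen"]:          # observed requests
    posterior = update(posterior, request)
print(posterior, act(posterior))
```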
I agree with the second bullet point, but I’m not so convinced this is prohibitively hard. That said, not only would we have to make our (arbitrarily chosen) p(obs | utility fn) un-game-able; on one reading of my original post, we would also have to ensure that, by the time the agent was no longer gaining much new information, it already had a pretty good grasp of the true utility function. This requirement might reduce to a concept like identifiability of the optimal policy.
Identifiability of the optimal policy seems too strong: it’s basically fine if my household robot doesn’t figure out the optimal schedule for cleaning my house, as long as it’s cleaning it somewhat regularly. But I agree that conceptually we would want something like that.
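One way to write down the weaker requirement I have in mind (my attempt at a formalization, not something from the CIRL paper): rather than identifying the true utility function $U^*$, it is enough that, once the agent has stopped learning much, any action it picks by maximizing posterior-expected utility is near-optimal under the truth:

$$a_t \in \arg\max_{a}\, \mathbb{E}_{U \sim p(U \mid o_{1:t})}\!\left[U(a)\right] \;\implies\; U^*(a_t) \;\ge\; \max_{a} U^*(a) - \epsilon.$$

The household-robot example is exactly this: the chosen cleaning schedule needn’t be the argmax of $U^*$, it just has to come within an acceptable $\epsilon$ of it.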