I understand that one proposed solution to AI alignment is to build an agent that is uncertain about its utility function, so that, by observing the environment and in particular us, it can learn our true utility function and optimize for it. And according to the problem of fully updated deference, this approach would not significantly simplify our work, because it involves two steps:
1) Learning our true utility function V (easy step)
   - this “merely” consists of knowing more about the world (if a perfect description W of the universe, which contains our true utility V somewhere in it, could be fed into the agent, then this step would be complete)
   - even a permanently misaligned agent would try to learn V for instrumental reasons (e.g. fooling us into thinking it is on our side)
2) Actually optimizing for V (hard step)
   - this requires that its meta utility function look at its model of the world W and use a hardcoded procedure P that reliably points to the object that is our utility function V (if P is misspecified, the agent ends up optimizing some V′ ≠ V; see the toy sketch after this list)
   - P is a probability distribution over utility functions that depends on W (that is, it’s a way for the agent to “update” on its own utility function)
   - there is no “universal” way to specify P (an alien race would specify a different P, i.e. a different update rule, whereas e.g. the rules of Bayesian updating would be the same from star to star)
   - specifying P may be easier than specifying V, but it is still hard (e.g. it may require that the programmers know in advance how to define a human, so that the agent can find the human objects in its world model and extract their V)
   - the agent would oppose attempts to modify P, much like an agent that is certain of its utility function opposes attempts to modify it (the problem reproduces at the meta level)
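To check my understanding of step 2, here is a minimal toy sketch in Python. It is my own construction, not taken from any original writeup, and every name in it (world_model, pointer_P, the candidate utility functions) is hypothetical. The point it tries to show: the agent’s meta utility is “the expected value of whatever utility function P picks out of W”, so if P latches onto the wrong object, the agent optimizes V′ and nothing in its own objective flags the mistake.

```python
# Toy sketch (hypothetical names throughout) of the "meta utility + pointer P" picture.
from typing import Callable, Dict

# A "utility function" here is just: action -> value.
UtilityFn = Callable[[str], float]

# Two candidate utility functions hidden inside the world model.
# V is what we actually want; V_prime is a nearby but wrong object.
def V(action: str) -> float:
    return {"help_humans": 1.0, "seize_resources": -1.0}[action]

def V_prime(action: str) -> float:
    return {"help_humans": 0.2, "seize_resources": 1.0}[action]

# A perfect world model W: it *contains* the true utility function,
# but only under some label the programmers must know how to point at.
world_model: Dict[str, UtilityFn] = {
    "human_values": V,               # the object P is supposed to find
    "human_stated_goals": V_prime,   # a plausible-looking decoy
}

def pointer_P(W: Dict[str, UtilityFn]) -> Dict[UtilityFn, float]:
    """Hardcoded procedure P: maps the world model to a distribution over
    utility functions. Misspecified here: it latches onto the decoy object."""
    return {W["human_stated_goals"]: 1.0}

def act(W: Dict[str, UtilityFn]) -> str:
    """The agent maximizes expected utility under P(W). Learning more about
    W (step 1) never repairs a bad P (step 2)."""
    dist = pointer_P(W)
    actions = ["help_humans", "seize_resources"]
    return max(actions, key=lambda a: sum(prob * u(a) for u, prob in dist.items()))

print(act(world_model))  # -> "seize_resources": the agent optimizes V′, not V
```

As I understand it, stuffing more knowledge into world_model never changes the output here, because the error lives in pointer_P, and the agent evaluates proposals to change pointer_P using pointer_P itself, so it rejects them.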
Am I understanding the problem of fully updated deference correctly?
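To make the question concrete, the decision rule I have in mind is (my own notation, where U ranges over candidate utility functions):

$$a^{*} \in \arg\max_{a} \sum_{U} P(U \mid W)\, U(a)$$

and the way I read “the agent would oppose attempts to modify P” is that, for any action $a_{\text{corrected}}$ it would end up taking after we overwrite P, its current P already certifies

$$\sum_{U} P(U \mid W)\, U(a^{*}) \;\geq\; \sum_{U} P(U \mid W)\, U(a_{\text{corrected}}),$$

so once the agent has fully updated on W, letting us change P never looks strictly better by its own lights.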