We want a method for creating agents that update their utility function over time. That is, we want:
1. A method for “pointing to” a utility function (such as “human values”) indirectly, without giving an explicit statement of the utility function in question.
2. A method for “clarifying” a utility function specified with the method given in (1), so that, in the limit of infinite information, you obtain an explicit/concrete utility function.
3. A method for creating an agent that uses an indirectly specified utility function, such that:
   - The agent at any given time takes actions which are sensible given its current beliefs about its utility function.
   - The agent will try to find information that would help it clarify its utility function.
   - The agent will resist attempts to change its utility function away from its indirectly specified utility function.
This problem statement is of course somewhat loose, but that is by necessity, since we don’t yet have a clear idea of what it really means to define utility functions “indirectly” (in the sense we are interested in here).
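To make the intended structure a bit more concrete, here is a minimal sketch in Python of the three ingredients: the indirect specification as a prior over candidate utility functions, clarification as updating that prior on evidence about values, and an action rule that is sensible under the current beliefs. All names and the toy likelihood are hypothetical, and the sketch deliberately leaves out the harder parts (actively seeking clarifying information, and resisting changes to the indirect specification itself).

```python
class IndirectUtilityAgent:
    """Toy agent acting on an indirectly specified utility function."""

    def __init__(self, hypotheses, prior):
        # `hypotheses` maps a name to a candidate utility function (action -> float);
        # `prior` maps the same names to initial probabilities. Together they stand in
        # for the indirect "pointer" to the intended utility function.
        self.hypotheses = hypotheses
        self.beliefs = dict(prior)

    def expected_utility(self, action):
        # Sensible behaviour now: score actions under the current belief state.
        return sum(p * self.hypotheses[h](action) for h, p in self.beliefs.items())

    def act(self, actions):
        return max(actions, key=self.expected_utility)

    def clarify(self, likelihood, observation):
        # Clarification step: reweight hypotheses by how well they explain an
        # observation about the intended values. In the limit of enough observations
        # the beliefs would (hopefully) concentrate on one explicit utility function.
        posterior = {h: p * likelihood(h, observation) for h, p in self.beliefs.items()}
        total = sum(posterior.values()) or 1.0
        self.beliefs = {h: p / total for h, p in posterior.items()}


# Toy usage: two candidate utility functions, one observation favouring the first.
agent = IndirectUtilityAgent(
    hypotheses={"U1": lambda a: a, "U2": lambda a: -a},
    prior={"U1": 0.5, "U2": 0.5},
)
agent.clarify(lambda h, obs: 0.9 if h == obs else 0.1, "U1")
print(agent.act([-1, 0, 1]))  # 1, the action favoured once beliefs lean towards U1
```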
On the problem statement
What’s interesting to me is that your partial solution sort of follows for free from this “definition”. It requires an initial state, an improvement process, and a way to act given the current state of the process. What you add after that is mostly the analogy to mathematical limits: the improvement is split into infinitely many steps that still give a well-defined result in the limit.
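One way to write down the limit analogy, under the (non-trivial) assumption that the clarification steps actually converge in some suitable sense:

$$U_{n+1} = \mathrm{refine}(U_n, e_n), \qquad U^{*} = \lim_{n \to \infty} U_n,$$

where $U_0$ is the initial indirect specification, $e_n$ is the information obtained at step $n$, the agent at step $n$ acts on $U_n$, and $U^{*}$ is the explicit utility function the process points to. Both $\mathrm{refine}$ and the notion of convergence are placeholders here, not something the problem statement pins down.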
It’s a pretty good application of the idea that getting the right definition is the hardest part (isn’t that the problem with human values, really?). From this it also follows that any potential problems with your solution probably come from the problem statement, which is good to know when critically examining it.