You’re asking the wrong question—partly because of confusion over the term ‘utility function’.
We want the AI to embody human values through a utility function that is a reasonable approximation of the hypothetical ideal human group utility function: the one that some large organization of humans (or humanity as a whole) would converge on if they had unbounded time to reach consensus on the actions the AI takes.
That ideal utility function is, for practical purposes, impossible to define directly or hand-engineer; it's far too complex.
To illustrate why, consider the much simpler problem of a narrow AI that just recognizes images: a computer vision system. The vision AI takes an image as input and produces an action as output. The ideal utility function over (input, output) pairs is again defined by the action a committee of humans would take given enough time. We don't actually hand-engineer the decision utility function for vision: again, it's too complex. Instead, the best approach is to define the vision system's utility function indirectly, based on labeled examples. Defining the system's goals that way leads to a tractable inference problem with a well-defined optimization criterion.
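To make that concrete, here is a toy sketch (my own illustration, with made-up data and a hypothetical `utility` function, not anything from a real vision system): the only "goal" we ever write down is agreement with human-provided labels, and optimizing it is an ordinary, well-defined inference problem.

```python
# Toy sketch: the vision system's "utility function" is defined only indirectly,
# as agreement with human labels. All names and data here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins: 100 "images" as 64-dim feature vectors, labels supplied by human annotators.
images = rng.normal(size=(100, 64))
human_labels = rng.integers(0, 10, size=100)

weights = np.zeros((64, 10))

def utility(weights):
    """Mean log-probability of the human labels: the only 'goal' we ever wrote down."""
    logits = images @ weights
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return np.log(probs[np.arange(len(human_labels)), human_labels]).mean()

# Optimizing this criterion is tractable and well defined, even though nobody
# hand-engineered a rule for what makes an image a "cat".
for _ in range(200):
    logits = images @ weights
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grad = images.T @ (np.eye(10)[human_labels] - probs) / len(images)
    weights += 0.5 * grad   # gradient ascent on the utility

print("utility after training:", round(float(utility(weights)), 3))
```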
The same general approach can scale up to more complex AGI systems. To avoid the need for huge hand-labeled training datasets, we can use techniques such as inverse reinforcement learning, where we first use an inference procedure to recover estimates of human utility functions. We can then use these recovered utility functions in a general reinforcement learning framework as a replacement for a hardwired reward function (as in AIXI).
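Roughly, the pipeline looks like the sketch below. This is a deliberately crude stand-in for real IRL methods (e.g. maximum-entropy IRL), written by me for illustration; the demonstration data and the `inferred_reward` function are hypothetical.

```python
# Simplified two-step sketch: (1) infer a reward function from human demonstrations,
# (2) hand that inferred reward to an ordinary RL learner in place of a hardwired score.
import numpy as np

rng = np.random.default_rng(1)

n_states, n_features = 5, 3
state_features = rng.normal(size=(n_states, n_features))

# Hypothetical human demonstrations: sequences of visited states.
demos = [[0, 2, 4], [0, 1, 4], [0, 2, 3, 4]]

# Empirical feature expectations of the demonstrators.
demo_fe = np.mean([state_features[s] for traj in demos for s in traj], axis=0)

# Crude "inverse RL" step: pick reward weights so that states resembling the
# demonstrated ones score highly (a stand-in for a proper IRL algorithm).
reward_weights = demo_fe / (np.linalg.norm(demo_fe) + 1e-8)

def inferred_reward(state):
    """Recovered estimate of the human utility over world states."""
    return float(state_features[state] @ reward_weights)

# inferred_reward can now replace a hardwired score function in any standard RL loop.
print([round(inferred_reward(s), 3) for s in range(n_states)])
```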
So, in short, the goals of any complex AGI are unlikely to be explicitly written down in any language, at least not directly. Using the techniques described above, the goals/values come from training data collected from human decisions. The challenge then becomes building a training program that covers a significant portion of the space of human ethics and morality. Eventually we will be able to do that using virtual reality environments, but there may be even easier techniques involving clever uses of brain imaging.
I can agree with some of your points, but interestingly, many commenters prefer a very rigorously defined utility function, specified in the lowest-level language possible, over your heuristically developed one, because they argue that its exact functionality has to be provable.
The types of decision utility functions that we can define precisely for an AI are exactly the kind that we absolutely do not want: namely, the class of model-free reward functions. That works for training an agent to play Atari games, where the simulated environment supplies the score function, but it just doesn't scale to the real world, which doesn't come with a convenient predefined utility function.
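As a toy illustration of what I mean by a precisely definable, model-free reward (this is my own stand-in, not a real Atari setup): the "score" is whatever the simulator emits, and the agent optimizes it with no reference to human values at all.

```python
# Toy Q-learning loop: the reward is hand-written into the environment itself.
import random

def env_step(state, action):
    """Toy environment: the 'score' is defined entirely by the simulator."""
    next_state = (state + action) % 5
    reward = 1.0 if next_state == 4 else 0.0   # precise, hand-written reward
    return next_state, reward

q = {(s, a): 0.0 for s in range(5) for a in (0, 1)}
state = 0
for _ in range(2000):
    # epsilon-greedy action selection
    if random.random() < 0.1:
        action = random.choice((0, 1))
    else:
        action = max((0, 1), key=lambda a: q[(state, a)])
    nxt, r = env_step(state, action)
    q[(state, action)] += 0.1 * (r + 0.9 * max(q[(nxt, a)] for a in (0, 1)) - q[(state, action)])
    state = nxt

# Nothing here encodes human values: the agent optimizes whatever score the simulator emits.
```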
For AGI, we need a model-based utility function, which maps internal world states to human-relevant utility values. Since that utility function depends on the AGI's internal predictive world model, you would then need to rigorously define the AGI's entire world model. That appears to be a hopelessly naive dead end. I'm not aware of any progress or research that indicates that approach is viable. Are you?
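Structurally, the problem looks something like the sketch below (my own illustration; `WorldModel` and `WorldState` are hypothetical placeholders): the utility is evaluated on states produced by the learned model, so rigorously pinning down the utility means rigorously pinning down the entire model.

```python
# Structural sketch of a model-based utility: utility is a function of the
# world model's internal predicted states, not of raw observations or scores.
from dataclasses import dataclass

@dataclass
class WorldState:
    # Whatever latent representation the AGI's predictive model happens to learn.
    features: tuple

class WorldModel:
    def predict(self, state: WorldState, action: str) -> WorldState:
        # Placeholder dynamics; in a real system this is a huge learned model.
        return WorldState(features=state.features + (action,))

def utility(state: WorldState) -> float:
    # To "rigorously define" this, you'd have to rigorously define what every
    # possible value of state.features means, i.e. the entire world model.
    return float(len(state.features))

model = WorldModel()
s0 = WorldState(features=())
print(utility(model.predict(s0, "act")))
```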
Instead, all current research trends strongly indicate that the first practical AGI designs will be based heavily on inferring human values indirectly. Proving safety for alternate designs, even if possible, has little value if those results do not apply to the designs which will actually win the race to superintelligence.
Also, there is a whole research track in machine learning concerned with provable bounds on loss and prediction accuracy, so it's not simply true that using machine learning techniques to infer human utility functions necessitates 'heuristics' ungrounded in any formal analysis.