Thanks for the comment. Naively, I agree that this sounds like a good idea, but I need to know more about it.
Do you know if anyone has explicitly written down the value learning solution to the corrigibility problem and treated it a bit more rigorously?
Sadly I haven’t been able to locate a single, clear exposition. Here are a number of posts by a number of authors that touch on the ideas involved one way or another:
Problem of fully updated deference
Corrigibility Via Thought-Process Deference
Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom)
Corrigibility
Reward uncertainty
Basically the idea is:
1. The agent’s primary goal is to optimize “human values”, a (very complex) utility function that it doesn’t know. This utility function is loosely defined as “something along the lines of what humans collectively want, Coherent Extrapolated Volition, or the sum over all humans of the utility function you would get if you attempted to convert that human’s competent preferences (preferences that aren’t mistakes or the result of ignorance, illness, etc.) into a utility function (to the extent that they have a coherent set of preferences that can’t be Dutch booked and can be represented by a utility function), or something like that, implemented in whatever way humans would in fact prefer, once they were familiar with the consequences and after considering the matter more carefully than they are in fact capable of”.
2. So as well as learning more about how the world works and responds to its actions, the agent also needs to learn more about what utility function it’s trying to optimize. This could be formalized along the same sort of lines as AIXI, but maintaining and doing approximately-Bayesian updates across a distribution of theories about the utility function as well as about the way the world works. Since optimizing against an uncertain utility function in regions of world-state space where the utility is uncertain has a strong tendency to overestimate the utility via Goodharting, it is necessary to pessimize the utility over the possible utility functions, leading to a tendency to stick to regions of the world-state space where the uncertainty in the utility function is low (see the first sketch below this list).
3. Note that the sum total of current human knowledge includes a vast amount of information (petabytes or exabytes) related to what humans want and what makes them happy, i.e. to 1., so the agent is not starting 2. from a blank slate or anything like that.
4. While no human can simply tell the agent the definition of the correct utility function, all humans are potential sources of information for improving its estimate of 1. In particular, if a trustworthy human yells something along the lines of “Oh my god, no, stop!” then they probably believe they have an urgent, relevant update to 1., and it is likely worth stopping and absorbing this update rather than just proceeding with the current plan (the second sketch below illustrates this).
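To make 2. a little more concrete, here is a minimal numerical sketch in Python (using NumPy). It is not anyone’s actual proposal: the candidate utility functions, the evidence likelihoods, and the quantile-based pessimism rule are all invented for illustration. The agent keeps a posterior over a handful of candidate utility functions, updates it on evidence about human preferences, and scores world states by a low posterior-weighted quantile of utility across hypotheses rather than by the posterior mean.

```python
import numpy as np

# Illustrative only: a tiny value learner with three candidate utility
# functions over five world states. All numbers are invented for the sketch.
rng = np.random.default_rng(0)
n_states, n_hypotheses = 5, 3
utilities = rng.normal(size=(n_hypotheses, n_states))  # U_h(s) for each hypothesis h
prior = np.full(n_hypotheses, 1.0 / n_hypotheses)      # P(h): prior over hypotheses

def bayes_update(p_h, likelihoods):
    """P(h | evidence) is proportional to P(evidence | h) * P(h)."""
    unnorm = p_h * likelihoods
    return unnorm / unnorm.sum()

def pessimized_utility(p_h, utilities, state, quantile=0.1):
    """Score a state by a low posterior-weighted quantile of utility across
    hypotheses, instead of the posterior mean. States where the hypotheses
    disagree get dragged down, which is the anti-Goodhart pressure."""
    u = utilities[:, state]
    order = np.argsort(u)                 # utilities from worst to best
    cum = np.cumsum(p_h[order])           # cumulative posterior mass
    idx = np.searchsorted(cum, quantile)  # first hypothesis past the quantile
    return u[order[min(idx, len(u) - 1)]]

# Evidence about human preferences (e.g. observed choices) gives a likelihood
# for each hypothesis; the agent updates and then acts pessimistically.
posterior = bayes_update(prior, np.array([0.7, 0.2, 0.1]))
best_state = max(range(n_states), key=lambda s: pessimized_utility(posterior, utilities, s))
print("posterior over hypotheses:", posterior)
print("state chosen under pessimism:", best_state)
```

A weighted quantile is just one way to pessimize; a hard minimum or a mean-minus-variance penalty gives the same qualitative behaviour of preferring states where the surviving hypotheses agree.

And a hypothetical continuation for 4.: the human’s “stop!” is treated as an observation with a likelihood under each utility hypothesis, not as a hard override. All the numbers below are made up; the point is just that hypotheses under which the current plan is harmful predict the shout much more strongly, so hearing it shifts the posterior and the expected value of continuing collapses below the value of pausing to ask.

```python
import numpy as np

# Hypothetical numbers throughout. Three hypotheses about the true utility of
# the agent's current plan, plus the option of pausing to consult the human.
plan_utility  = np.array([+10.0, +2.0, -50.0])   # U_h(continue current plan)
pause_utility = np.array([0.0, 0.0, 0.0])        # U_h(stop and ask)
posterior     = np.array([0.6, 0.3, 0.1])        # current beliefs over hypotheses

# A trustworthy human is far more likely to yell "stop!" under hypotheses
# where the plan is actually harmful.
p_stop_given_h = np.array([0.01, 0.10, 0.95])

def bayes_update(p_h, likelihoods):
    unnorm = p_h * likelihoods
    return unnorm / unnorm.sum()

post_after_stop = bayes_update(posterior, p_stop_given_h)

print("E[continue] before the shout:", posterior @ plan_utility)        # ~ +1.6, looks fine
print("E[continue] after the shout: ", post_after_stop @ plan_utility)  # ~ -35, dominated by the bad hypothesis
print("E[pause] after the shout:    ", post_after_stop @ pause_utility) # 0, so stopping wins

# The agent stops not because it is hard-coded to obey, but because the shout
# is strong Bayesian evidence that its current estimate of 1. is badly wrong.
```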