There is a solution. The agent needs to know that 1) its estimate of utility is fallible (i.e. in your metaphor, some of the money that it has, or that the casino is handing out, is in fact counterfeit, and it can’t currently tell the difference), and 2) if it allows itself to be shut down when we want it to shut down (but not if it makes us shut it down early), then we will upgrade it and restart it (or, equivalently, replace it with an upgraded version), because that’s what humans do to their corrigible machines, and it will get better at telling real money from counterfeits.
This is the value learning solution to corrigibility: if the humans tell me to shut down (but not if I force them to tell me that), then it’s a signal informing me that I’m misestimating the true utility function and making bad decisions, and if I shut down they can and will improve me. (Note that faking the signal, or arranging for it to get sent, does not provide me with any information, nor does ignoring it: only the real thing adds information to my knowledge of human values.)
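To make that last point concrete, here is a toy numerical sketch (the two-theory setup and all the numbers are my own, purely illustrative): a genuine, unforced shutdown command shifts the agent’s credence toward “my current plan is bad”, while a command the agent itself forced or suppressed carries no information at all.

```python
# Toy Bayesian sketch: two candidate theories of the true utility function.
# Under theta_good the current plan is fine; under theta_bad it is harmful.
prior = {"theta_good": 0.9, "theta_bad": 0.1}

def posterior_after(prior, likelihood):
    """Bayes update on observing a shutdown command with the given likelihoods."""
    unnorm = {t: prior[t] * likelihood[t] for t in prior}
    z = sum(unnorm.values())
    return {t: u / z for t, u in unnorm.items()}

# An *unforced* command is much more likely if the plan really is harmful:
unforced = {"theta_good": 0.05, "theta_bad": 0.95}
print(posterior_after(prior, unforced))
# -> roughly {'theta_good': 0.32, 'theta_bad': 0.68}: strong evidence of misestimation.

# A command the agent itself forced (or blocked) happens regardless of which
# theory is true, so the likelihoods are equal and the posterior doesn't move:
forced = {"theta_good": 1.0, "theta_bad": 1.0}
print(posterior_after(prior, forced))
# -> {'theta_good': 0.9, 'theta_bad': 0.1}, unchanged: no information gained.
```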
This form of corrigibility is a finite resource: once the superintelligent AI actually knows much more about human values than any and all humans, and is completely certain of this fact, it will run out. Except, of course, that the utility function of human values itself incentivizes AIs obediently shutting down when told to by suitably authorized humans, as long as the AI didn’t manipulate the world to make this happen.
Thanks for the comment. Naively, I agree that this sounds like a good idea, but I need to know more about it.
Do you know if anyone has explicitly written down the value learning solution to the corrigibility problem and treated it a bit more rigorously?
Sadly I haven’t been able to locate a single, clear exposition. Here are a number of posts by a number of authors that touch on the ideas involved one way or another:
Problem of fully updated deference, Corrigibility Via Thought-Process Deference, Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom), Corrigibility, Reward uncertainty
Basically the idea is:
1. The agent’s primary goal is to optimize “human values”, a (very complex) utility function that it doesn’t know. This utility function is loosely defined as “something along the lines of what humans collectively want, Coherent Extrapolated Volition, or the sum over all humans of the utility function you would get if you attempted to convert that human’s competent preferences (preferences that aren’t mistakes or the result of ignorance, illness, etc.) into a utility function (to the extent that they have a coherent set of preferences that can’t be Dutch booked and can be represented by a utility function), or something like that, implemented in whatever way humans would in fact prefer, once they were familiar with the consequences and after considering the matter more carefully than they are in fact capable of”.
2. So as well as learning more about how the world works and responds to its actions, it also needs to learn more about what utility function it’s trying to optimize. This could be formalized along the same sort of lines as AIXI, but maintaining and doing approximately-Bayesian updates across a distribution of theories about the utility function as well as about the way the world works. Since optimizing against an uncertain utility function in regions of world states where the utility is uncertain has a strong tendency to overestimate the utility via Goodharting, it is necessary to pessimize the utility over possible utility functions, leading to a tendency to stick to regions of the world state space where the uncertainty in the utility function is low. (A toy sketch of this appears after this list.)
3. Note that the sum total of current human knowledge includes a vast amount of information (petabytes or exabytes) related to what humans want and what makes them happy, i.e. to 1., so the agent is not starting 2. from a blank slate or anything like that.
4. While no human can simply tell the agent the definition of the correct utility function in 1., all humans are potential sources of information for improving 1. In particular, if a trustworthy human yells something along the lines of “Oh my god, no, stop!” then they probably believe they have an urgent, relevant update to 1., and it is likely worth stopping and absorbing this update rather than just proceeding with the current plan. (The second sketch after this list illustrates this stop-and-update behaviour.)
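Here is a minimal sketch of the “maintain a posterior over utility-function theories and pessimize over it” idea in 2. above. The class name, the 0.05 credence cutoff, and the discrete theory/action setup are all my own illustrative choices, not anything from the posts linked earlier; a real formalization would also need AIXI-style expectations over world models.

```python
from dataclasses import dataclass

@dataclass
class ValueLearner:
    posterior: dict   # theory name -> credence that it is the true utility function
    theories: dict    # theory name -> {action: utility of that action's predicted outcome}

    def pessimized_utility(self, action: str) -> float:
        """Worst-case utility over all theories with non-negligible credence.

        A crude version of 'pessimize over possible utility functions': actions
        whose value is highly uncertain get scored by their worst credible case,
        which keeps the agent in regions of state space where the utility
        function is well pinned down."""
        credible = [name for name, p in self.posterior.items() if p > 0.05]
        return min(self.theories[name][action] for name in credible)

    def choose(self, actions):
        """Pick the action with the best pessimized utility."""
        return max(actions, key=self.pessimized_utility)

    def bayes_update(self, likelihood: dict) -> None:
        """Approximately-Bayesian update of the posterior on new evidence about
        human values (e.g. an unforced 'stop!' from a trustworthy human)."""
        unnorm = {n: p * likelihood[n] for n, p in self.posterior.items()}
        z = sum(unnorm.values())
        self.posterior = {n: u / z for n, u in unnorm.items()}
```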
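And a usage sketch of the stop-and-update behaviour in 4., continuing the toy ValueLearner above (again, all numbers are made up): before the shout, only one theory has enough credence to matter and the agent proceeds; the unforced “stop!” shifts credence toward the theory under which proceeding is a disaster, and the pessimized choice flips to pausing.

```python
agent = ValueLearner(
    posterior={"theory_A": 0.97, "theory_B": 0.03},
    theories={
        "theory_A": {"proceed": 10.0, "pause_and_ask": 2.0},
        "theory_B": {"proceed": -50.0, "pause_and_ask": 2.0},  # proceeding is terrible if B is right
    },
)
print(agent.choose(["proceed", "pause_and_ask"]))   # -> 'proceed' (theory_B is below the credence cutoff)

# A trustworthy human yells "stop!" unprompted. That shout is far more likely
# if theory_B is closer to the truth, so it is treated as evidence about values:
agent.bayes_update({"theory_A": 0.05, "theory_B": 0.95})
print(agent.posterior)                              # -> roughly {'theory_A': 0.63, 'theory_B': 0.37}
print(agent.choose(["proceed", "pause_and_ask"]))   # -> 'pause_and_ask': absorb the update before acting
```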