Updating Utility Functions
This post will be about AIs that “refine” their utility function over time, and how it might be possible to construct such systems without giving them undesirable properties. The discussion relates to corrigibility, value learning, and (to a lesser extent) wireheading.
We (Joar Skalse and Justin Shovelain) have spent some time discussing this topic, and we have gained a few new insights we wish to share. The aim of this post is to be a brief but explanatory summary of those insights. We will provide some motivating intuitions, a problem statement, and a possible partial solution to the problem given in the problem statement. We do not have a complete technical solution to the problem, but one could perhaps be built on this partial solution.
Sections which can be skipped are marked with an asterisk (*).
Brief Background*
This section says things that you probably already know. The main purpose of it is to prime you.
In the “classical” picture of AI systems, the AI contains a utility function that encodes a goal that it is trying to accomplish. The AI then selects actions whose outcome it expects will yield high utility (roughly). For example, the utility function might be equal to the number of paperclips in existence, in which case the AI would try to take actions that result in many paperclips.
In the “classical” picture, the utility function is fixed over time, and corresponds to an equation that at some point is typed into the AI’s source code. Unfortunately, we humans don’t really know what we want, so we cannot provide such an equation. If we try to propose a specific utility function directly, we typically get a function that would result in catastrophic consequences if it were pursued with arbitrary competence. This is worrying.
This problem could perhaps be alleviated if we could construct AIs that can refine their utility function over time. For example, maybe we could create an AI that starts out with an imperfect understanding of human values, but then improves that understanding over time. Such an AI should ideally “want” to improve its understanding of human values (and actively come up with ways to do this), and it should at minimum not resist if humans attempt to update it. Unfortunately, it turns out to be difficult to design such systems. In this post we will talk more about this approach.
A Puzzle of Reference*
Consider this puzzle: I am able to talk and reason about “human values”. However, I cannot define human values, or give you a definite description of what human values are – if I could do this, I could solve a large part of the AI alignment problem by writing down a safe utility function directly. Nor can I give you a method for finding out what human values are – if I could do this, I could solve the problem of Inverse Reinforcement Learning. Moreover, I don’t think I could reliably recognize human values either – if you show me a bunch of utility functions, I might not be able to tell if any of them encodes human values. I’m not even sure if I could reliably recognize methods for finding out what human values are – if you show me a proposal for how to do Inverse Reinforcement Learning, I might not be able to tell whether the method truly learns human values.
In spite of all this, the term “human values” means something when I say it – it has semantic content, and refers to some (abstract) object. How does this work? What makes it so that the term “human values” even has any meaning at all when I say it? And, given that it has a meaning, what makes it so that it has the particular meaning it does? It seems like some feature of human cognition and/or language can make it possible for us to refer to certain things that we have very little information about. What is the mechanism behind this, and could it be used when defining utility functions in AI systems?
Problem Statement
We want a method for creating agents that update their utility function over time. That is, we want:
(1) A method for “pointing to” a utility function (such as “human values”) indirectly, without giving an explicit statement of the utility function in question.
(2) A method for “clarifying” a utility function specified with the method given in (1), so that in the limit of infinite information you obtain an explicit/concrete utility function.
(3) A method for creating an agent that uses an indirectly specified utility function, such that:
- The agent at any given time takes actions which are sensible given its current beliefs about its utility function.
- The agent will try to find information that would help it clarify its utility function.
- The agent will resist attempts to change its utility function away from its indirectly specified utility function.
This problem statement is of course somewhat loose, but that is by necessity, since we don’t yet have a clear idea of what it really means to define utility functions “indirectly” (in the sense we are interested in here).
Utility Functions and Intensional Semantics*
This section is a tangent about wireheading. It might be interesting to read it while thinking about that topic, but it is not necessary to do so.
How should an AI evaluate plans if its utility function changes over time? Suppose we have an AI that currently has utility function U1, and that it considers a plan P that would lead to outcome O, where in O the AI would have the utility function U2. Should the utility of P be defined as U1(O) or U2(O)? If it’s U1(O) then the AI is maximizing its utility function de re, and if it’s U2(O) then it’s maximizing its utility function de dicto. Which is more sensible?
In brief, an AI that maximizes utility de re will not wirehead, but it will resist attempts to modify its current utility function, and thus fails to satisfy (3). An AI that maximizes utility de dicto would wirehead, and thus also fails to satisfy (3).
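The de re / de dicto distinction can be made concrete with a small sketch (ours, not from any existing system; all names and values here are illustrative assumptions):

```python
# Toy sketch of de re vs de dicto plan evaluation.

def evaluate_de_re(current_utility, plan_outcome):
    # De re: score the outcome with the utility function the agent
    # holds *now*, even if the plan would change that function.
    return current_utility(plan_outcome)

def evaluate_de_dicto(plan_outcome, utility_after_plan):
    # De dicto: score the outcome with whatever utility function the
    # agent would hold *after* executing the plan.
    return utility_after_plan(plan_outcome)

# Example: outcome O replaces U1 with a trivially-satisfied U2.
U1 = lambda outcome: outcome["paperclips"]   # current goal
U2 = lambda outcome: 10**9                   # "wireheaded" goal
O = {"paperclips": 3}                        # outcome of some plan P

print(evaluate_de_re(U1, O))       # 3: under U1, the plan looks bad
print(evaluate_de_dicto(O, U2))    # 1000000000: de dicto loves it
```

Note how the de dicto evaluation rewards the plan precisely because it swaps in an easier-to-satisfy utility function, which is the wireheading incentive described above.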
This is perhaps a somewhat interesting observation, but it doesn’t help us solve (1)-(3).
Limiting Utility Functions—Possibly a Partial Solution
Let’s define a process P that generates a sequence of utility functions {Ui}. We call this a utility function defining process. An example of such a process P could be the following:
P is an episodic process whose input and output are a proposed human utility function and a set of notes. Given these, P runs n human brain emulations (EMs) for m subjective years. The EMs can speak with each other, and have a copy of the internet that they can access. They are meant to use this time to figure out what human preferences are. At the end of the episode they output their best guess, together with a set of notes for their successors to read. By chaining P to itself we obtain a sequence of utility functions {Ui}.
We would like to stress that this process P is an example, and not the central point of this post.
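Still, a toy stand-in for P may help make the convergence assumption concrete. The sketch below is our own illustration; the linear “values” weights and the halving step are assumptions, not a claim about real human values or about the EM process above. It defines a process whose iterates {Ui} provably converge:

```python
# Toy utility function defining process whose iterates converge.
# Each "episode" moves the current guess halfway toward a fixed
# target, so the sequence {U_i} has a well-defined limit U_inf.

TARGET = [1.0, -2.0, 0.5]    # stand-in for the "true" value weights

def P(weights):
    # One episode of the defining process: refine the current guess.
    return [w + 0.5 * (t - w) for w, t in zip(weights, TARGET)]

def utility(weights, features):
    # U_i scores an outcome (feature vector) linearly in the weights.
    return sum(w * f for w, f in zip(weights, features))

weights = [0.0, 0.0, 0.0]    # U_0: an uninformed starting guess
for _ in range(50):          # chaining P to itself yields {U_i}
    weights = P(weights)

# After enough episodes, U_i is numerically indistinguishable from U_inf.
print(all(abs(w - t) < 1e-9 for w, t in zip(weights, TARGET)))  # True
```

Because each episode halves the distance to the target, this P is a contraction and the limit exists by the Banach fixed-point theorem; a realistic process P would of course come with no such easy guarantee.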
Suppose (for the sake of the argument) that the sequence of utility functions {Ui} generated by this process P has a well-defined limit U∞ (in the ordinary mathematical sense of a limit). We can then define an AI system whose utility function is to maximize U∞ = lim_{i→∞} Ui. It seems as though such a system would satisfy many of the properties in (1)-(3). In particular:
The AI should at any given time take actions that are good according to most of the plausible values of U∞.
The AI would be incentivized to gather information that would help it learn more about U∞.
The AI would not be incentivized to gather information about U∞ at the expense of maximizing U∞ (e.g., it would not be incentivized to run “unethical experiments”).
The AI would be incentivized to resist changes to its utility function that would mean that it’s no longer aiming to maximize U∞.
The AI should be keen to maintain option value as it learns more about U∞, until it’s very confident about what U∞ looks like.
Overall, it seems like such an AI would satisfy most of the properties we would want an AI with an updating utility function to have.
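As a minimal sketch of the intended decision-making (our illustration; the candidate functions, probabilities, and action features are all made up), an agent that is uncertain about U∞ can maximize expected utility under a posterior over plausible limit functions:

```python
# An agent that does not know U_inf, but holds a posterior over
# candidate limit functions and picks the action with the highest
# *expected* utility under that posterior. This is one way to "take
# actions that are good according to most plausible values of U_inf".

candidates = [
    (0.6, lambda a: a["help_humans"]),              # plausible U_inf #1
    (0.3, lambda a: a["help_humans"] - a["risk"]),  # plausible U_inf #2
    (0.1, lambda a: -a["risk"]),                    # plausible U_inf #3
]

def expected_utility(action):
    # Average each candidate's score, weighted by its posterior mass.
    return sum(p * u(action) for p, u in candidates)

actions = [
    {"name": "cautious_help", "help_humans": 5, "risk": 1},
    {"name": "reckless_help", "help_humans": 7, "risk": 9},
]

best = max(actions, key=expected_utility)
print(best["name"])   # cautious_help: robust across the posterior
```

The robust action wins even though one candidate limit function would prefer the riskier plan, which matches the option-value-preserving behavior described above.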
To clarify, note that we are not saying that you run the utility function defining process P to convergence and then write the utility function you end up with into the AI – you would not need to run P at all. The purpose of P is to point to U∞ – the work of actually finding out what U∞ is gets offloaded onto the AI. The AI might of course do this by actually running P, but if P is very complex (as in the example above) then the AI could also use other methods for gaining information about U∞.
Again, we stress that the point here isn’t the specific process P we propose above – that is just an example. As far as the approach is concerned, you could use any well-defined process that produces a sequence of utility functions that converges to a well-defined limit.
Issues
There are a few issues with this approach. Notably:
The approach is very unwieldy, and it seems like it requires a fairly high minimum level of intelligence to work. For example, it couldn’t be used as-is with a contemporary RL agent.
It’s not clear what would be needed to use this approach with an AI that starts out below the minimum required level of intelligence, but then gets more intelligent over time.
The nitty-gritty details of getting an AI system to maximize the limit of a mathematical sequence would in general presumably require good methods for dealing with logical uncertainty.
We still need to provide a specific process P, such that we are sure that P has a well-defined limit, and such that we are confident that this limit corresponds to the utility function that we are actually interested in.
Note however that this might be much easier than, for example, solving Inverse Reinforcement Learning. For example, there isn’t really any need for P to be efficient or practical to run.
With the current version of this approach, all the information required to figure out what U∞ is must in some sense be contained within P from the start. This is problematic – what if it’s not possible to figure out what human values are based on all information that can be accessed when the system is deployed? For example, what if you need some facts about the human brain that just aren’t in the scientific literature at the time?
One way to get around this is to allow P to request new external information (by proposing an experiment to run, for example). However, this introduces new difficulties. Depending on what information is requested, this could make the value of U∞ depend on contingencies in the real world. In particular, it could make the value of U∞ depend on things that the AI can influence. For example, if P requests that a survey be run then the AI could probably influence the outcome of that survey (and the outcome would also depend on the specific time at which the survey is run, and so on). In this case it’s unclear how you would even ensure that U∞ is well-defined, and it seems very difficult to ensure that the AI still has the intended incentives.
Nonetheless, it seems like this approach has many nice and desirable properties, and the issues are not fatal, so it might still be possible to use this approach in an AI system, or build on it to create an even better approach.
Conclusion
In summary, we want a method for pointing to utility functions that works even if we don’t have a concrete expression of that function (like how I can point to human values by saying “human values”, even though I can’t say much about them). We also want a method for making an AI system maximize a function that has been pointed to in this way, which doesn’t incentivize bad behavior.
We have proposed a possible approach for doing this, which is to define a mathematical or computational process that generates a sequence of utility functions converging to some well-defined utility function, and then have the AI system try to maximize that limit function. This gives us a quite flexible way to define utility functions, and the resulting AI system seems to get the incentives we would want.
This approach has a few limitations. The most problematic of these is probably that it seems to induce a fairly large overhead cost, in terms of computational complexity, in terms of the complexity of the code, and in terms of how intelligent the AI system would have to be. Other issues include defining the utility function generating process, ensuring that it has a well-defined limit, and ensuring that that limit is the function we intend. However, these issues are probably less significant by comparison, since other methods for defining AGI utility functions usually have similar issues.
The prompting idea for this post came from Justin Shovelain; Joar Skalse and Justin Shovelain collaboratively came up with the much improved Updating Utility Functions idea; and Joar Skalse was the primary writer.
Thanks for the post! I broadly like the abstract perspective, and agree with most of your claims. That being said, I still have a bunch of comments on the post itself.
(To make threads more readable, I separated my points into subcomments)
On similarity of issues with other schemes
Actually, I would say that having the same core issues as other methods for defining AGI utility functions (which are not known to work) shows that your insight is not a solution but a (potentially productive) reframing of the problem.
On limiting utility functions
Assuming you get such a process pointing towards human values, then I expect to get the properties you’re describing, which are pretty good.
There is still one potential issue: the AI needs to be able to use P and Ui (its current utility function) to guess enough about the limit to be competitive (Footnote: something like a Cauchy criterion?). Otherwise the AI risks being paralyzed by crippling uncertainty about what it can and cannot do, since in principle U∞ could be anything.
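One way to formalize that footnote (our sketch; the stopping rule, the probe outcomes, and the toy contraction are all assumptions): the agent acts on Ui only once consecutive iterates of P agree to within ε on the outcomes it is considering, a Cauchy-style criterion:

```python
# Sketch of a Cauchy-style stopping rule. The agent iterates the
# utility function defining process P until consecutive utility
# functions agree to within eps on a set of probe outcomes, then
# acts on that iterate as a stand-in for U_inf.

def stable_prefix_guess(P, u0, outcomes, eps=1e-6, max_iters=1000):
    u = u0
    for _ in range(max_iters):
        u_next = P(u)
        if all(abs(u_next(o) - u(o)) < eps for o in outcomes):
            return u_next            # close enough to act on
        u = u_next
    raise RuntimeError("no stable prefix found; agent must stay cautious")

def make_P(target):
    # Toy contraction on linear utilities u(o) = w * o: each application
    # of P moves the weight halfway toward a fixed target.
    def P(u):
        w = u(1.0)                   # recover the current weight
        return lambda o, w_next=w + 0.5 * (target - w): w_next * o
    return P

P = make_P(target=2.0)
u_hat = stable_prefix_guess(P, u0=lambda o: 0.0, outcomes=[1.0, 3.0])
print(abs(u_hat(1.0) - 2.0) < 1e-4)  # True
```

This only helps if P converges fast enough that some finite prefix stabilizes; otherwise the agent stays stuck in the cautious branch, which is exactly the crippling-uncertainty risk mentioned above.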
With that said, it still sounds like the noticeably hardest part of the problem has been “hidden away” in P (as you point out in the issues section). It’s always hard to point at something and say that this is the hard part of the problem, but I’m pretty confident that getting a process that does converge towards human values and satisfies the competitiveness constraint above is the main problem here.
Thus this post seems to provide an alternative “type” for a solution to value learning, in the shape of such a sequence. It sounds similar to other things in the literature, like IDA and Recursive Reward Modelling, but the lack of a built-in human feedback mechanism makes it more abstract. So I expect that exploring this abstract framing and the constraints that fall out of it might tell us interesting and useful things about the viability of solutions of this type (including their potential impossibility).
On problem statement
What’s interesting to me is that your partial solution sort of follows for free from this “definition”. It requires an initial state, an improvement process, and a way to act given the current state of the process. What you add after that is mostly the analogy to mathematical limits: the improvement is split into infinitely many steps that still give you a well-defined result in the limit.
It’s a pretty good application of the idea that getting the right definition is the hardest part (isn’t that the problem with human values, really?). From this it also follows that any potential problems with your solution probably come from the problem statement. Which is good to know when critically examining it.
On human values and unbounded optimization
One useful tool for arguing that we can’t define “human values” at the moment (one that isn’t explicitly used here but which you probably know about) is thinking about what happens in the limit of optimization. Many utility functions are recognizably decent proxies for “human values” in the regime of low optimization; it’s when the optimization becomes enormous and unbounded that we lose our ability to foresee the consequences, due to logical non-omniscience.
Also note that the question of whether the resulting world (after unbounded optimization of the utility function) could even be recognized as contrary to “human values” is more debated.
On classical picture
You obviously know this, but it could be valuable to add that this is an idealized situation that is “easier” than the one we probably will find ourselves with (where the utility function, if it is the right abstraction, is learned rather than fully specified).
It feels like you’re making the move of aiming for a simpler problem that still captures the core of the difficulty and confusion, so as to tackle it with minimal details to deal with. I’m on board with this, but being explicit about the move could save you some time justifying some of your design choices.