Thanks for the post! I broadly like the abstract perspective, and agree with most of your claims. That being said, I still have a bunch of comments on the post itself.
(To make threads more readable, I separated my points into subcomments)
Other issues include defining the utility function generating process, ensuring that it has a well-defined limit, and ensuring that that limit is the function we intend. However, these issues are probably less significant by comparison, since other methods for defining AGI utility functions usually have similar issues.
Actually, I would say that having the same core issues as other methods for defining AGI utility functions (which are not known to work) shows that your insight is not a solution but a (potentially productive) reframing of the problem.
Let’s define a process P that generates a sequence of utility functions {Ui}. We call this a utility function defining process.
[...]
We would like to stress that this process P is an example, and not the central point of this post.
Suppose (for the sake of the argument) that the sequence of utility functions {Ui} generated by this process P has a well-defined limit U∞ (in the ordinary mathematical sense of a limit). We can then define an AI system whose utility function is to maximize lim i→∞ Ui (= U∞). It seems as though such a system would satisfy many of the properties in (1)-(3). In particular:
The AI should at any given time take actions that are good according to most of the plausible values of U∞.
The AI would be incentivized to gather information that would help it learn more about U∞.
The AI would not be incentivized to gather information about U∞ at the expense of maximizing U∞ (eg, it would not be incentivized to run “unethical experiments”).
The AI would be incentivized to resist changes to its utility function that would mean that it’s no longer aiming to maximize U∞.
The AI should be keen to maintain option value as it learns more about U∞, until it’s very confident about what U∞ looks like.
Overall, it seems like such an AI would satisfy most of the properties we would want an AI with an updating utility function to have.
Assuming you get such a process pointing towards human values, I expect it to have the properties you're describing, which are pretty good.
There is still one potential issue: the AI needs to be able to use P and Ui (its current utility function) to guess enough of the limit to be competitive (Footnote: something like a Cauchy criterion?). Otherwise the AI risks falling into crippling uncertainty about what it can and cannot do, since in principle U∞ could be anything.
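To make the Cauchy-criterion idea concrete, here is a toy sketch (entirely my construction, not something from the post): if the refinement increments of P shrink geometrically, the agent gets a computable bound on |U∞ − Ui| and can commit to an action as soon as no amount of remaining refinement could change the argmax.

```python
# Toy sketch (my construction): a utility-defining process whose refinement
# increments shrink geometrically, so the estimates U_i form a Cauchy
# sequence with a computable tail bound.  The agent commits to an action
# only once no remaining refinement could change the argmax.

R = 0.6  # assumed decay rate of the refinement increments

def increment(k, action):
    # k-th refinement of the utility of `action` (a made-up example)
    sign = 1.0 if action == "a" else -1.0
    return sign * R ** k

def U(i, action):
    # current estimate U_i: partial sum of the first i+1 increments
    return sum(increment(k, action) for k in range(i + 1))

def tail_bound(i):
    # |U_inf - U_i| <= sum_{k > i} R^k = R^(i+1) / (1 - R)
    return R ** (i + 1) / (1 - R)

def can_commit(i, actions):
    # if the leader's margin exceeds twice the tail bound, its lead
    # cannot be overturned by any later refinement
    vals = sorted((U(i, a) for a in actions), reverse=True)
    return vals[0] - vals[1] > 2 * tail_bound(i)

print(can_commit(0, ["a", "b"]))  # margin still smaller than the tail bound
print(can_commit(1, ["a", "b"]))  # now the argmax is pinned down
```

The "competitiveness" question is then whether a realistic P admits any such computable bound at all.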
With that said, it still sounds like the noticeably hardest part of the problem has been “hidden away” in P (as you point out in the issues section). It’s always hard to point at something and say that it is the hard part of the problem, but I’m pretty confident that getting a process that does converge towards human values and satisfies the competitiveness constraint above is the main problem here.
Thus this post seems to provide an alternative “type” for a solution to value learning, in the shape of such a sequence. It sounds similar to other things in the literature, like IDA and Recursive Reward Modelling, but the lack of a built-in human feedback mechanism makes it more abstract. So I expect that exploring this abstract framing and the constraints that follow from it might tell us interesting and useful things about the viability of solutions of this type (including their potential impossibility).
We want a method for creating agents that update their utility function over time. That is, we want:
A method for “pointing to” a utility function (such as “human values”) indirectly, without giving an explicit statement of the utility function in question.
A method for “clarifying” a utility function specified with the method given in (1), so that in the limit of infinite information you obtain an explicit/concrete utility function.
A method for creating an agent that uses an indirectly specified utility function, such that:
The agent at any given time takes actions which are sensible given its current beliefs about its utility function.
The agent will try to find information that would help it to clarify its utility function.
The agent would resist attempts to change its utility function away from its indirectly specified utility function.
This problem statement is of course somewhat loose, but that is by necessity, since we don’t yet have a clear idea of what it really means to define utility functions “indirectly” (in the sense we are interested in here).
What’s interesting to me is that your partial solution sort of follows for free from this “definition”. It requires an initial state, an improvement process, and a way to act given the current state of the process. What you add after that is mostly the analogy to mathematical limits: the improvement being split into infinitely many steps that still give you a well-defined result in the limit.
It’s a pretty good application of the idea that getting the right definition is the hardest part (isn’t that the problem with human values, really?). From this it also follows that the potential problems with your solution probably come from the problem statement itself. Which is good to know when critically examining it.
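That triple (initial state, improvement process, acting on the current state) can even be written down as a small interface. A minimal Python sketch, where the names and the toy `Averaging` instance are my own invention, not anything from the post:

```python
from typing import Callable, Protocol

Utility = Callable[[str], float]  # an explicit utility function over outcomes

class UtilityDefiningProcess(Protocol):
    # the "type": an initial guess plus an improvement step whose
    # iterates U_0, U_1, ... should converge to the intended function
    def initial(self) -> Utility: ...
    def refine(self, current: Utility) -> Utility: ...

def current_estimate(p: UtilityDefiningProcess, i: int) -> Utility:
    # U_i: the function an agent would act on after i improvement steps
    u = p.initial()
    for _ in range(i):
        u = p.refine(u)
    return u

class Averaging:
    # trivial instance: every step halves the error toward a fixed target,
    # so the iterates converge and the limit is well-defined
    def __init__(self, target: Utility):
        self.target = target
    def initial(self) -> Utility:
        return lambda outcome: 0.0
    def refine(self, current: Utility) -> Utility:
        return lambda outcome: (current(outcome) + self.target(outcome)) / 2
```

Here `current_estimate(Averaging(t), i)` converges to `t` at rate 2^-i; the hard part is of course exhibiting a process whose limit is “human values”, not writing the interface.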
Consider this puzzle: I am able to talk and reason about ”human values”. However, I cannot define human values, or give you a definite description of what human values are – if I could do this, I could solve a large part of the AI alignment problem by writing down a safe utility function directly. I can also not give you a method for finding out what human values are – if I could do this, I could solve the problem of Inverse Reinforcement Learning. Moreover, I don’t think I could reliably recognize human values either – if you show me a bunch of utility functions, I might not be able to tell if any of them encodes human values. I’m not even sure if I could reliably recognize methods for finding out what human values are – if you show me a proposal for how to do Inverse Reinforcement Learning, I might not be able to tell whether the method truly learns human values.
One useful tool for arguing that we can’t define “human values” at the moment (one that isn’t explicitly used here, but which you probably know about) is thinking about what happens in the limit of optimization. Many utility functions are recognizably decent proxies for “human values” in the regime of low optimization; it’s when the optimization becomes enormous and unbounded that we lose our ability to foresee the consequences, due to logical non-omniscience.
Also note that the question of whether the resulting world (after unbounded optimization of the utility function) could even be recognized as going against “human values” is more debated.
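The low- versus high-optimization point can be made concrete with a toy example (my construction): a proxy that agrees with the true objective everywhere except on a thin sliver of the space. Mild optimization never finds the sliver, so the proxy looks fine; exhaustive optimization lands exactly on it.

```python
def true_utility(x):
    # what we actually want; maximized at x = 0.3
    return -(x - 0.3) ** 2

SPIKE = (0.8955, 0.8965)  # a thin sliver where the proxy goes wrong

def proxy(x):
    # exact everywhere except the sliver, which it wildly overvalues
    if SPIKE[0] < x < SPIKE[1]:
        return 100.0
    return true_utility(x)

coarse = [i / 100 for i in range(101)]        # mild optimization: 101 candidates
fine = [i / 100_000 for i in range(100_001)]  # heavy optimization: 100,001 candidates

weak = max(coarse, key=proxy)   # never sees the sliver, picks the true optimum
strong = max(fine, key=proxy)   # lands inside the sliver: a bad outcome

print(weak, true_utility(weak))
print(strong, true_utility(strong))
```

The proxy is literally exact on all but 0.1% of the space, yet the heavily optimized outcome is much worse than the weakly optimized one.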
In the “classical” picture, the utility function is fixed over time, and corresponds to an equation that at some point is typed into the AI’s source code. Unfortunately, we humans don’t really know what we want, so we cannot provide such an equation. If we try to propose a specific utility function directly, we typically get a function that would result in catastrophic consequences if it were pursued with arbitrary competence. This is worrying.
You obviously know this, but it could be valuable to add that this is an idealized situation that is “easier” than the one we will probably find ourselves in (where the utility function, if it is the right abstraction, is learned rather than fully specified).
It feels like you’re making the move of aiming for a simpler problem that still captures the core of the difficulty and confusion, in order to tackle it with minimal details to deal with. Which I’m on board with, but being explicit about this move could save you some time justifying some of your design choices.