When examining value learning approaches to AI Alignment, we run into two classes of problem: we want to understand how to elicit preferences, which is very difficult even in theory with unlimited computing power, and we want to know how to aggregate preferences stably and correctly, which is not just difficult but also runs into complicated social choice and normative ethical issues.
Many research programs say the second of these questions is less important than the first, especially if we expect continuous takeoff with many chances to course-correct, and a low likelihood of an AI singleton with decisive strategic advantage. For many, building an AI that can reliably extract and pursue the preferences of one person is good enough.
Christiano calls this ‘the narrow approach’ and sees it as a way to sidestep many of the ethical issues, including those around social choice; approaches that tackle those issues head-on are the ‘ambitious’ approaches.
We want to build machines that help us do the things we want to do, and to that end they need to be able to understand what we are trying to do and what instrumental values guide our behavior. To the extent that our “preferences” are underdetermined or inconsistent, we are happy if our systems at least do as well as a human, and make the kinds of improvements that humans would reliably consider improvements.
But it’s not clear that anything short of the maximally ambitious approach can solve the problem we ultimately care about.
I think the ambitious approach is still worth investigating, because it may well eventually need to be solved, and because it may need to be addressed in a more limited form even on the narrow approach (one could imagine an AGI with a lot of autonomy having to trade off the preferences of, say, a hundred different people). But even the ‘narrow’ approach raises difficult psychological issues about how to distinguish legitimate preferences from bias: questions of elicitation. In other words, the cognitive science issues around elicitation (distinguishing bias from legitimate preference) must be resolved for any kind of preference learning to work, and the social choice and ethical issues around preference aggregation need at least preliminary solutions for any alignment method that aims to apply to more than one person (even if final, provably correct solutions to aggregation are only needed when designing a singleton with decisive strategic advantage).
I believe I’ve identified two under-explored areas for improving the ability of reward modelling approaches to elicit and to aggregate human preferences. These are: using multiple information sources from a human (approval and actions), and the divergences between them, to help extract unbiased preferences; and using RL proxy agents in iterated voting, rather than a direct statistical method, to reach a consensus aggregation of preferences. Neither of these is a complete solution, of course, for reasons discussed e.g. here by Stuart Armstrong, but they could nonetheless help.
Improving preference elicitation: multiple information sources
Eliciting the unbiased preferences of an individual human is extremely difficult, for reasons given here.
The agent’s actions can be explained by their beliefs and preferences[1], and by their biases: by this, we mean the way in which the action selector differs from an unboundedly rational expected preference maximiser.
The results of the Occam’s razor paper imply that preferences (and beliefs, and biases) cannot be deduced separately from knowing the agent’s policy (and hence, a fortiori, from any observations of the agent’s behaviour).
...
To get around the impossibility result, we need “normative assumptions”: assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don’t need many of these, at least for identifying human preferences. We can label a few examples (“the anchoring bias, as illustrated in this scenario, is a bias”; “people are at least weakly rational”; “humans often don’t think about new courses of action they’ve never seen before”, etc...). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases* - these are the generalisations that it has achieved from D.
Yes, even on the ‘optimistic scenario’ we need external information of various kinds to ‘debias’. However, this external information can come from a human interacting with the AI, in the form of human approval of trajectories or actions taken or proposed by an AI agent, on the assumption that, since our stated and revealed preferences diverge, there will sometimes be differences between what we approve of and what we do that are due solely to differences in bias.
This is still technically external to observing the human’s behaviour, but it is essentially a second input channel for information about human preferences and biases. This only works, of course, if what humans approve of differs from what they actually do in a way influenced by bias (otherwise approval gives you the same information as actions, which helps with accuracy but not with debiasing; see here), which is the case at least some of the time.
In other words, the agent’s beliefs and preferences are unchanged whether it acts or approves, but the ‘approval selector’ sometimes differs from the ‘action selector’. Based on what does and does not change between the two channels, you can try to infer what originated from legitimate beliefs and preferences and what originated from variation between the approval and action selectors, which must be bias.
So, for example, if we conducted a principal component analysis on π, we would expect that the components would all be mixes of preferences/beliefs/biases.
So a PCA performed on the approval would likewise produce a mix of beliefs, preferences, and (different) biases. Underlying preferences are, by specification, equally represented in human actions and in human approval of actions taken (they are your preferences whichever channel expresses them), but many biases don’t exhibit this pattern: for example, we discount the future more steeply in our revealed preferences than in our stated preferences. What we approve of typically represents a less (or at least differently) biased response than what we actually do.
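To make the ‘shared preferences, channel-specific biases’ picture concrete, here is a minimal toy sketch, not anything from the literature: a single utility-curvature parameter is forced to be shared between the action and approval channels, while each channel gets its own discount rate standing in for its bias, so that fitting both channels jointly pushes the cross-channel divergence into the bias parameters rather than the preference parameter. Everything in it (the exponential discounting, the Boltzmann choice model, the numbers) is an illustrative assumption.

```python
# Toy sketch: one parameter (utility curvature) is shared across the action and
# approval channels, while each channel has its own temporal discount rate
# standing in for its bias. All model choices here are illustrative assumptions.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N = 500

# Each trial is a choice between a smaller-sooner and a larger-later reward.
amounts = rng.uniform(1.0, 10.0, size=(N, 2))
delays = np.stack([rng.uniform(0.0, 2.0, N),           # option 0 arrives soon
                   rng.uniform(5.0, 20.0, N)], axis=1)  # option 1 arrives later

def choice_probs(curvature, discount, temp=2.0):
    """Boltzmann-rational choice probabilities over discounted, curved utilities."""
    values = (amounts ** curvature) * np.exp(-discount * delays)
    logits = values / temp
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Ground truth: a shared "preference" (utility curvature), but the action channel
# discounts the future more steeply than the approval channel (its "bias").
TRUE_CURVATURE, ACTION_DISCOUNT, APPROVAL_DISCOUNT = 0.8, 0.30, 0.05
actions = np.array([rng.choice(2, p=p) for p in choice_probs(TRUE_CURVATURE, ACTION_DISCOUNT)])
approvals = np.array([rng.choice(2, p=p) for p in choice_probs(TRUE_CURVATURE, APPROVAL_DISCOUNT)])

def neg_log_lik(params):
    curvature, d_act, d_appr = params
    idx = np.arange(N)
    p_act = choice_probs(curvature, d_act)
    p_appr = choice_probs(curvature, d_appr)
    return -(np.log(p_act[idx, actions]).sum() + np.log(p_appr[idx, approvals]).sum())

fit = minimize(neg_log_lik, x0=[1.0, 0.1, 0.1],
               bounds=[(0.1, 2.0), (0.0, 1.0), (0.0, 1.0)])
print("recovered (curvature, action discount, approval discount):", np.round(fit.x, 2))
```

The point is not this particular model but the factoring: whatever is forced to be shared across channels plays the role of preference, and whatever has to differ between channels to explain the data plays the role of bias.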
There has already been research on combining reward-model information from multiple sources to infer a better overall reward model, but not, as far as I know, on treating actions and approval specifically as differently biased sources of information.
CIRL ought to extract our revealed preferences (since it learns from our behavioural policy), while a method like reinforcement learning from human preferences should extract our stated preferences. That might be a place to start: at minimum, to validate that there really are relevant differences caused by differently strong biases in our stated vs revealed preferences, and that the two methods actually end up with different policies.
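For that validation step you don’t even need the full training pipelines to hand: given any two policies over the same states, one learned from demonstrations and one learned from preference comparisons, you can measure where they disagree. A minimal sketch follows; the policy arrays are random placeholders standing in for the outputs of the two pipelines.

```python
# Hypothetical check: do a "revealed preference" policy (e.g. learned from
# demonstrations, as in CIRL) and a "stated preference" policy (e.g. learned from
# preference comparisons) actually differ, and on which states? The policy arrays
# below are random placeholders standing in for the outputs of the two pipelines.
import numpy as np

def policy_divergence(pi_revealed, pi_stated, eps=1e-8):
    """Per-state Jensen-Shannon divergence between two (n_states, n_actions) policies."""
    p = np.clip(pi_revealed, eps, 1.0)
    q = np.clip(pi_stated, eps, 1.0)
    m = 0.5 * (p + q)
    return 0.5 * ((p * np.log(p / m)).sum(axis=1) + (q * np.log(q / m)).sum(axis=1))

rng = np.random.default_rng(1)
n_states, n_actions = 100, 4

# Placeholder policies: identical except on a handful of "bias-affected" states.
logits = rng.normal(size=(n_states, n_actions))
pi_revealed = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
pi_stated = pi_revealed.copy()
biased_states = rng.choice(n_states, size=10, replace=False)
pi_stated[biased_states] = np.roll(pi_stated[biased_states], 1, axis=1)

div = policy_divergence(pi_revealed, pi_stated)
print("mean JSD across states:", round(float(div.mean()), 4))
print("states where stated and revealed policies disagree most:", np.argsort(-div)[:5])
```

If the per-state divergence is near zero everywhere, the two channels are redundant for debiasing purposes; if it is concentrated in particular kinds of states, those are the places to look for bias.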
The goal here would be to have some kind of ‘dual channel’ preference learner that extracts beliefs and preferences from biased actions and approval by examining what varies. I’m sure you’d still need labelling and explicit information about what counts as a bias, but there might need to be a lot less than with a single information source. How much less (i.e. how much extra information you get from such divergences) seems like an empirical question. A useful first step would be to find out how often divergences between stated and revealed preferences actually change the policies learned by agents that infer human preferences from actions versus from approval. Stuart Armstrong:
In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem to be quite different from the anchoring bias, and quite different from each other. At the very least, they seem to be of different “type signature”.
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other.
What I’ve suggested should still help at least somewhat in the pessimistic scenario: unless preferences and beliefs vary more between the approval and action channels than the biases do, you can still gain some information on the underlying preferences and beliefs by seeing how approval and actions differ.
Of the difficult examples Armstrong gives, racial bias at least varies between actions and approval. Implementing different reward modelling algorithms and experimenting with them to find ways to extract unbiased preferences from multiple information sources might be a useful research agenda.
There has already been research on using multiple information sources to improve the accuracy of preference learning (Reward-rational implicit choice), but not specifically on using the divergences between different sources of information from the same agent to learn about the agent’s unbiased preferences.
Improving preference aggregation: iterated voting games
In part because of arguments like these, there has been less focus on the aggregation side of things than on the direct preference learning side.
However, I think it is important to get on the right track early: even if we never have cause to build a powerful singleton AI that has to aggregate the preferences of all of humanity, there will still probably be smaller-scale situations where the preferences of several people need to be aggregated or traded off. Shifting a preference learner from a single human to a small group of humans could produce erroneous results due to distributional shift, potentially causing alignment failures, so even if we aren’t trying for maximally ambitious value learning it might still be worth investigating preference aggregation.
There has already been some research on preference aggregation for AIs learning human values, specifically in the context of kidney exchanges:

We performed statistical modeling of participants’ pairwise comparisons between patient profiles in order to obtain weights for each profile. We used the Bradley-Terry model, which treats each pairwise comparison as a contest between a pair of players
We have shown one way in which moral judgments can be elicited from human subjects, how those judgments can be statistically modelled, and how the results can be incorporated into the algorithm. We have also shown, through simulations, what the likely effects of deploying such a prioritization system would be, namely that under-demanded pairs would be significantly impacted but little would change for others. We do not make any judgment about whether this conclusion speaks in favor of or against such prioritization, but expect the conclusion to be robust to changes in the prioritization such as those that would result from a more thorough process, as described in the previous paragraph.
The Kidney exchange paper elicited preferences from human subjects (using repeated pairwise comparisons) and then aggregated them using the Bradley-Terry model. You couldn’t use such a simple statistical method to aggregate quantitative preferences over continuous action spaces, like the preferences that would be learned from a human via a complex reward model. Also, any time you try to use some specific one-shot voting mechanism you run into various impossibility theorems which seem to force you to give up some desirable property.
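For concreteness, the Bradley-Terry step itself is simple: each profile gets a latent strength, and the probability that profile i beats profile j in a pairwise comparison is a logistic function of the difference in strengths. Here is a minimal maximum-likelihood fit on synthetic comparisons, a sketch of the general technique rather than the kidney exchange paper’s implementation.

```python
# Minimal Bradley-Terry fit from pairwise comparisons (synthetic data; a sketch of
# the general technique, not the kidney exchange paper's code).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n_profiles = 6
true_strengths = rng.normal(size=n_profiles)

# Simulate comparisons as (winner, loser) index pairs.
pairs = rng.choice(n_profiles, size=(2000, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]                # drop self-comparisons
p_first_wins = 1.0 / (1.0 + np.exp(-(true_strengths[pairs[:, 0]] - true_strengths[pairs[:, 1]])))
first_wins = rng.random(len(pairs)) < p_first_wins
winners = np.where(first_wins, pairs[:, 0], pairs[:, 1])
losers = np.where(first_wins, pairs[:, 1], pairs[:, 0])

def neg_log_lik(strengths):
    # P(winner beats loser) = sigmoid(strength_winner - strength_loser);
    # -log sigmoid(x) == log(1 + exp(-x)).
    diff = strengths[winners] - strengths[losers]
    return np.log1p(np.exp(-diff)).sum()

# Strengths are only identified up to an additive constant; a tiny penalty on the
# sum pins the scale without affecting the fitted differences.
fit = minimize(lambda s: neg_log_lik(s) + 1e-3 * s.sum() ** 2, x0=np.zeros(n_profiles))

print("true strengths (centred):     ", np.round(true_strengths - true_strengths.mean(), 2))
print("recovered strengths (centred):", np.round(fit.x - fit.x.mean(), 2))
```

The fitted strengths play the role of the per-profile weights described in the quote above; the paper’s actual elicitation and modelling are of course more involved than this.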
One approach that may be more robust to errors in a voting mechanism, and more easily scalable to complex preference profiles, is to use RL not just for the preference elicitation but also for the preference aggregation. The idea is that we embrace the inevitable impossibility results (such as the Arrow and Gibbard-Satterthwaite theorems) and treat agents’ ability to vote strategically as an opportunity to reach stable outcomes.
This paper uses very simple Q-learning agents with a few different exploration policies (epsilon-greedy, greedy, and upper confidence bound) in an iterated voting game, and gets behaviour that seems sensible. (Note the similarity to, and difference from, the moral parliament, where a particular one-shot voting rule is justified a priori and then used.)
The fact that this paper exists is a good sign: it’s very recent and the methods it uses are very simple (it’s pretty much just a proof of concept, as the authors state), which tells me there’s a lot of room for combining more sophisticated RL with better voting methods.
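To give a flavour of that kind of setup, here is a from-scratch toy, not the paper’s exact protocol, environments, or hyperparameters: each voter repeatedly casts a vote for one candidate, a plurality winner is declared each round, each voter receives its own utility for that winner as reward, and it updates a per-candidate Q-value with epsilon-greedy exploration. The thing to watch is whether the round-by-round winner settles down.

```python
# Toy iterated plurality vote with epsilon-greedy Q-learning voters.
# A from-scratch sketch of the general idea, not the referenced paper's protocol.
import numpy as np

rng = np.random.default_rng(3)
n_voters, n_candidates, n_rounds = 9, 4, 300
alpha, epsilon = 0.1, 0.1

# Each voter has a private utility for each candidate (random here, standing in
# for utilities that would come from a learned reward model).
utilities = rng.uniform(0.0, 1.0, size=(n_voters, n_candidates))
# Q[v, c]: voter v's estimate of how good it is to cast a vote for candidate c.
Q = np.zeros((n_voters, n_candidates))

winners = []
for _ in range(n_rounds):
    # Epsilon-greedy vote selection for each voter.
    greedy = Q.argmax(axis=1)
    explore = rng.random(n_voters) < epsilon
    votes = np.where(explore, rng.integers(0, n_candidates, n_voters), greedy)

    # Plurality winner (ties broken by lowest index).
    counts = np.bincount(votes, minlength=n_candidates)
    winner = counts.argmax()
    winners.append(int(winner))

    # Each voter is rewarded with its own utility for the winner, and updates
    # the Q-value of the vote it actually cast.
    rewards = utilities[:, winner]
    idx = np.arange(n_voters)
    Q[idx, votes] += alpha * (rewards - Q[idx, votes])

print("last 20 winners:", winners[-20:])
print("winner's total utility:", round(float(utilities[:, winners[-1]].sum()), 2),
      "vs best possible:", round(float(utilities.sum(axis=0).max()), 2))
```

With greedy or UCB exploration in place of epsilon-greedy, or a different voting rule in place of plurality, the same loop structure applies; that is the sense in which RL agents plus an iterated mechanism can be varied and compared.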
Combining elicitation and aggregation
Having elicited preferences from each individual human (using methods like those above to ‘debias’), we obtain a proxy agent representing each individual’s preferences. Then these agents can be placed into an iterated voting situation until a convergent answer is reached.
That seems like the closest practical approximation to a CEV of a group of people that could be constructed with anything close to current methods: a pipeline from observed behaviour and elicited approval to a final aggregated decision about what to do based on overall preferences. Since it’s a value learning framework that extends to a group of any size, and is somewhat indirect, you might call it a Coherent Extrapolated Framework (CEF), as I suggested last year.
So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:
1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals. Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step towards using multi-channel information to separate biases from preferences effectively.
2. Create ‘proxy agents’ using the reward model developed for each human (this step is where intent-aligned amplification can potentially occur).
3. Place the proxies in an iterated voting situation that tends to produce sensible convergent results. The use of RL proxies here can be compared to the use of human proxies in liquid democracy. Which voting mechanisms tend to work well in iterated situations with RL agents can be determined in separate experiments, probably with purely artificial agents.
4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper.
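Putting the four steps together, the pipeline has a simple overall shape, sketched below. Everything in the sketch is a placeholder: elicit_reward_model stands in for the multi-channel reward modelling of step 1, the proxies are myopic best-responders rather than learning agents, and the ‘stable for K rounds’ stopping rule is an assumed convergence criterion, not the one used in the paper linked in step 4.

```python
# High-level shape of the elicit -> proxy -> iterated-vote pipeline. All
# components are hypothetical placeholders for the real steps described above.
import numpy as np

rng = np.random.default_rng(4)
N_PEOPLE, N_OPTIONS, STABLE_ROUNDS = 7, 5, 10

def elicit_reward_model(person_id):
    """Step 1 placeholder: would combine actions and approval into debiased utilities."""
    return rng.uniform(0.0, 1.0, size=N_OPTIONS)

def make_proxy(utilities):
    """Step 2: a proxy agent that votes on its human's behalf (step 3 behaviour inside)."""
    def vote(counts):
        if counts is None:                        # first round: vote truthfully
            return int(np.argmax(utilities))
        top_two = np.argsort(-counts)[:2]         # afterwards: best-respond between
        return int(top_two[np.argmax(utilities[top_two])])  # the two front-runners
    return vote

proxies = [make_proxy(elicit_reward_model(p)) for p in range(N_PEOPLE)]

# Step 4: iterate until the winner has been stable for STABLE_ROUNDS consecutive rounds.
counts, history = None, []
while len(history) < STABLE_ROUNDS or len(set(history[-STABLE_ROUNDS:])) > 1:
    votes = [proxy(counts) for proxy in proxies]
    counts = np.bincount(votes, minlength=N_OPTIONS)
    history.append(int(counts.argmax()))

print("rounds run:", len(history), "| winner:", history[-1])
```

The analogy in step 3 is to liquid democracy’s human proxies: each reward model acts as a standing delegate for its person, and the voting layer is agnostic about how that delegate was produced.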
This seems like a reasonable procedure for extending a method that is aligned to one human’s preferences (steps 1 and 2) to produce sensible results when trying to align to an aggregate of human preferences (steps 3 and 4), and it reduces reliance on the specific features of any one voting method. Other than the insight that multiple channels of information might help, all the standard unsolved problems with preference learning from a single human remain.
Even though we can’t yet align an AGI to one human’s preferences, trying to think about how to aggregate human preferences in a way that is scalable isn’t premature, as has sometimes been claimed.
In many ‘non-ambitious’ hypothetical settings where we aren’t trying to build an AGI sovereign over the whole world (for example, designing a powerful AI to govern the operations of a hospital), we still need to be able to aggregate preferences sensibly and stably. This method is suited to such intermediate scales, since it doesn’t approach preference aggregation from a ‘final’ ambitious value-learning perspective but instead treats aggregation the same way we treat elicitation: with an RL-based, iterative approach to reaching a result.
However, if you did want to use such a method to try to produce the fabled ‘final utility function of all humanity’, it might not give you humanity’s CEV, since some normative assumptions (that preferences count equally, and in the way given by the voting mechanism) are built in. By analogy with CEV, I called the idealized result of this method a coherent extrapolated framework (CEF). It is a more normatively direct way of aggregating values than CEV, since you fix a particular method of aggregating preferences in advance: it extrapolates from a voting framework, rather than from our volition more broadly (and vaguely) defined, hence CEF.