To get around the impossibility result, we need “normative assumptions”: assumptions about the preferences (or beliefs, or biases) of the agent that cannot be deduced fully from observations.
Under the optimistic scenario, we don’t need many of these, at least for identifying human preferences. We can label a few examples (“the anchoring bias, as illustrated in this scenario, is a bias”; “people are at least weakly rational”; “humans often don’t think about new courses of action they’ve never seen before”, and so on). Call this labelled data[2] D.
The algorithm now constructs categories preferences*, beliefs*, and biases*: these are the generalisations it has achieved from D.
Yes, even in the ‘optimistic scenario’ we need external information of various kinds to ‘debias’. However, this external information can come from a human interacting with the AI, in the form of human approval of trajectories or actions taken or proposed by an AI agent. The assumption is that, since our stated and revealed preferences diverge, there will sometimes be differences between what we approve of and what we do that are due solely to differences in bias.
This is still technically external to observing the human’s behaviour, but it is essentially a second input channel for information about human preferences and biases. This only works, of course, if humans tend to approve of different things than they actually do, in a way influenced by bias (otherwise you have the same information as you’d get from actions, which helps with accuracy but not with debiasing, see here), which is the case at least some of the time.
In other words, beliefs and preferences are unchanged whether the agent acts or approves, but the ‘approval selector’ sometimes differs from the ‘action selector’. Based on what does and does not change, you can try to infer what originated from legitimate beliefs and preferences and what originated from variation between the approval and action selectors, which must be bias.
So, for example, if we conducted a principal component analysis (PCA) on the human’s policy π, we would expect the components to all be mixes of preferences, beliefs, and biases.
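As a toy illustration of that mixing (the linear latent model, the mixing matrix, and all numbers here are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent factors behind behaviour: preference, belief, bias.
latents = rng.normal(size=(500, 3))

# Observed behavioural features are linear mixtures of all three factors.
mixing = np.array([[0.9, 0.4, 0.3],
                   [0.2, 0.8, 0.5],
                   [0.6, 0.1, 0.7]])
behaviour = latents @ mixing.T

# PCA via SVD on the centred data.
centred = behaviour - behaviour.mean(axis=0)
_, _, components = np.linalg.svd(centred, full_matrices=False)

# Express each principal component in terms of the latent factors:
# generically, every component loads on all three latents, so no
# component corresponds to "preferences" on its own.
loadings = components @ mixing
print(np.round(np.abs(loadings), 2))
```

Because every behavioural feature depends on all three latent factors, none of the principal components isolates preferences by itself, which is the point of the claim above.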
So a PCA performed on the approval data would produce a mix of beliefs, preferences, and (different) biases. Underlying preferences are, by stipulation, equally represented in human actions and in human approval of actions (since they are your preferences no matter what), but many biases don’t exhibit this pattern. For example, we discount the future more steeply in our revealed preferences than in our stated preferences. What we approve of typically represents a less (or at least differently) biased response than what we actually do.
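That discounting divergence is easy to make concrete. A minimal sketch, with purely hypothetical rewards and discount rates:

```python
# The same underlying preference (more money is better), filtered through
# two different discount curves for the two channels.

def discounted_value(reward, delay, rate):
    """Exponential discounting: value = reward * rate ** delay."""
    return reward * rate ** delay

small_soon = (50, 1)     # a $50 reward after 1 time step
large_later = (100, 10)  # a $100 reward after 10 time steps

act_rate = 0.80      # revealed preferences: steep discounting
approve_rate = 0.97  # stated preferences: mild discounting

acts_on = max([small_soon, large_later],
              key=lambda r: discounted_value(r[0], r[1], act_rate))
approves_of = max([small_soon, large_later],
                  key=lambda r: discounted_value(r[0], r[1], approve_rate))

print(acts_on)      # (50, 1): the impulsive choice wins in action
print(approves_of)  # (100, 10): the patient choice wins in approval
```

The divergence between the two channels here is due entirely to the bias (the discount curve), not to any difference in the underlying preference.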
There has already been research on combining information about reward models from multiple sources to infer a better overall reward model, but not, as far as I know, on treating actions and approval specifically as differently biased sources of information.
CIRL ought to extract our revealed preferences (since it’s based on the behavioural policy), while a method like reinforcement learning from human preferences should extract our stated preferences. That might be a place to start: at minimum, validating that there are relevant differences in our stated vs revealed preferences caused by differently strong biases, and that the two methods actually end up with different policies.
The goal here would be a ‘dual channel’ preference learner that extracts beliefs and preferences from biased actions and approval by examining what varies between the two. I’m sure you’d still need labelling and explicit information about what counts as a bias, but you might need a lot less than with a single information source. How much less (i.e. how much extra information you get from such divergences) seems like an empirical question. A useful first step would be to find out how common divergences between stated and revealed preferences are, and whether they actually influence the policies learned by agents that infer human preferences from actions vs approval.
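A minimal sketch of that dual-channel inference, assuming (purely for illustration) that scores in each channel are linear in situation features, with preference weights shared across channels and bias weights channel-specific:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true preference weights over five features, shared
# across both channels; each channel adds its own bias to some features.
true_pref     = np.array([1.0, -0.5, 2.0, 0.0, 0.8])
action_bias   = np.array([0.0, 0.0, -1.5, 0.9, 0.0])
approval_bias = np.array([0.0, 0.0,  0.4, -0.7, 0.0])

X = rng.normal(size=(200, 5))               # features of observed situations
y_act = X @ (true_pref + action_bias)       # scores implied by actions
y_app = X @ (true_pref + approval_bias)     # scores implied by approval

# Fit each channel separately with least squares.
w_act, *_ = np.linalg.lstsq(X, y_act, rcond=None)
w_app, *_ = np.linalg.lstsq(X, y_app, rcond=None)

# Where the channels agree, the weight looks like shared preference;
# where they diverge, the feature is flagged as bias-contaminated.
divergence = np.abs(w_act - w_app)
is_biased = divergence > 0.1
print(is_biased)  # [False False  True  True False]
```

The recovery is clean only because the toy data are noiseless; with real data you would still need normative assumptions to decide which channel, if either, is less biased on the flagged features.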
In the pessimistic scenario, human preferences, biases, and beliefs are twisted together in a far more complicated way, and cannot be separated by a few examples.
In contrast, take examples of racial bias, hindsight bias, illusion of control, or naive realism. These biases all seem quite different from the anchoring bias, and quite different from each other. At the very least, they seem to have different “type signatures”.
So, under the pessimistic scenario, some biases are much closer to preferences than generic biases (and generic preferences) are to each other.
What I’ve suggested should still help at least somewhat in the pessimistic scenario: unless preferences and beliefs vary more between approval and actions than biases do, you can still gain some information about underlying preferences and beliefs by seeing how approval and actions differ.
Of the difficult examples you gave, racial bias at least varies between actions and approval. Implementing different reward modelling algorithms and messing around with them to try to find ways to extract unbiased preferences from multiple information sources might be a useful research agenda.
There has already been research on using multiple information sources to improve the accuracy of preference learning (Reward-rational implicit choice), but not specifically on using the divergences between different sources of information from the same agent to learn about that agent’s unbiased preferences.
Glad you think so! I think that methods like using multiple information sources might be a useful way to reduce the number of (potentially mistaken) normative assumptions you need in order to model a single human’s preferences.
The other area of human preference learning where you seem, inevitably, to need a lot of strong normative assumptions is preference aggregation. Suppose we have elicited the preferences of many individual humans and are now trying to aggregate them, with each human’s preferences represented by a separate model. I think the same basic principle applies: you can reduce the normative assumptions you need by using a more sophisticated voting mechanism, in this case one that treats agents’ ability to vote strategically as an opportunity to reach stable outcomes.
I talk about this idea here. As with using approval/actions to improve the elicitation of an individual’s preferences, you can’t avoid normative assumptions entirely by using a more complicated aggregation method, but perhaps you end up having to make fewer of them. Very speculatively, if you can combine a robust method of eliciting preferences (with few inbuilt assumptions) with a similarly robust method of aggregating preferences, you’re on your way to a full solution to ambitious value learning.
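A toy sketch of treating strategic voting as a route to a stable outcome, as iterated best response under plurality (the candidates, blocs, and bloc sizes are all invented):

```python
from collections import Counter

# Hypothetical profile: three voter blocs ranking candidates A, B, C
# (best first), with made-up bloc sizes.
blocs = {
    "left":   (4, ["A", "B", "C"]),
    "centre": (3, ["B", "C", "A"]),
    "right":  (2, ["C", "B", "A"]),
}

def plurality_winner(votes):
    """votes: mapping bloc name -> (bloc size, candidate voted for)."""
    tally = Counter()
    for size, choice in votes.values():
        tally[choice] += size
    return tally.most_common(1)[0][0]

# Sincere voting: every bloc votes its top-ranked candidate.
votes = {name: (size, ranking[0]) for name, (size, ranking) in blocs.items()}
sincere = plurality_winner(votes)  # "A" (4 > 3 > 2)

# Iterated best response: a bloc switches its vote whenever doing so
# makes a candidate it prefers to the current winner win instead.
changed = True
while changed:
    changed = False
    winner = plurality_winner(votes)
    for name, (size, ranking) in blocs.items():
        for alt in ranking[:ranking.index(winner)]:  # preferred to winner
            trial = dict(votes)
            trial[name] = (size, alt)
            if plurality_winner(trial) == alt:
                votes, changed = trial, True
                break
        if changed:
            break

stable = plurality_winner(votes)  # "C": centre defects to its second choice
print(sincere, stable)
```

The loop terminates when no bloc can change the winner to someone it prefers, i.e. the vote profile is a pure-strategy equilibrium. The point is only that strategic adjustment can settle into a stable outcome, not that this outcome is optimal.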
Thanks! Useful insights in your post, to mull over.