So to sum up, a very high-level summary of the steps in this method of preference elicitation and aggregation would be:
1. With a mixture of normative assumptions and multi-channel information (approval and actions) as inputs, use a reward-modelling method to elicit the debiased preferences of many individuals (see the first sketch after this list).
    - Determining whether there actually are significant differences between stated and revealed preferences when performing reward modelling is the first step towards using multi-channel information to effectively separate biases from preferences.
2. Create ‘proxy agents’ from the reward model developed for each human (this is the step where intent-aligned amplification can potentially occur; second sketch below).
3. Place the proxies in an iterated voting situation that tends to produce sensible convergent results (third sketch below). The use of RL proxies here can be compared to the use of human proxies in liquid democracy.
    - Which voting mechanisms tend to work well in iterated situations with RL agents can be determined in other experiments (probably with purely artificial agents).
4. Run the voting mechanism until an unambiguous winner is decided, using methods like those given in this paper.
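To make step 1 a bit more concrete, here is a minimal, purely illustrative sketch of what combining the two channels might look like: a linear reward model over hand-picked features is fit separately to a stated channel (ratings) and a revealed channel (pairwise choices), and the feature where the two fits disagree most is flagged as a candidate bias. The model class, toy data, and fitting procedures are my own placeholder assumptions, not the reward-modelling method the post has in mind.

```python
import numpy as np

# Illustrative sketch of step 1: fit a reward model for one person from two
# channels -- stated approvals (ratings of outcomes) and revealed choices
# (which of two outcomes they actually picked) -- and compare the fits.
# The linear model, feature space, and toy data are assumptions of this sketch.

rng = np.random.default_rng(0)
n_features = 3

def fit_stated(features, ratings):
    # Least-squares fit of ratings onto reward weights.
    w, *_ = np.linalg.lstsq(features, ratings, rcond=None)
    return w

def fit_revealed(chosen, rejected, steps=2000, lr=0.1):
    # Bradley-Terry-style logistic fit from pairwise choices.
    w = np.zeros(n_features)
    diff = chosen - rejected                      # (n_pairs, n_features)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))     # P(chosen preferred under w)
        w += lr * diff.T @ (1.0 - p) / len(diff)  # gradient ascent on log-likelihood
    return w

# Toy data: ratings track the 'true' weights, but choices are biased on feature 2.
true_w = np.array([1.0, -0.5, 0.2])
X = rng.normal(size=(50, n_features))
ratings = X @ true_w + 0.1 * rng.normal(size=50)

A = rng.normal(size=(200, n_features))
B = rng.normal(size=(200, n_features))
biased_w = true_w + np.array([0.0, 0.0, 1.5])     # the action channel over-weights feature 2
choose_a = ((A - B) @ biased_w + rng.logistic(size=200)) > 0
chosen = np.where(choose_a[:, None], A, B)
rejected = np.where(choose_a[:, None], B, A)

w_stated = fit_stated(X, ratings)
w_revealed = fit_revealed(chosen, rejected)
print("stated  :", np.round(w_stated, 2))
print("revealed:", np.round(w_revealed, 2))
print("biggest stated/revealed gap on feature:", int(np.argmax(np.abs(w_stated - w_revealed))))
```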
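Step 2 then just wraps each person's learned (ideally debiased) reward model in an agent that can vote on their behalf. The linear reward model and the approval-style ballot below are again illustrative assumptions rather than a specific proposal:

```python
from dataclasses import dataclass
import numpy as np

# Illustrative sketch of step 2: a 'proxy agent' holds the reward model learned
# for one person and votes on their behalf over candidate outcomes.

@dataclass
class ProxyAgent:
    name: str
    weights: np.ndarray                      # learned reward weights for this person

    def reward(self, outcome: np.ndarray) -> float:
        return float(outcome @ self.weights)

    def ballot(self, candidates: np.ndarray) -> np.ndarray:
        # Approve every candidate whose modelled reward is at least the mean.
        scores = candidates @ self.weights
        return (scores >= scores.mean()).astype(int)

# One proxy per human, built from that human's reward model.
candidates = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.5, 0.5, 0.0]])
alice = ProxyAgent("alice", np.array([1.0, -0.5, 0.2]))
print(alice.ballot(candidates))              # -> [1 0 1]
```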
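Finally, steps 3 and 4 put those proxies into an iterated vote and run it until the winner stabilises. In the sketch below each proxy is represented just by its weight vector (to keep the snippet self-contained), and the viability rule (above-average support) and stopping rule (same winner for several rounds) are stand-ins for whichever mechanism the step-3 experiments actually favour, not the mechanism from the linked paper:

```python
import numpy as np

# Illustrative sketch of steps 3-4: each proxy repeatedly casts a plurality
# vote for its favourite among the candidates that stayed 'viable' in the
# previous round, until the winner is unchanged for a few rounds.

def iterated_vote(proxy_weights, candidates, stable_rounds=3, max_rounds=100):
    viable = np.ones(len(candidates), dtype=bool)
    history = []
    for _ in range(max_rounds):
        tallies = np.zeros(len(candidates))
        for w in proxy_weights:
            scores = candidates @ w
            scores[~viable] = -np.inf              # only viable options get votes
            tallies[int(np.argmax(scores))] += 1
        winner = int(np.argmax(tallies))
        history.append(winner)
        if len(history) >= stable_rounds and len(set(history[-stable_rounds:])) == 1:
            return winner, history                 # unambiguous, stable winner
        viable = tallies >= tallies[tallies > 0].mean()  # drop weakly supported options
    return history[-1], history

rng = np.random.default_rng(1)
candidates = rng.normal(size=(5, 3))               # 5 candidate outcomes, 3 features
proxies = [rng.normal(size=3) for _ in range(11)]  # one weight vector per person
winner, rounds = iterated_vote(proxies, candidates)
print("winner:", winner, "after", len(rounds), "rounds")
```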
This seems like a reasonable procedure for extending a method that aligns to one human’s preferences (steps 1 and 2) so that it produces sensible results when aligning to an aggregate of human preferences (steps 3 and 4), and it reduces reliance on the specific features of any one voting method. Other than the insight that multiple channels of information might help, however, all the standard unsolved problems with preference learning from a single human remain.
Even though we can’t yet align an AGI to one human’s preferences, thinking about how to aggregate human preferences in a scalable way isn’t premature, as has sometimes been claimed.
In many ‘non-ambitious’ hypothetical settings where we aren’t trying to build an AGI sovereign over the whole world (for example, designing a powerful AI to govern the operations of a hospital), we still need to be able to aggregate preferences sensibly and stably. This method would do well at such intermediate scales, as it doesn’t approach preference aggregation from a ‘final’ ambitious value-learning perspective but instead treats aggregation the same way we treat elicitation: as an RL-based, iterative process for reaching a result.
However, if you did want to use such a method to try to produce the fabled ‘final utility function of all humanity’, it might not give you humanity’s CEV, since some normative assumptions (that preferences count equally, and in the way given by the voting mechanism) are built in. By analogy with CEV, I called the idealized result of this method a coherent extrapolated framework (CEF). This is a more normatively direct method of aggregating values than CEV, since you fix a particular method of aggregating preferences in advance: it extrapolates from a voting framework rather than from our volition, more broadly (and vaguely) defined; hence CEF.