So I agree with Paul’s comment that there’s another motivation for work on preference learning besides the two you identify. But even if I take on what I believe on your views of the risks, it seems like there is something very close to preference learning that is still helpful to existential safety. I have sometimes called it the specification problem: given a desired behavior, how do you provide a training signal to an AI system such that it is incentivized to behave that way? Typical approaches include imitation learning, learning from comparisons / preferences, learning from corrections, etc.
Before I explain why I think this should be useful even on your views, let me try clarifying the field more. Looking at the papers you list as exemplars of PL:
I think the first, second, fourth and sixth are clear exemplars of work tackling the specification problem (whether or not the authors would put it that way themselves). The third is unclear (I wouldn’t have put it in PL, nor with the specification problem, though I might be forgetting what’s in it). The fifth is mostly PL and less about the specification problem; I am less excited about that paper as a result.
Okay, so why should this be useful even on (my model of) your views? You say that you want to anticipate, legitimize and fulfill governance demands. I see the combination of <specification problem field> and OODR as one of the best ways of fulfilling governance demands (which can then be used to legitimize them in advance, if you are able to anticipate them). In particular, most governance demands will look like “please make your AI systems satisfy property P”, where P is some phrase in natural language that’s fuzzy and can’t immediately be grounded (for example, fairness). It seems to me that given such a demand, a natural way of solving it is to figure out which behaviors do and don’t satisfy P, and then use your solutions to the specification problem to incentivize your AI system to satisfy P, and then use OODR to ensure that they actually satisfy P in all situations. I expect this to work in the next decade to e.g. ensure that natural language systems almost never deceive people into thinking they are human.
So I agree with Paul’s comment that there’s another motivation for work on preference learning besides the two you identify. But even if I take on what I believe on your views of the risks, it seems like there is something very close to preference learning that is still helpful to existential safety. I have sometimes called it the specification problem: given a desired behavior, how do you provide a training signal to an AI system such that it is incentivized to behave that way? Typical approaches include imitation learning, learning from comparisons / preferences, learning from corrections, etc.
Before I explain why I think this should be useful even on your views, let me try clarifying the field more. Looking at the papers you list as exemplars of PL:
I think the first, second, fourth and sixth are clear exemplars of work tackling the specification problem (whether or not the authors would put it that way themselves). The third is unclear (I wouldn’t have put it in PL, nor with the specification problem, though I might be forgetting what’s in it). The fifth is mostly PL and less about the specification problem; I am less excited about that paper as a result.
Okay, so why should this be useful even on (my model of) your views? You say that you want to anticipate, legitimize and fulfill governance demands. I see the combination of <specification problem field> and OODR as one of the best ways of fulfilling governance demands (which can then be used to legitimize them in advance, if you are able to anticipate them). In particular, most governance demands will look like “please make your AI systems satisfy property P”, where P is some phrase in natural language that’s fuzzy and can’t immediately be grounded (for example, fairness). It seems to me that given such a demand, a natural way of solving it is to figure out which behaviors do and don’t satisfy P, and then use your solutions to the specification problem to incentivize your AI system to satisfy P, and then use OODR to ensure that they actually satisfy P in all situations. I expect this to work in the next decade to e.g. ensure that natural language systems almost never deceive people into thinking they are human.