Suppose we wanted the AI to be ideologically neutral and free from human biases, just telling the objective truth to the extent possible. Do you think achieving something like that would be possible in the longer term, and if so through what kinds of techniques?
I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of “ideologically neutral”. (You’ll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.) There are still a number of challenging obstacles, including being able to correctly evaluate responses to difficult questions, collecting enough data while maintaining quality, and covering unusual or adversarially-selected edge cases.
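To make the RLHF suggestion concrete, here is a minimal sketch of its reward-modelling stage, assuming labelers compare pairs of responses under a written neutrality rubric and a reward model is fit to those comparisons with a standard Bradley-Terry pairwise loss. The network, feature dimensions, and toy data below are placeholders for illustration, not anything from this exchange.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the reward-modelling stage of RLHF.
# Assumption: each candidate response has already been encoded into a
# fixed-size feature vector (in practice, a language-model embedding).

class RewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the response the labeler preferred under
    # the neutrality rubric should score higher than the one they rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy data standing in for labeler comparisons collected under the rubric.
dim = 32
chosen_feats = torch.randn(256, dim)
rejected_feats = torch.randn(256, dim)

model = RewardModel(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    loss = pairwise_loss(model(chosen_feats), model(rejected_feats))
    opt.zero_grad()
    loss.backward()
    opt.step()

# The fitted reward model would then be the optimization target for the
# policy step (e.g. PPO) in the full RLHF pipeline.
```

Everything about “ideological neutrality” enters through the comparisons themselves, i.e. through the rubric the labelers apply when choosing between responses.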
> I think that should be possible with techniques like reinforcement learning from human feedback, for a given precise specification of “ideologically neutral”.
What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?
> You’ll of course have a hard time convincing everyone that your specification is itself ideologically neutral, but projects like Wikipedia give me hope that we can achieve a reasonable amount of consensus.
I’m less optimistic about this, given that complaints about Wikipedia’s left-wing bias seem common and credible to me.
> What kind of specification do you have in mind? Is it like a set of guidelines for the human providing feedback on how to do it in an ideologically neutral way?
Yes.
The reason I said “precise specification” is that if your guidelines are ambiguous, then you’re implicitly optimizing something like “what labelers prefer on average, given the ambiguity”, but doing so in a less data-efficient way than if you had specified this target more precisely.
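As a toy illustration of that data-efficiency point (my own sketch, not part of the original exchange): if each labeler applies their own reading of an ambiguous guideline, the pairwise labels become noisier samples of the averaged criterion, and recovering that criterion to a given accuracy takes more comparisons. The synthetic features and the simple averaging estimator below are assumptions made purely for illustration.

```python
import numpy as np

# Toy illustration of how ambiguous guidelines cost data: each labeler
# applies their own reading of the spec, so pairwise labels are noisier
# samples of the *averaged* criterion and more of them are needed to
# recover it.

rng = np.random.default_rng(0)
dim = 16
target = rng.normal(size=dim)
target /= np.linalg.norm(target)  # the averaged labeling criterion

def estimate_criterion(n_comparisons: int, interpretation_noise: float) -> float:
    """Return cosine similarity between the recovered criterion and the target."""
    a = rng.normal(size=(n_comparisons, dim))
    b = rng.normal(size=(n_comparisons, dim))
    # Each labeler's reading of the guideline deviates from the shared target.
    readings = target + interpretation_noise * rng.normal(size=(n_comparisons, dim))
    labels = np.sign(np.einsum("nd,nd->n", a - b, readings))
    # Simple averaging estimator for the direction implied by the labels.
    estimate = (labels[:, None] * (a - b)).mean(axis=0)
    return float(estimate @ target / np.linalg.norm(estimate))

for noise in (0.0, 1.0):  # precise spec vs. ambiguous spec
    for n in (100, 1000, 10000):
        print(f"noise={noise:.1f}  n={n:5d}  alignment={estimate_criterion(n, noise):.3f}")
```

With the precise spec (noise 0.0) the recovered direction should align with the target at a smaller number of comparisons than with the ambiguous spec (noise 1.0), which is the sense in which ambiguity costs data.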