Research Agenda in reverse: what *would* a solution look like?
I constructed my AI alignment research agenda piece by piece, stumbling around in the dark and going down many false and true avenues.
But now it increasingly feels natural to me, and indeed, somewhat inevitable.
What do I mean by that? Well, let’s look at the problem in reverse. Suppose we had an AI that was aligned with human values/preferences. How would you expect that to have been developed? I see four natural paths:
1. Effective proxy methods. For example, Paul’s amplification and distillation, or variants of revealed preferences, or a similar approach. The point of this is that it reaches alignment without defining what a preference fundamentally is; instead, it uses some proxy for the preference to do the job.
2. Corrigibility: the AI is safe and corrigible, and along with active human guidance, manages to reach a tolerable outcome.
3. Something new: a bold new method that works, for reasons we haven’t thought of today (this includes most strains of moral realism).
4. An actual grounded definition of human preferences.
So, if we focus on scenario 4, we need a few things. We need a fundamental definition of what a human preference is (since we know this can’t be defined purely from behaviour). We need a method of combining contradictory and underdefined human preferences. We also need a method for taking into account human meta-preferences. And both these methods have to actually reach an output, and not get caught in loops.
If those are the requirements, then it’s obvious why we need most of the elements of my research agenda, or something similar. We don’t need the exact methods sketched out there; there may be other ways of synthesising preferences and meta-preferences together. But the overall structure—a way of defining preferences, and ways of combining them that produce an output—seems, in retrospect, inevitable. The rest is, to some extent, just implementation details.
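To make that overall structure concrete, here is a minimal sketch in Python. Everything in it is a hypothetical placeholder of my own: the `PartialPreference` representation, the weighted-sum `synthesise` rule, the `apply_meta_preferences` step, and the iteration cap are stand-ins for whatever grounded definitions and combination methods actually get used. The only point it illustrates is the shape of the pipeline: define preferences, combine contradictory ones, let meta-preferences adjust the result, and guarantee the process terminates with an output rather than getting caught in loops.

```python
# Illustrative sketch only: placeholder definitions, not the agenda's actual methods.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PartialPreference:
    """A grounded (not purely behavioural) preference over some feature of the world."""
    feature: str      # what the preference is about
    direction: float  # +1 prefers more of it, -1 prefers less
    weight: float     # how strongly it is held


def synthesise(prefs: List[PartialPreference]) -> Dict[str, float]:
    """Combine contradictory and underdefined preferences into one weighting.

    Placeholder rule: preferences over the same feature partially cancel
    via a weighted sum.
    """
    combined: Dict[str, float] = {}
    for p in prefs:
        combined[p.feature] = combined.get(p.feature, 0.0) + p.direction * p.weight
    return combined


def apply_meta_preferences(
    combined: Dict[str, float],
    meta: List[Callable[[Dict[str, float]], Dict[str, float]]],
    max_rounds: int = 10,
) -> Dict[str, float]:
    """Let meta-preferences adjust the synthesis, with a hard cap on rounds
    so the process always reaches an output instead of looping forever."""
    for _ in range(max_rounds):
        updated = combined
        for adjust in meta:
            updated = adjust(updated)
        if updated == combined:  # reached a fixed point
            break
        combined = updated
    return combined


if __name__ == "__main__":
    prefs = [
        PartialPreference("leisure", +1.0, 0.6),
        PartialPreference("leisure", -1.0, 0.2),  # contradictory preference
        PartialPreference("health", +1.0, 0.8),
    ]
    # Illustrative meta-preference: don't let any single value dominate.
    cap = lambda d: {k: max(min(v, 0.5), -0.5) for k, v in d.items()}
    print(apply_meta_preferences(synthesise(prefs), [cap]))
```

Any real version would replace each placeholder with a defensible method; the sketch only shows that the pieces slot together and that the output stage can be made to terminate.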