I agree with your points about avoiding political polarisation and allowing people with different ideological positions to collaborate on alignment. I’m not sure about the idea that aligning to a single group’s values (or to a coherent ideology) is technically easier than a more vague ‘align to humanity’s values’ goal.
Groups rarely have clearly articulated ideologies; it's more like vibes that everyone normally gets behind. An alignment approach that starts from clearly spelling out what you consider valuable doesn't seem likely to work. Looking at existing models, which have been aligned to some degree through safety testing, the work doesn't take the form of injecting a clear, structured value set. Instead, large numbers of people with differing opinions and world views continually correct the system until it generally behaves itself. This seems far more pluralistic than 'alignment to one group' suggests.
This comes with the caveat that these systems are created and safety-tested by people with highly abnormal attitudes compared to the rest of the species. But sourcing viewpoints from outside seems to be an organisational issue rather than a technical one.
I agree with everything you've said. The advantages come primarily from aligning not to values but to following instructions, rather than using RL or any other process to infer underlying values. Instruction-following AGI is easier and more likely than value aligned AGI.
I think creating real AGI based on an LLM aligned to be helpful, harmless and honest would probably be the end of us, since carrying the set of values implied by RLHF to their logical conclusions outside of human control would probably look pretty different from our desired values. Instruction-following provides corrigibility.
Edit: by "small group" I meant something like five people who are authorized to give instructions to an AGI.