I think creating real AGI based on an LLM aligned to be helpful, harmless and honest would probably be the end of us, as carrying the set of values implied by RLHF to their logical conclusions outside of human control would probably be pretty different from our desired values. Instruction-following provides corrigibility.
Edit: by “small group” I meant something like five people who are authorized to give instructions to an AGI.
I agree with everything you’ve said. The advantages come primarily from aligning only to following instructions, rather than aligning to values or using RL or any other process to infer underlying values. Instruction-following AGI is easier and more likely than value-aligned AGI.