RLHF is not even an alignment method. RLHF is not even a control method. RLHF is a user interface feature. It was designed as such, and that’s what it can do.
I am confused by this claim specifically. I’m not going to discuss whether RLHF actually works to deal with dangerous AIs, or whether it’s useless or safety-washing at best. But I’m pretty sure that RLHF was developed in part to create alignment techniques, and in part to model baseline alignment techniques more realistically. Regardless of how well the technique worked, I don’t think it would be correct to claim that it wasn’t an alignment technique, just that it’s ineffective or harmful.
Changed to “RLHF as actually implemented.” I’m aware of its theoretical origin story with Paul Christiano; I’m going a little “the purpose of a system is what it does”.