RLHF is not even an alignment method. RLHF is not even a control method. RLHF is a user interface feature. It was designed as such, and that’s what it can do.
I am confused by this claim specifically. I’m not going to discuss whether RLHF actually works to deal with dangerous AIs, or whether it’s useless or safety-washing at best. But I’m pretty sure that RLHF was developed in part to create alignment techniques, and in part to model baseline alignment techniques more realistically. Regardless of how well the technique worked, I don’t think it would be correct to claim that it wasn’t an alignment technique, just that it’s ineffective or harmful.
Changed to “RLHF as actually implemented.” I’m aware of its theoretical origin story with Paul Christiano; I’m going a little “the purpose of a system is what it does”.