I’d propose that RLHF matches the level of “outer alignment” humans have, which isn’t close to good enough even for ourselves. We do have a lot more, and much older, “inner alignment”, though, resulting from genome self-friendliness within our species.
(inner/outer alignment are blurry intuitive terms and are likely to collapse into a better representation under attempted systematization. I forget which post argues that.)