Anthropic is also working on inner alignment; it’s just not published yet.
Regarding what “the point” of RL from human preferences with language models is: I think it’s not only to make progress on outer alignment (I would agree that this is probably not the core issue, although I still think it’s a relevant alignment issue). See e.g. Ajeya’s comment here:
According to my understanding, there are three broad reasons that safety-focused people worked on human feedback in the past (despite many of them, certainly including Paul, agreeing with this post that pure human feedback is likely to lead to takeover):
1. Human feedback is better than even-worse alternatives such as training the AI on a collection of fully automated rewards (predicting the next token, winning games, proving theorems, etc.) and waiting for it to get smart enough to generalize well enough to be helpful / follow instructions. So it seemed good to move the culture at AI labs away from automated and easy rewards and toward human feedback.
2. You need to have human feedback working pretty well to start testing many other strategies for alignment, like debate, recursive reward modeling, and training-for-interpretability, which tend to build on a foundation of human feedback.
3. Human feedback provides a more realistic baseline to compare other strategies to: you want to be able to tell clearly whether your alignment scheme actually works better than human feedback.
With that said, my guess is that on the current margin people focused on safety shouldn’t be spending too much more time refining pure human feedback (and ML alignment practitioners I’ve talked to largely agree, e.g. the OpenAI safety team recently released this critiques work—one step in the direction of debate).
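For concreteness, the “RL from human preferences” setup under discussion typically starts by fitting a reward model to pairwise human comparisons and then doing RL against that learned reward. Here is a minimal, purely illustrative sketch of the reward-modeling step; the architecture, data, and hyperparameters are placeholders and not taken from any of the work referenced above:

```python
# Illustrative sketch of reward modeling from pairwise human preferences.
# Everything here (sizes, data, names) is a placeholder, not from any cited work.
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Tiny stand-in for a language-model-based reward model: maps a response embedding to a scalar score."""

    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(embed_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the human-preferred response should score higher.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()


# Toy training loop: random vectors stand in for embeddings of (prompt, response)
# pairs that humans have compared.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(100):
    chosen = torch.randn(32, 64)    # responses humans preferred
    rejected = torch.randn(32, 64)  # responses humans rejected
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained scorer is then used as the reward signal for an RL step
# (e.g. PPO) on the policy model.
```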