RLHF and Fine-Tuning have not worked well so far. Models are often unhelpful, untruthful, inconsistent, in many ways that had been theorized in the past. We also witness goal misspecification, misalignment, etc. Worse than this, as models become more powerful, we expect more egregious instances of misalignment, as more optimization will push for more and more extreme edge cases and pseudo-adversarial examples.
These three links are:
The first is Mysteries of mode collapse, which claims that RLHF (as well as OpenAI’s supervised fine-tuning on highly-rated responses) decreases entropy. This doesn’t seem particularly related to any of the claims in this paragraph, and I haven’t seen it explained why this is a bad thing.
The second is Discovering language model behaviors with model-written evaluations and shows that Anthropic’s models trained with RLHF have systematically different personalities than the pre-trained model. I’m not exactly sure what claims you are citing, but I think it probably involves some big leaps to interpret this as either directly harmful or connected with traditional stories about risk.
The third is Compendium of problems with RLHF, which primarily links to the previous 2 failures and then discusses theoretical limitations.
I think these are bad citations for the claim that methods are “not working well” or that current evidence points towards trouble.
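To make the entropy claim concrete, here is a minimal sketch of what "decreases entropy" means for a next-token distribution. The two distributions are made-up numbers for illustration, not measurements from any actual base or RLHF model, and `shannon_entropy` is just a helper name I'm introducing here.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token distributions over four candidate tokens.
# These values are assumptions chosen for illustration only.
base_probs = [0.25, 0.25, 0.25, 0.25]   # probability mass spread evenly
tuned_probs = [0.97, 0.01, 0.01, 0.01]  # mass concentrated on one token

base_entropy = shannon_entropy(base_probs)
tuned_entropy = shannon_entropy(tuned_probs)

print(f"base:  {base_entropy:.3f} bits")
print(f"tuned: {tuned_entropy:.3f} bits")
```

The second distribution has lower entropy because almost all of the probability sits on a single token. Whether that kind of concentration is actually harmful is a separate question, which is the point being made above.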
The current problems you list (“unhelpful, untruthful, and inconsistent”) don’t seem like good examples to illustrate your point. These are mostly caused by models failing to correctly predict which responses a human would rate highly. That happens because models have limited capabilities, and it is rapidly improving as models get smarter. These are not the problems that most people in the community are worried about, and I think it’s misleading to say this is what was “theorized” in the past.
I think RLHF is obviously inadequate for aligning really powerful models, both because you cannot effectively constrain a deceptively aligned model and because human evaluators will eventually not be able to understand the consequences of proposed actions. And I think it is very plausible that large language models will pose serious catastrophic risks from misalignment before they are transformative (it seems very hard to tell). But I feel like this post isn’t engaging with the substance of those concerns, nor is it sensitive to the actual state of evidence about how severe the problem is likely to be or how well existing mitigations might work.
Agree that the cited links don’t represent a strong criticism of RLHF, but I think there’s an interesting implied criticism, between the mode-collapse post and janus’ other writings on cyborgism etc., that I haven’t seen spelled out, though it may well be somewhere.
I see janus as saying that if you know how to properly use the raw models, then you can actually get much more useful work out of the raw models than the RLHF’d ones. If true, we’re paying a significant alignment tax with RLHF that will only become clear with the improvement and take-up of wrappers around base models in the vein of Loom.
I guess the test (best done without too much fanfare) would be to get a few people well acquainted with Loom or whichever wrapper tool and identify a few complex tasks and see whether the base model or the RLHF model performs better.
Even if true, though, I don’t think it’s really a mark against RLHF, since it’s still likely that RLHF makes outputs safer for the vast majority of users. It just means that if we think we’re in an ideas arms race with people trying to advance capabilities, we can’t expect everyone to be using RLHF’d models.
A new paper, built upon the compendium of problems with RLHF, tries to make an exhaustive list of all the issues identified so far: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback