RLHF makes alignment worse by covering up any alignment failures. Insofar as it is economically favored, that would be a huge positive alignment tax, rather than a negative alignment tax.
I think the post makes clear that we agree RLHF is far from perfect and won’t scale, and definitely appreciate the distinction between building aligned models from the ground-up vs. ‘bolting on’ safety techniques (like RLHF) after the fact.
However, if you are referring to current models, the claim that RLHF makes alignment worse (as compared to a world where we simply forego doing RLHF?) seems empirically false.
However, if you are referring to current models, the claim that RLHF makes alignment worse (as compared to a world where we simply forego doing RLHF?) seems empirically false.
You talk like “alignment” is a traits that a model might have to varying extents, but really there is the alignment problem that AI will unleash a giant wave of stuff and we have reasons to believe this giant wave will crash into human society and destroy it. A solution to the alignment problem constitutes some way of aligning the wave to promote human flourishing instead of destroying society.
RLHF makes models avoid taking actions that humans can recognize as bad. If you model the giant wave as being caused by a latent “alignment” trait that a model have, which can be observed from whether it takes recognizably-bad actions, then RLHF will almost definitionally make you estimate this trait to be very very low.
But that model is not actually true and so your estimate of the “alignment” trait has nothing to do with how we’re doing with the alignment problem. On the other hand, the fact that RLHF makes you estimate that we’re doing well means that you have lost track of solving the alignment problem, which means we are one man down. Unless your contribution to solving the alignment problem would otherwise have been counterproductive/unhelpful, this means RLHF has made the situation with the alignment problem worse.
Let’s take a better way to estimate progress: Spambots. We don’t want them around (human values), but they pop up to earn money (instrumental convergence). You can use RLHF to make an AI to identify and remove spambots, for instance by giving it moderator powers on a social media website and evaluating its chains of thought, and you can use RLHF to make spambots, for instance by having people rate how human its text looks and how much it makes them want to buy products/fall for scams/whatever. I think it’s generally agreed that the latter is easier than the former.
Of course spambots aren’t the only thing human values care about. We also care about computers being able to solve cognitive tasks for us, and you are right that computers will be better able to solve cognitive tasks for us if we RLHF them. But this is a general characteristic of capabilities advances.
But a serious look at how RLHF improves the alignment problem doesn’t treat “alignment” as a latent trait of piles of matrices that can be observed for fragmented actions, rather it starts by looking at what effects RLHF lets AI models have on society, and then asks whether these effects are good or bad. So far, all the cases for RLHF show that RLHF makes an AI more economically valuable, which incentivizes producing more AI, but as long as the alignment problem is relevant, this makes the alignment problem worse rather than better.
Thanks for writing up your thoughts on RLHF in this separate piece, particularly the idea that ‘RLHF hides whatever problems the people who try to solve alignment could try to address.’ We definitely agree with most of the thrust of what you wrote here and do not believe nor (attempt to) imply anywhere that RLHF indicates that we are globally ‘doing well’ on alignment or have solved alignment or that alignment is ‘quite tractable.’ We explicitly do not think this and say so in the piece.
With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)
The broader point we are making in this post is that the entire world is moving full steam ahead towards more powerful AI whether we like it or not, and so discovering and deploying alignment techniques that move in the direction of actually satisfying this impossible-to-ignore attractor while also maximally decreasing the probability that the “giant wave will crash into human society and destroy it” seem worth pursuing—especially compared to the very plausible counterfactual world where everyone just pushes ahead with capabilities anyways without any corresponding safety guarantees.
While we do point to RLHF in the piece as one nascent example of what this sort of thing might look like, we think the space of possible approaches with a negative alignment tax is potentially vast. One such example we are particularly interested in (unlike RLHF) is related to implicit/explicit utility function overlap, mentioned in this comment.
With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)
I would like to see the case for catastrophic misuse being an xrisk, since it mostly seems like business-as-usual for technological development (you get more capabilities which means more good stuff but also more bad stuff).
It seems fairly clear that widely deployed, highly capable AI systems enabling unrestricted access to knowledge about weapons development, social manipulation techniques, coordinated misinformation campaigns, engineered pathogens, etc. could pose a serious threat. Bad actors using that information at scale could potentially cause societal collapse even if the AI itself was not agentic or misaligned in the way we usually think about with existential risk.
RLHF makes alignment worse by covering up any alignment failures. Insofar as it is economically favored, that would be a huge positive alignment tax, rather than a negative alignment tax.
I think the post makes clear that we agree RLHF is far from perfect and won’t scale, and definitely appreciate the distinction between building aligned models from the ground-up vs. ‘bolting on’ safety techniques (like RLHF) after the fact.
However, if you are referring to current models, the claim that RLHF makes alignment worse (as compared to a world where we simply forego doing RLHF?) seems empirically false.
You talk like “alignment” is a traits that a model might have to varying extents, but really there is the alignment problem that AI will unleash a giant wave of stuff and we have reasons to believe this giant wave will crash into human society and destroy it. A solution to the alignment problem constitutes some way of aligning the wave to promote human flourishing instead of destroying society.
RLHF makes models avoid taking actions that humans can recognize as bad. If you model the giant wave as being caused by a latent “alignment” trait that a model have, which can be observed from whether it takes recognizably-bad actions, then RLHF will almost definitionally make you estimate this trait to be very very low.
But that model is not actually true and so your estimate of the “alignment” trait has nothing to do with how we’re doing with the alignment problem. On the other hand, the fact that RLHF makes you estimate that we’re doing well means that you have lost track of solving the alignment problem, which means we are one man down. Unless your contribution to solving the alignment problem would otherwise have been counterproductive/unhelpful, this means RLHF has made the situation with the alignment problem worse.
Let’s take a better way to estimate progress: Spambots. We don’t want them around (human values), but they pop up to earn money (instrumental convergence). You can use RLHF to make an AI to identify and remove spambots, for instance by giving it moderator powers on a social media website and evaluating its chains of thought, and you can use RLHF to make spambots, for instance by having people rate how human its text looks and how much it makes them want to buy products/fall for scams/whatever. I think it’s generally agreed that the latter is easier than the former.
Of course spambots aren’t the only thing human values care about. We also care about computers being able to solve cognitive tasks for us, and you are right that computers will be better able to solve cognitive tasks for us if we RLHF them. But this is a general characteristic of capabilities advances.
But a serious look at how RLHF improves the alignment problem doesn’t treat “alignment” as a latent trait of piles of matrices that can be observed for fragmented actions, rather it starts by looking at what effects RLHF lets AI models have on society, and then asks whether these effects are good or bad. So far, all the cases for RLHF show that RLHF makes an AI more economically valuable, which incentivizes producing more AI, but as long as the alignment problem is relevant, this makes the alignment problem worse rather than better.
Thanks for writing up your thoughts on RLHF in this separate piece, particularly the idea that ‘RLHF hides whatever problems the people who try to solve alignment could try to address.’ We definitely agree with most of the thrust of what you wrote here and do not believe nor (attempt to) imply anywhere that RLHF indicates that we are globally ‘doing well’ on alignment or have solved alignment or that alignment is ‘quite tractable.’ We explicitly do not think this and say so in the piece.
With this being said, catastrophic misuse could wipe us all out, too. It seems too strong to say that the ‘traits’ of frontier models ‘[have] nothing to do’ with the alignment problem/whether a giant AI wave destroys human society, as you put it. If we had no alignment technique that reliably prevented frontier LLMs from explaining to anyone who asked how to make anthrax, build bombs, spread misinformation, etc, this would definitely at least contribute to a society-destroying wave. But finetuned frontier models do not do this by default, largely because of techniques like RLHF. (Again, not saying or implying RLHF achieves this perfectly or can’t be easily removed or will scale, etc. Just seems like a plausible counterfactual world we could be living in but aren’t because of ‘traits’ of frontier models.)
The broader point we are making in this post is that the entire world is moving full steam ahead towards more powerful AI whether we like it or not, and so discovering and deploying alignment techniques that move in the direction of actually satisfying this impossible-to-ignore attractor while also maximally decreasing the probability that the “giant wave will crash into human society and destroy it” seem worth pursuing—especially compared to the very plausible counterfactual world where everyone just pushes ahead with capabilities anyways without any corresponding safety guarantees.
While we do point to RLHF in the piece as one nascent example of what this sort of thing might look like, we think the space of possible approaches with a negative alignment tax is potentially vast. One such example we are particularly interested in (unlike RLHF) is related to implicit/explicit utility function overlap, mentioned in this comment.
I would like to see the case for catastrophic misuse being an xrisk, since it mostly seems like business-as-usual for technological development (you get more capabilities which means more good stuff but also more bad stuff).
It seems fairly clear that widely deployed, highly capable AI systems enabling unrestricted access to knowledge about weapons development, social manipulation techniques, coordinated misinformation campaigns, engineered pathogens, etc. could pose a serious threat. Bad actors using that information at scale could potentially cause societal collapse even if the AI itself was not agentic or misaligned in the way we usually think about with existential risk.
Wrote it up in longer form.