Nice article! Your main point, that capabilities and alignment can be and often are advanced together, is valuable, and I take it.
Now, to voice the definitional quibble I suspect many readers are thinking of:
I think it’s more accurate to say that some techniques invented for alignment can also be used to improve capabilities, and that the way they’re first implemented might do so by accident. A literally negative alignment tax for a whole technique seems like it stretches the definitions of capabilities and alignment.
For instance, the approach I propose in Internal independent review for language model agent alignment should hypothetically improve capabilities when it’s turned to that end, but will not if it’s used strictly for alignment. A better term for it (coined by Shane Legg in this short talk) is System 2 alignment. It’s scripting the agent to “think through” the consequences of an action before taking it, much as humans employ System 2 thinking for important actions. You could design it to think through the ethical consequences, the efficacy or cost of an action, or any combination. Including a check on the predicted ethical consequences will take slightly longer than checking only predicted efficacy and costs, and will thus carry a small but positive alignment tax.
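To make that concrete, here’s a rough sketch of the shape of the thing in Python (the function names and prompt wording are just illustrative, not taken from my post or Legg’s talk):

```python
# Rough sketch of "System 2" pre-action review for a language model agent.
# All names and prompts here are illustrative, not from the linked post.

def llm(prompt: str) -> str:
    """Stand-in for a call to whichever language model the agent uses."""
    raise NotImplementedError


def approve_action(action: str, check_ethics: bool = True) -> bool:
    """Have the agent 'think through' an action before executing it."""
    checks = [
        f"Would taking this action fail to make progress toward the goal? Action: {action}",
        f"Would taking this action cost more than it is worth? Action: {action}",
    ]
    if check_ethics:
        # The alignment half of the technique: one extra predictive check,
        # hence the small but positive runtime tax relative to checking
        # only efficacy and cost.
        checks.append(
            f"Would taking this action have harmful or unethical consequences? Action: {action}"
        )
    # Approve only if the model answers "no" to every objection.
    return all(llm(q).strip().lower().startswith("no") for q in checks)
```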
The technique itself of implementing System 2 predictions for actions doesn’t have a negative alignment tax, just the potential to be employed for both alignment and capabilities in ways so similar that the design/implementation costs are probably almost zero. This technique seems to have been independently invented several times, often with alignment as the inspiration, so we could argue that working on alignment is also advancing capabilities here.
In the case of RLHF, we might even argue that the creation tax is negative: if you don’t specify what criteria people use for judging outputs, they’ll probably include both ethical (alignment) and helpful (capabilities) considerations. Differentiating these would be a bit harder. But the runtime cost seems guaranteed to be a positive tax; refusing to do some unaligned things is a common complaint leveled against the system’s capabilities.
So:
I think the original framing, with zero as the minimal alignment tax, is probably correct. RLHF/RLAIF happen to increase both alignment and performance (by the metric of what people prefer) when they’re performed in a specific way. If you told the people or Constitution in charge of the RL process not to prefer harmless responses, just helpful ones, I very much doubt it would harm capabilities; I’d expect it to help them (particularly from the perspective of people who’d like the capability of the AI saying naughty words).
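To put the same point in toy-example form (the field names and the additive combination are mine, purely to illustrate the argument, not how any lab actually aggregates ratings):

```python
# Toy illustration: the same preference-collection setup, with harmlessness
# either included in or dropped from the rating criteria.

def preference_score(helpfulness: float, harmlessness: float,
                     include_harmlessness: bool = True) -> float:
    """Combine per-response ratings into the signal a reward model would be trained on."""
    if include_harmlessness:
        # Default: raters (or a constitution) judge on both axes, so refusals
        # of harmful requests get rewarded and show up as a runtime tax.
        return helpfulness + harmlessness
    # "Helpful-only" variant from the argument above: drop the harmlessness
    # criterion and the refusal behavior largely disappears from what is optimized.
    return helpfulness
```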
Anyway, the main point, that you can advance capabilities and alignment at the same time, and should think about differentially advancing alignment, is well taken. I’d just change the framing in future pitches to this effect.
Thanks for this comment! We definitely take your point that it may be too simplistic to classify entire techniques as exhibiting a negative alignment tax when tweaking the implementation of that technique slightly could feasibly produce misaligned behavior. It does still seem like there might be a relevant distinction between:
1. Techniques that can be applied to improve either alignment or capabilities, depending on how they’re implemented. Your example of ‘System 2 alignment’ would fall into this category, as would any other method with “the potential to be employed for both alignment and capabilities in ways so similar that the design/implementation costs are probably almost zero,” as you put it.
2. Techniques that, by their very nature, improve both alignment and capabilities simultaneously, where the improvement in capabilities is not just a potential side effect or alternative application, but an integral part of how the technique functions. RLHF (for all of its shortcomings, as we note in the post) is probably the best concrete example of this—this is an alignment technique that is now used by all major labs (some of which seem to hardly care about alignment per se) by virtue of the fact that it so clearly improves capabilities on balance.
(To this end, I think the point about refusing to do unaligned things being a lack of capability might be a stretch, as RLHF is much of what is driving the behavioral differences between, e.g., gpt-4-base and gpt-4, which go far beyond whether, to use your example, the model is using naughty words.)
We are definitely supportive of approaches that fall under both 1 and 2 (and acknowledge that 1-like approaches would not inherently have negative alignment taxes), but it does seem very likely that there are more undiscovered approaches out there with the general 2-like effect of “technique X got invented for safety reasons—and not only does it clearly help with alignment, but it also helps with other capabilities so much that, even as greedy capitalists, we have no choice but to integrate it into our AI’s architecture to remain competitive!” This seems like a real and entirely possible circumstance where we would want to say that technique X has a negative alignment tax.
Overall, we’re also sensitive to this all becoming a definitions dispute about what exactly is meant by terminology like ‘alignment taxes,’ ‘capabilities,’ etc., and the broader point that, as you put it,
you can advance capabilities and alignment at the same time, and should think about differentially advancing alignment
is indeed a good key general takeaway.