Thanks for this comment! Definitely take your point that it may be too simplistic to classify entire techniques as exhibiting a negative alignment tax when tweaking the implementation of that technique slightly could feasibly produce misaligned behavior. It does still seem like there might be a relevant distinction between:
Techniques that can be applied to improve either alignment or capabilities, depending on how they’re implemented. Your example of ‘System 2 alignment’ would fall into this category, as would any other method with “the potential to be employed for both alignment and capabilities in ways so similar that the design/implementation costs are probably almost zero,” as you put it.
Techniques that, by their very nature, improve both alignment and capabilities simultaneously, where the improvement in capabilities is not just a potential side effect or alternative application, but an integral part of how the technique functions. RLHF (for all of its shortcomings, as we note in the post) is probably the best concrete example of this—this is an alignment technique that is now used by all major labs (some of which seem to hardly care about alignment per se) by virtue of the fact it so clearly improves capabilities on balance.
(To this end, I think the point about refusing to do unaligned stuff as a lack of capability might be a stretch, as RLHF is much of what is driving the behavioral differences between, eg, gpt-4-base and gpt-4, which goes far beyond whether, to use your example, the model is using naughty words.)
We are definitely supportive of approaches that fall under both 1 and 2 (and acknowledge that 1-like approaches would not inherently have negative alignment taxes), but it does seem very likely that there are more undiscovered approaches out there with the general 2-like effect of “technique X got invented for safety reasons—and not only does it clearly help with alignment, but it also helps with other capabilities so much that, even as greedy capitalists, we have no choice but to integrate it into our AI’s architecture to remain competitive!” This seems like a real and entirely possible circumstance where we would want to say that technique X has a negative alignment tax.
Overall, we’re also sensitive to this all becoming a definitions dispute about what exactly is meant by terminology like ‘alignment taxes,’ ‘capabilities,’ etc, and the broader point that, as you put it,
you can advance capabilities and alignment at the same time, and should think about differentially advancing alignment
Thanks for this comment! Definitely take your point that it may be too simplistic to classify entire techniques as exhibiting a negative alignment tax when tweaking the implementation of that technique slightly could feasibly produce misaligned behavior. It does still seem like there might be a relevant distinction between:
Techniques that can be applied to improve either alignment or capabilities, depending on how they’re implemented. Your example of ‘System 2 alignment’ would fall into this category, as would any other method with “the potential to be employed for both alignment and capabilities in ways so similar that the design/implementation costs are probably almost zero,” as you put it.
Techniques that, by their very nature, improve both alignment and capabilities simultaneously, where the improvement in capabilities is not just a potential side effect or alternative application, but an integral part of how the technique functions. RLHF (for all of its shortcomings, as we note in the post) is probably the best concrete example of this—this is an alignment technique that is now used by all major labs (some of which seem to hardly care about alignment per se) by virtue of the fact it so clearly improves capabilities on balance.
(To this end, I think the point about refusing to do unaligned stuff as a lack of capability might be a stretch, as RLHF is much of what is driving the behavioral differences between, eg, gpt-4-base and gpt-4, which goes far beyond whether, to use your example, the model is using naughty words.)
We are definitely supportive of approaches that fall under both 1 and 2 (and acknowledge that 1-like approaches would not inherently have negative alignment taxes), but it does seem very likely that there are more undiscovered approaches out there with the general 2-like effect of “technique X got invented for safety reasons—and not only does it clearly help with alignment, but it also helps with other capabilities so much that, even as greedy capitalists, we have no choice but to integrate it into our AI’s architecture to remain competitive!” This seems like a real and entirely possible circumstance where we would want to say that technique X has a negative alignment tax.
Overall, we’re also sensitive to this all becoming a definitions dispute about what exactly is meant by terminology like ‘alignment taxes,’ ‘capabilities,’ etc, and the broader point that, as you put it,
is indeed a good key general takeaway.