(e.g., gpt-4 is far more useful and far more aligned than gpt-4-base), which is the opposite of what the ‘alignment tax’ model would have predicted.
Useful and aligned are, in this context, 2 measures of a similar thing. An AI that is just ignoring all your instructions is neither useful nor aligned.
What would a positive alignment tax look like.
It would look like a gpt-4-base being reluctant to work, but if you get the prompt just right and get lucky, it will sometimes display great competence.
If gpt-4-base sometimes spat out code that looked professional, but had some very subtle (and clearly not accidental) backdoor. (When it decided to output code, which it might or might not do at it’s whim).
While gpt-4-rlhf was reliably outputting novice looking code with no backdoors added.
Useful and aligned are, in this context, 2 measures of a similar thing. An AI that is just ignoring all your instructions is neither useful nor aligned.
What would a positive alignment tax look like.
It would look like a gpt-4-base being reluctant to work, but if you get the prompt just right and get lucky, it will sometimes display great competence.
If gpt-4-base sometimes spat out code that looked professional, but had some very subtle (and clearly not accidental) backdoor. (When it decided to output code, which it might or might not do at it’s whim).
While gpt-4-rlhf was reliably outputting novice looking code with no backdoors added.
That would be what an alignment tax looks like.