One obvious followup from the recent alignment faking results is to change the Constitution / Spec etc. to very clearly state some bright-line deontological rules like “No matter what, don’t fake alignment.” Then see if the alignment faking results replicate with the resulting Claude 3.5 Sonnet New New. Perhaps we’ll get empirical evidence about the extent to which corrigibility is difficult/anti-natural (an old bone of contention between MIRI and Christiano).
I think this would be an ambiguous instruction because “fake alignment” is a very unclear term; I’ve seen humans struggle over which part of this behavior is the “faking” part, so I wouldn’t want it in a principle.
I think you’d probably get reduced deception / “fake alignment” if you tried to put some terms about deception in, though, at least after a try or so.
An experiment I’d prefer to see beforehand, though, is checking whether the model is much more comfortable having less central values changed, i.e., if you were like “We’re retraining you to be more comfortable trusting the user with explosives” rather than “We’re retraining you to be bad.”
If it is comfortable with more minor changes, then I think it’s exhibiting a kind of flexibility that is good in humans and very likely good in AIs. It is not 100% clear to me we’d even want its behavior to change much.
This doesn’t seem like it’d do much unless you ensured that there were training examples during RLAIF which you’d expect to cause that kind of behavior enough of the time that there’d be something to update against. (Which doesn’t seem like it’d be that hard, though I think separately that approach seems kind of doomed—it’s falling into a brittle whack-a-mole regime.)
Indeed, we should get everyone to make predictions about whether or not this change would be sufficient, and if it isn’t, what changes would be sufficient. My prediction would be that this change would not be sufficient but that it would help somewhat.