Thanks for the response. (Yeah, I think there’s some talking past each other going on.)
On further reflection, you are right about the update one should make from a “really hard to get it to stop being nice” experiment. I agree that it’s Bayesian evidence for alignment being sticky/stable vs. being fragile/sensitive. (I do think it’s also the case that “AI that is aligned half of the time isn’t aligned” is a relevant consideration, but as the saying goes, “both can be true”.)
> Showing that nice behavior is hard to train out would be bad news?
My point is not quite that it would be bad news overall, but that it would be bad news from the perspective of “how good are we at ensuring the behavior we want”.
I now notice that my language was ambiguous. (I edited it for clarity.) When I said “behavior we want”, I meant “given a behavior, be it morally good, bad, or something completely orthogonal, can we get that into the system?”, as opposed to “can we make the model behave according to human values”. What I tried to claim was that it’s bad news from the former perspective. (I feel uncertain about the update regarding the latter: as you say, we want nice behavior; on the other hand, less control over the behavior is bad.)