For an AI to be good, you need it to be robustly good: in the vast majority of cases (even under some amount of adversarial pressure), the AI should do good things. AI that is aligned half of the time isn’t aligned.
No, this doesn’t seem very symmetry breaking, and it doesn’t invalidate my point. The hypothetical experiment would still be Bayesian evidence that alignment is extremely stable; just not total evidence (because the alignment wasn’t shown to be total in scope, as you say). Similarly, this result is not being celebrated as “total evidence.” It’s evidence of deceptive alignment being stable in a very small set of situations. For deceptive alignment to matter in practice, it has to occur across enough situations that it can practically arise and be consistently carried through.
In either case, both results would indeed be (some, perhaps small and preliminary) evidence that good alignment and deceptive alignment, respectively, are extremely stable under training.
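To spell out the “Bayesian evidence” point with a toy formalization (the notation here is purely illustrative, not anything from the experiments): let $S$ be “the trained behavior is sticky under further training” and $E$ be “the behavior persisted in the (narrow) set of situations tested”. The odds form of Bayes’ rule gives
$$\frac{P(S \mid E)}{P(\neg S \mid E)} \;=\; \frac{P(E \mid S)}{P(E \mid \neg S)} \cdot \frac{P(S)}{P(\neg S)},$$
so as long as persistence is more likely under stickiness than under fragility, the observation shifts the odds toward $S$; the update is just modest when the likelihood ratio is modest, e.g. because only a narrow set of situations was tested.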
Also, beyond “how stable is (un)alignment”, there’s the perspective of “how good are we at ensuring the behavior we want”. Both the presented work and your hypothetical experiment are bad news about the latter.
Showing that nice behavior is hard to train out would be bad news? We in fact want nice behavior (in the vast majority of situations)! It would be great news if benevolent purposes were convergently drilled into AI by the data. (But maybe we’re talking past each other.)
I’m confused here. It seems to me that if your AI normally does evil things and only sometimes (in certain situations) does good things, I would not call it “aligned”, and its alignment is certainly not stable (because it almost never takes “good” actions). That said, such a system is not robustly “misaligned” either.
Fine. I’m happy to assume that, in my hypothetical, we observe that it’s always very nice and hard to make not-nice. I claim that a bunch of people would still skeptically ask “but how is this relevant to future models?”
Thanks for the response. (Yeah, I think there’s some talking past each other going on.)
On further reflection, you are right about the update one should make about a “really hard to get it to stop being nice” experiment. I agree that it’s Bayesian evidence for alignment being sticky/stable vs. being fragile/sensitive. (I do think it’s also the case that “AI that is aligned half of the time isn’t aligned” is a relevant consideration, but as the saying goes, “both can be true”.)
Showing that nice behavior is hard to train out would be bad news?
My point is not quite that it would be bad news overall, but that it would be bad news from the perspective of “how good are we at ensuring the behavior we want”.
I now notice that my language was ambiguous. (I edited it for clarity.) When I said “behavior we want”, I meant “given a behavior, be it morally good or bad or something completely orthogonal, can we get that behavior into the system?”, as opposed to “can we make the model behave according to human values”. And what I tried to claim was that it’s bad news from the former perspective. (I feel uncertain about the update regarding the latter: as you say, we want nice behavior; on the other hand, less control over behavior is bad.)