I’m confused here. It seems to me that if your AI normally does evil things and then sometimes (in certain situations) does good things, I would not call it “aligned”, and the alignment is certainly not stable (because it almost never takes “good” actions). Although this thing is not robustly “misaligned” either.
Fine. I’m happy to assume that, in my hypothetical, we observe that the AI is always very nice and hard to make not-nice. I claim that a bunch of people would still skeptically ask, “But how is this relevant to future models?”