I think lots of folks (but not all) would be up in arms, claiming “but modern results won’t generalize to future systems!” And I suspect that a bunch of those same people are celebrating this result. I think one key difference is that this paper claims pessimistic results, and it’s socially OK to make negative updates but not positive ones; and this result fits in with existing narratives and memes. Maybe I’m being too cynical, but that’s my reaction.
Fwiw, my reaction to something like “we can finetune the AI to be nice in a stable way” is more like: but is it actually “nice”? I.e., I don’t feel like we’re at all clear on what “niceness” is, and behavioral proxies for it feel like some, but only pretty weak, evidence about it.
This is my basic concern with evaluations, too. At the moment they just don’t seem at all robust enough for me to feel confident about what the results mean. But I see the sleeper agents work as progress towards the goal of “getting better at thinking about deception” and I feel pretty excited about that.
I think it’s reasonable to question whether these systems are in fact deceptive (just as I would with “niceness”). But when I evaluate results, the question is less “is this an optimistic or pessimistic update?” and more “does it seem like we understand more about X than we did before?” I think we understand more about deception because of this work, and I think that’s cool and important.