I notice that there are not-insane views on which both of the “harmless” instruction examples are genuinely as bad as the instructions people have actually chosen to try to make models refuse. I’m not sure whether to read that choice of examples as buying into the standard framing or as a jab at it. Given that they explicitly call them “fun” examples, I’m leaning toward “jab”.