My bet is that this isn’t an attempt to erode alignment, but instead stems from thinking that lumping intentional bad actions together with mistakes is a reasonable starting point for building safeguards, and then the report failing to distinguish between the two due to a communication failure. It could also just be a more generic communication failure.
(I don’t know if I agree that lumping these together is a good starting point for experimenting with safeguards, but it doesn’t seem crazy.)
Communication is indeed hard, and it’s certainly possible that this isn’t intentional. On the other hand, mistakes are quite suspicious when they also happen to be useful for your agenda. But I agree that we probably shouldn’t read too much into it. The system card doesn’t even mention the possibility of the model acting maliciously, so maybe that’s simply out of scope for it?
It doesn’t seem very useful for them IMO.