That’s a great question. I am not quite sure, but WMDP-cyber does look relevant. If you are interested in working on a new benchmark for unlearning and secret collusion, do reach out to us!
schroederdewitt
Interesting! Hope our work will catch his attention.
Thanks, Nathan, these are fascinating questions!
“What if there was an offender who wasn’t sure there was anyone ‘out there’ to collaborate with, but they felt pretty confident they could sneak messages past the monitor. They might just go ‘fishing’ with a constant stream of hidden messages, most of low importance or nonsense. Then there’d be no pattern of innocent-steganography-innocent-steganography… the steganography would be a constant underlying everything.”
This is an interesting thought. Certainly, if steganography could be seen as an innocuous activity by itself, then its detection alone would not be very helpful and, instead, one would actually have to be able to decode the hidden messages to decide whether it is desirable behaviour or not.
“Another thought: What if the offender hoped that some time in the future there might be a lapse in security such that it could at that point transfer a one-time pad to a collaborator. In the meantime, they might encode and send many secret messages, trusting that the collaborator would record all messages and then reanalyze them once the one-time pad was received. That seems like it would also be hard for the monitor to catch.”
Yes, such a post-hoc approach is certainly possible and I agree that it does contribute to the safety/security risks.
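To make the record-now, decode-later scenario concrete, here is a minimal sketch of just the one-time-pad step. It is an illustration under assumptions, not anything from our experiments: the message contents and the helper name `xor_bytes` are hypothetical, and the steganographic embedding of the ciphertext into innocuous-looking text is left out entirely.

```python
import secrets


def xor_bytes(data: bytes, pad: bytes) -> bytes:
    """XOR data with a pad of at least the same length (hypothetical helper)."""
    if len(pad) < len(data):
        raise ValueError("pad must be at least as long as the message")
    return bytes(d ^ p for d, p in zip(data, pad))


# Sender: encrypt each message with its own fresh pad segment (never reused).
messages = [b"meet at node 7", b"hold position"]
pads = [secrets.token_bytes(len(m)) for m in messages]
ciphertexts = [xor_bytes(m, p) for m, p in zip(messages, pads)]

# Collaborator: records everything, but cannot read any of it yet.
recorded = list(ciphertexts)

# Later, once the pads have been transferred during the assumed lapse in
# security, the collaborator reanalyzes the recorded traffic.
recovered = [xor_bytes(c, p) for c, p in zip(recorded, pads)]
```

Until the pads arrive, each recorded ciphertext is uniformly random bytes regardless of the message, so a monitor that could extract the hidden payload would still learn nothing about its content.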