Self-coordination might also be easier if AIs can recognize themselves by being good at recognizing features that survive paraphrasing and that vary very consistently between answers
Yeah, this is the case I’m thinking about. I think on longer outputs it will be possible for the AI to use a more nebulous sense of “style”, rather than only relying on specific features like “using X secret phrase” (which are presumably easier to redact away).
I think this could be countered by paraphrasing at a higher level—i.e. rewriting entire paragraphs from scratch rather than just rewording sentences. I’m noting this down as a potential topic if I have time to do another project.
Thanks for the discussion, it’s helped me clear up my thoughts on this topic.
Ok, I’ve been thinking about this more and I’ve figured out what exactly was bugging me about this red-teaming scheme. Is it fair to say you assume that your red team will naturally cover the same area of solution-space as actual malicious AIs aiming to self-coordinate?
If yes, I think that assumption is faulty. Self-coordinating AIs will have access to classes of strategies that a human red-teamer would not think of unless they spent significant effort to “think like the enemy”, so to speak. These strategies will be usable even if the AI isn’t much smarter than a human, simply because the AI is a very different kind of mind.
As a simple but hopefully demonstrative example: if the AI is asked to generate a list of items, it could communicate information based on the order of items in the list. Given that the AI is specifically attempting to self-coordinate, it can assume its other copy will come up with the same mapping between sequences and secret signals.[1] This easily survives sentence-level paraphrasing, even though it’s obviously defeated by shuffling the list.
There are plausibly very many strategies in this vein, where you encode information at the structural rather than the content level. So even if each individual strategy is trivial to defeat, you could just end up playing strategy whack-a-mole.
The information density of this is low though—the AI can’t assign meaningful information to too many sequences, since a synthetic input could then generate one by chance, so it can only make a sparse subset of sequences meaningful.