The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you’ll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good “semantic similarity score” (this shouldn’t be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph’s idea here.
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.
The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you’ll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good “semantic similarity score” (this shouldn’t be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph’s idea here.
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try and change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. This is a significant alignment tax.