Proposal part 3: Use a Paraphraser: One more complication. The outputs of the Shoggoth? Paraphrase them. For example, after each line of the reasoning CoT is complete, pause the Shoggoth and have a different, specialized paraphraser model generate 10 syntactically distinct but semantically equivalent ways of saying the same thing. Then pick one at random and replace the original line of reasoning with it. Then boot the Shoggoth back up and have it continue the CoT.
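In rough Python, the loop might look like the sketch below. The `shoggoth_step` and `paraphraser` callables are hypothetical stand-ins for the actual models, not any real API:

```python
import random

def paraphrase_step(line, paraphraser, n_candidates=10):
    """Generate n syntactically distinct, semantically equivalent
    paraphrases of one CoT line, then pick one at random.
    `paraphraser` is a hypothetical callable standing in for the
    specialized paraphraser model."""
    candidates = [paraphraser(line) for _ in range(n_candidates)]
    return random.choice(candidates)

def run_cot_with_paraphrasing(shoggoth_step, paraphraser, max_steps=50):
    """Alternate: the Shoggoth emits a CoT line, the paraphraser rewrites
    it, and the rewritten line is fed back as context for the next step,
    so the Shoggoth only ever conditions on paraphrased history."""
    cot = []
    for _ in range(max_steps):
        line = shoggoth_step(cot)  # hypothetical Shoggoth interface
        if line is None:           # Shoggoth signals the CoT is done
            break
        cot.append(paraphrase_step(line, paraphraser))
    return cot
```

The key property is that the Shoggoth never sees its own raw wording again, only a random paraphrase of it.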
Any thoughts on how much of a (computational) alignment tax this would impose? Related, thoughts on how big of a worry scheming/opaque cognitions in the paraphrasers would be? (e.g. one intuition here might be that the paraphrasers might be ‘trusted’ in control terminology—incapable of scheming because too weak; in this case the computational alignment tax might also be relatively low, if the paraphrasers are much smaller than the Face and the Shoggoth).
I think the risk of the paraphraser scheming seems pretty low if it is a pretrained smaller LLM which has been task-fine-tuned on paraphrasing sentences while keeping semantic meaning approximately the same. If you create this ahead of time, and freeze it (use it in inference mode only) during the training of the shoggoth, then any scheming in service of the shoggoth's purposes would still have to come from the shoggoth itself.
Could the shoggoth learn jailbreaks to get the paraphraser to say more specific things? This seems hard, but possible. To prevent this, you can take a cheap preventative measure: take your small base model (e.g. Haiku) and make 100 different fine-tuned versions of it using different datasets, and maybe vary some other training parameters. When training the shoggoth, always pick one of the 100 paraphrasers at random for each sentence. That seems like a hard target for the shoggoth to jailbreak! Especially since it wouldn't know, when generating each sentence of its CoT, which of the 100 paraphrasers was going to be chosen to paraphrase that sentence.
This adds negligible cost, since fine-tuning the smaller model is cheap. You can use something like S-LoRA to swap between the different fine-tuned Haiku versions, so you don't have to load them all into GPU memory at once in order to do the inference.
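A minimal sketch of the random-ensemble trick, where the ensemble members are toy callables standing in for the 100 fine-tuned Haiku variants (in practice each would be a LoRA adapter hot-swapped at inference time):

```python
import random

def make_paraphraser_ensemble(n=100):
    """Build a toy ensemble of n 'paraphrasers'. Each toy member just
    tags its output with its index; a real member would be the base
    model plus one of n different fine-tuned adapters."""
    return [lambda line, i=i: f"[v{i}] {line}" for i in range(n)]

def paraphrase_with_random_member(line, ensemble):
    """Pick one ensemble member uniformly at random per sentence, so
    the shoggoth cannot predict which paraphraser it would need to
    jailbreak when it generates the sentence."""
    return random.choice(ensemble)(line)
```

Since the draw happens after the sentence is written, any jailbreak would have to work against all 100 paraphrasers at once.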
If the shoggoth and paraphraser are both same-size models, then arguably you are adding a significant computational cost by having the paraphraser generate 10 paraphrases of each paragraph.
But you can probably get away with a significantly smaller paraphraser model. Like maybe you use a 10x smaller paraphraser model that generates 10 paraphrases at first, but then once the system is established you reduce it to 3 paraphrases, meaning your overall increased computational burden is 30%. And then later you train a specialized paraphraser model that's 10x smaller still, and now the burden is only 3%.
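The back-of-envelope arithmetic here, under the rough assumption that per-token cost scales linearly with model size:

```python
def paraphrase_overhead(size_ratio, n_paraphrases):
    """Extra compute from paraphrasing, as a fraction of the shoggoth's
    own generation cost. Assumes the paraphraser is `size_ratio` times
    smaller than the shoggoth, regenerates each paragraph
    `n_paraphrases` times, and per-token cost scales linearly with
    model size."""
    return n_paraphrases / size_ratio

# 10x smaller paraphraser, 10 paraphrases each: 100% overhead.
# Cut to 3 paraphrases: 30% overhead.
# A further 10x smaller specialized paraphraser: 3% overhead.
```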
idk just a guess
The paraphraser would definitely have opaque cognition, but hopefully it wouldn't be scheming, because it wouldn't have agency skills or be situationally aware. It would just be a fine-tune of a base model, and a dumb/small base model at that. Heck, it could even just be a prompted base model.
The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you’ll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good “semantic similarity score” (this shouldn’t be too hard) and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while also heavily punishing low semantic similarity score). This is similar to @CBiddulph’s idea here.
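A sketch of what that fine-tuning reward could look like. The entropy measure, the semantic similarity scorer (e.g. embedding cosine similarity), and every weight and threshold below are made-up placeholders, not values from the original comment:

```python
def paraphraser_reward(entropy, similarity,
                       entropy_weight=1.0,
                       similarity_floor=0.8,
                       drift_penalty=10.0):
    """Toy reward for fine-tuning a paraphraser: reward diversity
    (high entropy across paraphrases of the same input), and heavily
    punish semantic drift (similarity below some floor).

    entropy: diversity measure over a batch of paraphrases of one input
    similarity: semantic similarity between input and paraphrase, in [0, 1]
    """
    reward = entropy_weight * entropy
    if similarity < similarity_floor:
        # Penalty grows the further the paraphrase drifts below the floor,
        # with a weight large enough to dominate the entropy bonus.
        reward -= drift_penalty * (similarity_floor - similarity)
    return reward
```

The asymmetry is the point: a faithful but repetitive paraphrase costs a little reward, while a diverse but meaning-changing one costs a lot.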
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low computational alignment tax. However, note that my view of this whole proposal as an alignment technique requires that you try to change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. That is a significant alignment tax.