If the shoggoth and paraphraser are both the same size, then arguably you’re adding a significant computational cost by having the paraphraser generate 10 paraphrases of each paragraph.
But you can probably get away with a significantly smaller paraphraser model. Like maybe you use a 10x smaller paraphraser that generates 10 paraphrases at first, but then once the system is established you reduce it to 3 paraphrases, so your overall added computational burden is about 30%. And then later you train a specialized paraphraser that’s 10x smaller still, and now the burden is only about 3% (see the quick arithmetic sketch below).
idk just a guess
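To make the back-of-the-envelope arithmetic explicit, here is a tiny sketch, assuming compute scales roughly linearly with model size and with the number of paraphrases generated; all the ratios are just the guesses above:

```python
# Rough extra compute from the paraphraser, as a fraction of the shoggoth's
# own compute, under the linear-scaling assumption stated above.

def paraphraser_overhead(size_ratio: float, num_paraphrases: int) -> float:
    """Extra compute as a fraction of the shoggoth's compute."""
    return num_paraphrases / size_ratio

print(paraphraser_overhead(size_ratio=10, num_paraphrases=10))   # 1.0  -> same cost as the shoggoth
print(paraphraser_overhead(size_ratio=10, num_paraphrases=3))    # 0.3  -> ~30% overhead
print(paraphraser_overhead(size_ratio=100, num_paraphrases=3))   # 0.03 -> ~3% overhead
```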
The paraphraser would definitely have opaque cognition, but hopefully it wouldn’t be scheming, because it wouldn’t have agency skills or situational awareness. It would just be a fine-tune of a base model, and a dumb/small base model at that. Heck, it could even just be a prompted base model.
The better the paraphraser, the better the anti-steganography incentive. If the paraphraser itself is easily manipulable by adversarial inputs, then you’ll see some weird things like the CoT manipulating the paraphraser to get more useful cognition out of it rather than mere paraphrasing. This is undesirable.
So, I imagine that prompting a vanilla LLM would not be a great approach.
I think first you want to build a fairly good “semantic similarity score” (this shouldn’t be too hard), and then you want to fine-tune an LLM to encourage high-entropy paraphrasing (reward entropy, while heavily punishing a low semantic similarity score). This is similar to @CBiddulph’s idea here.
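A minimal sketch of the kind of reward shaping I mean, using pairwise dissimilarity among the sampled paraphrases as a crude proxy for the entropy term. The similarity function below is a cheap word-overlap stand-in just so the sketch runs; in practice you’d want a learned, embedding-based score, and the thresholds and weights are made up for illustration:

```python
from itertools import combinations

def semantic_similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]. In practice this would be a learned
    score (e.g. cosine similarity of sentence embeddings); word overlap is just
    a cheap placeholder so the sketch runs."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def paraphrase_reward(original: str, paraphrases: list[str],
                      fidelity_threshold: float = 0.9,
                      fidelity_weight: float = 10.0) -> float:
    """Reward diverse ('high-entropy') paraphrasing while heavily punishing any
    paraphrase whose meaning drifts from the original."""
    # Diversity proxy: how dissimilar the sampled paraphrases are from each other.
    pairs = list(combinations(paraphrases, 2))
    diversity = sum(1.0 - semantic_similarity(a, b) for a, b in pairs) / max(len(pairs), 1)

    # Fidelity penalty: a large hit whenever a paraphrase's similarity to the
    # original falls below the threshold.
    drift = sum(max(0.0, fidelity_threshold - semantic_similarity(original, p))
                for p in paraphrases) / len(paraphrases)

    return diversity - fidelity_weight * drift
```

The asymmetry is the point: the fidelity penalty dominates, so the paraphraser only gets credit for variety once it reliably preserves meaning.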
I agree that this could plausibly work with a relatively small model and therefore result in a relatively low alignment tax. However, note that on my view, this whole proposal only works as an alignment technique if you try to change your RL incentive structure when you see a deceptive mismatch between what the shoggoth says and what the face says. That is a significant alignment tax.