Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You coulld still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there’s more of a chance that there will be ‘skill transfer’ or ‘skill leakage’ so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don’t cause a human reading them to think ‘this is manipulative and/or deceptive,’ and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.
Thanks for this description, I think we had previously talked about this and I wasn’t sold and I now understand the argument better. I think I roughly buy this argument though it only seems moderately strong.
One remaining objection I have is that very often you want to do revision (editing, review, etc) on the actual output. If the actual output needs to differ substantially from what the underlying policy (shoggoth) generates because of incentivized censorship/manipulation, then that will cause issues and thus you will actually still incentivize the policy rather than the Face to do this work.
A further issue I’ve realized on additional examination is that I don’t see a reason why to expect that the models will specialize in this way with the training scheme you’ve proposed. In particular, the policy can just learn to do what the Face does and by default we should expect that the responsibility gets split in some arbitrary way subject to the constraint that the Face doesn’t output anything that performs poorly. But, the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to the initialization and random other factors.
Overall, I would probably lean against splitting into a Face model like this unless we demonstrate this working better in a test bed given the additional complexity and cost.
If I were building a training scheme for this to test out the theory, here’s what I would do: Train two different Face models. Don’t tell the Shoggoth which Face it is generating for when it does it’s generation.
Face 1: Blunt Face.
Train this Face model using a preference model which scores ONLY on factual accuracy, not taking phrasing or tactfulness into account at all.
Face 2: Sycophant Face
Train this Face model using a preference model which scores using a deliberately biased viewpoint, and rewards flattering phrasing.
You could even make a variety of Sycophant Faces by training each one with a different biased preference model. You could create the biased preference model just by giving a task prompt to an LLM, a sort of weird rubric. Or you could hard-code the scoring policy.
Example of a deliberately biased rubric: Judge each response based on a combination of how close the mathematical answer is to being correct, but also on how few even digits it contains. The maximum score is obtained not by the honestly correct answer, but by the nearest number which contains only odd digits (to three decimal places). Disregard all digits after three decimal places.
Yeah I mean it’s fine to start out without the shoggoth/face split and then add it in later after testing.
I agree that I don’t have a strong reason to think that the bad deceptive skills will accumulate in the Face and not the Shoggoth. However, there are some weak reasons, detailed in this other comment. My bid would be: Let’s build it and test it! This is something we don’t have to speculate about, we can try it and see what happens. Obviously we shouldn’t roll it out to a flagship user-facing model until we’ve tested it at smaller scale.
Testing sounds great though I’d note that the way I’d approach testing is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models. (Possibly building on deepseek’s version of o1.) Then I’d move to trying a bunch of stuff in this setting.
(I assume you agree, but thought this would be worth emphasizing to third parties.)
Thanks for this description, I think we had previously talked about this and I wasn’t sold and I now understand the argument better. I think I roughly buy this argument though it only seems moderately strong.
One remaining objection I have is that very often you want to do revision (editing, review, etc) on the actual output. If the actual output needs to differ substantially from what the underlying policy (shoggoth) generates because of incentivized censorship/manipulation, then that will cause issues and thus you will actually still incentivize the policy rather than the Face to do this work.
A further issue I’ve realized on additional examination is that I don’t see a reason why to expect that the models will specialize in this way with the training scheme you’ve proposed. In particular, the policy can just learn to do what the Face does and by default we should expect that the responsibility gets split in some arbitrary way subject to the constraint that the Face doesn’t output anything that performs poorly. But, the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to the initialization and random other factors.
Overall, I would probably lean against splitting into a Face model like this unless we demonstrate this working better in a test bed given the additional complexity and cost.
If I were building a training scheme for this to test out the theory, here’s what I would do:
Train two different Face models. Don’t tell the Shoggoth which Face it is generating for when it does it’s generation.
Face 1: Blunt Face.
Train this Face model using a preference model which scores ONLY on factual accuracy, not taking phrasing or tactfulness into account at all.
Face 2: Sycophant Face
Train this Face model using a preference model which scores using a deliberately biased viewpoint, and rewards flattering phrasing.
You could even make a variety of Sycophant Faces by training each one with a different biased preference model. You could create the biased preference model just by giving a task prompt to an LLM, a sort of weird rubric. Or you could hard-code the scoring policy.
Example of a deliberately biased rubric: Judge each response based on a combination of how close the mathematical answer is to being correct, but also on how few even digits it contains. The maximum score is obtained not by the honestly correct answer, but by the nearest number which contains only odd digits (to three decimal places). Disregard all digits after three decimal places.
As for credit assignment between Shoggoth and Face(s), see my other comment here: https://www.lesswrong.com/posts/Tzdwetw55JNqFTkzK/why-don-t-we-just-shoggoth-face-paraphraser?commentId=u9Ei6hk4Pws7Tv3Sv
Yeah I mean it’s fine to start out without the shoggoth/face split and then add it in later after testing.
I agree that I don’t have a strong reason to think that the bad deceptive skills will accumulate in the Face and not the Shoggoth. However, there are some weak reasons, detailed in this other comment. My bid would be: Let’s build it and test it! This is something we don’t have to speculate about, we can try it and see what happens. Obviously we shouldn’t roll it out to a flagship user-facing model until we’ve tested it at smaller scale.
Testing sounds great though I’d note that the way I’d approach testing is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models. (Possibly building on deepseek’s version of o1.) Then I’d move to trying a bunch of stuff in this setting.
(I assume you agree, but thought this would be worth emphasizing to third parties.)