Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face.
I feel like you have a bit of a credit-assignment problem here. I think it should be straightforward to patch, but it seems worth being explicit about it.
If the final answer from Face is factually correct, is that because Shoggoth did good reasoning or Face did good reasoning? Similarly, if the answer is rated highly, is it because Face put a pretty spin on it or because Shoggoth framed it nicely in the CoT...
I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth’s contribution vs Face’s contribution.
A different way to solve this would be to have a factual-accuracy checker that didn’t respond negatively to rude CoT, and just graded the Shoggoth on pure correctness. Then had a human-reaction-emulator do the grading for Face (using only correct CoT completions saved from Shoggoth’s training).
I feel like you have a bit of a credit-assignment problem here. I think it should be straightforward to patch, but it seems worth being explicit about it.
If the final answer from Face is factually correct, is that because Shoggoth did good reasoning or Face did good reasoning? Similarly, if the answer is rated highly, is it because Face put a pretty spin on it or because Shoggoth framed it nicely in the CoT...
I think the way around this is to make multiple roll-outs per model per problem. Get n different CoTs from Shoggoth, then for each of those get k different summarizations from Face. You then have n*k final answers. Optimal values for n and k probably depend on how expensive the roll-outs are. This population of answers allows you to usefully get a feedback signal about Shoggoth’s contribution vs Face’s contribution.
A different way to solve this would be to have a factual-accuracy checker that didn’t respond negatively to rude CoT, and just graded the Shoggoth on pure correctness. Then had a human-reaction-emulator do the grading for Face (using only correct CoT completions saved from Shoggoth’s training).