Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.
One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.
A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won’t help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn’t avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.
Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can’t be vulnerable to steering by adversarial inputs.
Process supervision would be helpful, particularly if the model is punished for sensitive dependence on paraphrasings of the CoT, or rewarded for producing CoT which is useful even after paraphrasing.
One of the concerns that I have is that (I believe) the LLM before the CoT-training will be sensitive to some non-human-readable patterns, similar to adversarial examples for image recognition. The CoT training will encourage any patterns which help reach the desired conclusions, whether those patterns are human-readable or not.
A lot of possible process-supervision procedures will not do anything to help prevent this. For example, rewarding CoT for being human-understandable won’t help, because the reward signal there can miss the subtle patterns being used. It only avoids the outcome where the CoT looks non-human-readable; it doesn’t avoid the outcome where the CoT actually contains a bunch of human-unreadable patterns.
Using paraphrasing to avoid this is heavily dependent on the strength of the paraphraser. The paraphraser needs to have really good coverage of alternative ways to convey the same semantic information, so that any non-human-readable patterns get randomized. The paraphraser itself can’t be vulnerable to steering by adversarial inputs.