> key discrepancies in the explanations that lead models to support the biased answer instead of the correct answer in many cases come near the end of the explanation
That’s interesting. Any idea why it’s likelier to have the invalid reasoning step (that allows the biased conclusion) towards the end of the CoT rather than right at the start?
Towards the end it’s easier to see how to change the explanation in order to get the ‘desired’ answer.