wassname comments on the case for CoT unfaithfulness is overstated

wassname 3 Jan 2025 7:05 UTC
1 point
0
Well, they did this with o3’s deliberative alignment paper. The results seem promising, but they used an “easy” OOD test for LLMs (language), and didn’t compare it to the existing baseline of RLHF. Still an interesting paper.