For scientific purposes. People don’t really have time to review that many CoTs anyway, so 1 per day gets most of the value of what they’d realistically do. Plus they can target it at the stuff that’s suspicious. (Simple example: Suppose they get an impressive-seeming answer that later turns out to be a total BS hallucination. They then think “I wonder if the model was BSing me” and click “view CoT.” Then they see whether it was an innocent mistake or not.)