This is true but I don’t think it really matters for eventual performance. If someone thinks about a problem for a month, the number of times they went wrong on reasoning steps during the process barely influences the eventual output. Maybe they take a little longer. But essentially performance is relatively insensitive to errors if the error-correcting mechanism is reliable.
I think this is actually a reason why most benchmarks are misleading (humans make mistakes there, and they influence the rating).
This is true but I don’t think it really matters for eventual performance. If someone thinks about a problem for a month, the number of times they went wrong on reasoning steps during the process barely influences the eventual output. Maybe they take a little longer. But essentially performance is relatively insensitive to errors if the error-correcting mechanism is reliable.
I think this is actually a reason why most benchmarks are misleading (humans make mistakes there, and they influence the rating).