Can you say what types of problems they are?
You could call them logic puzzles. I do think most smart people on LW would get 10⁄10 without much trouble if they had enough time, although I've never tested this.
Assuming the problems are verifiable, or there is an easy way to check whether a solution works, I expect o3 to get at least 2⁄10, if not 3⁄10, correct under high-compute settings.