Assuming they are verifiable, or there is an easy way to check whether a solution works, I expect o3 to get at least 2⁄10, if not 3⁄10, correct under high-compute settings.