If this task is bad for operationalization reasons, there are other theorem proving benchmarks. Unfortunately, as far as I’m aware, not many people are currently trying to improve on the known benchmarks.
The code generation benchmarks are slightly more active. I’m personally partial to Hendrycks et al.’s APPS benchmark, which includes problems that “range in difficulty from introductory to collegiate competition level and measure coding and problem-solving ability” (GitHub link).