Buck comments on johnswentworth’s Shortform

Buck 27 Dec 2024 15:24 UTC
LW: 13 AF: 9
8
Ok, so it sounds like given 15-25 mins per problem (and maybe with 10 mins per problem), you get 80% correct. This is worse than o3, which scores 87.7%. Maybe you’d do better on a larger sample: perhaps you got unlucky (extremely plausible given the small sample size) or the extra bit of time would help (though it sounds like you tried to use more time here and that didn’t help). Fwiw, my guess from the topics of those questions is that you actually got easier questions than average from that set.
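To put a rough number on the "unlucky given the small sample size" point, here is a minimal sketch. It assumes each question is an independent trial with a fixed success probability (a simplification), and the sample sizes are hypothetical, since the number of questions actually attempted isn't stated here. It computes how often someone whose true accuracy matched o3's reported 87.7% would still score 80% or lower.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed by direct summation."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def prob_scoring_at_most_80_percent(n: int, true_accuracy: float) -> float:
    """Chance of getting at most 80% of n questions right at the given true accuracy."""
    k = (4 * n) // 5  # largest number correct that is still <= 80% of n
    return binom_cdf(k, n, true_accuracy)


if __name__ == "__main__":
    O3_ACCURACY = 0.877  # figure quoted in the comment above
    # Hypothetical sample sizes, chosen only for illustration.
    for n in (10, 20, 50, 100):
        p = prob_scoring_at_most_80_percent(n, O3_ACCURACY)
        print(f"n={n:3d}: P(score <= 80%) = {p:.3f}")
```

With only a handful of questions, an 80% score is not strong evidence of a true accuracy below 87.7%; the gap only becomes informative as the sample grows.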
I continue to think these LLMs will probably outperform (you with 30 mins). Unfortunately, the measurement is quite expensive, so I’m sympathetic to you not wanting to get to ground here. If you believe that you can beat them given just 5-10 minutes, that would be easier to measure. I’m very happy to bet here.
I think that even if it turns out you’re a bit better than LLMs at this task, we should note that it’s pretty impressive that they’re competitive with you given 30 minutes!
So I still think your original post is pretty misleading [ETA: with respect to how it claims GPQA is really easy].
I think the models would beat you by more on FrontierMath.
- johnswentworth 27 Dec 2024 19:12 UTC
  LW: 6 AF: 6
  2
  Even assuming you’re correct here, I don’t see how that would make my original post pretty misleading?
  - Buck 27 Dec 2024 19:54 UTC
    LW: 8 AF: 6
    2
    I think that how you talk about the questions being “easy”, and the associated stuff about how you think the baseline human measurements are weak, is somewhat inconsistent with you being worse than the model.
    - johnswentworth 27 Dec 2024 20:17 UTC
      LW: 9 AF: 7
      5
      I mean, there are lots of easy benchmarks on which I can solve the large majority of the problems, and a language model can also solve the large majority of the problems, and the language model can often have a somewhat lower error rate than me if it’s been optimized for that. Seems like GPQA (and GPQA diamond) are yet another example of such a benchmark.
      - Buck 28 Dec 2024 17:36 UTC
        LW: 2 AF: 3
        0
        What do you mean by “easy” here?