My probably contrarian take is that I don’t think improvement on a benchmark of math problems is particularly scary or relevant. It’s not nothing—I’d prefer if it didn’t improve at all—but it only makes me slightly more worried.
Can you say more about your reasoning for this?
About two years ago I made a set of 10 problems that IMO measure progress toward AGI and decided I’d freak out if/when LLMs solve them. They’re still 1/10 and nothing has changed in the past year, and I doubt o3 will do better. (But I’m not making them public.)
Will write a reply to this comment when I can test it.
Can you say what types of problems they are?
You could call them logic puzzles. I do think most smart people on LW would get 10/10 without too many problems, if they had enough time, although I’ve never tested this.
Assuming they’re verifiable, or there’s an easy way to check whether a solution works, I expect o3 to get at least 2/10, if not 3/10, correct under high-compute settings.
What’s the last model you did check with, o1-pro?
Just regular o1; I have the $20/month subscription, not the $200/month one.
Do you have a link to these?
What benchmarks (or other capabilities) do you see as more relevant, and how worried were you before?