About two years ago I made a set of 10 problems that imo measure progress toward AGI and decided I’d freak out if/when LLMs solve them. They’re still 1⁄10 and nothing has changed in the past year, and I doubt o3 will do better. (But I’m not making them public.)
Will write a reply to this comment when I can test it.
DeepSeek gets 2⁄10.
I’m pretty shocked by this result, less because of the 2⁄10 number itself than because of which problem it solved. My P(LLMs can scale to AGI) increased significantly, although not to 50%.
o3-mini-high gets 3⁄10; this is essentially the same result as DeepSeek’s (there were two problems where DeepSeek came very close, and this is one of them). I’m still slightly more impressed with DeepSeek despite the score, but it’s very close.
What score would it take for you to update your p(LLMs scale to AGI) above 50%?
Tricky to answer actually.
I can say more about my model now. The way I’d put it (h/t Steven Byrnes) is that there are three interesting classes of capabilities:
A: sequential reasoning of any kind
B: sequential reasoning on topics where steps aren’t easily verifiable
C: the type of thing Steven mentions here, like coming up with new abstractions/concepts to integrate into your vocabulary to better think about something
Among these, B is obviously a subset of A. And while it’s not obvious, I think C is probably best viewed as a subset of B. Regardless, I think all three are required for what I’d call AGI. (This is also how I’d justify the claim that no current LLM is AGI.) Maybe C isn’t strictly required; I could imagine a mind getting superhuman performance without it, but given how LLMs work otherwise, I don’t think that’s happening.
Up until DeepSeek, I would have also said LLMs are terrible at A. (This is probably a hot take, but I genuinely think it’s true despite benchmark performance continuing to go up.) My tasks were designed to test A, with the hypothesis that LLMs will suck at A indefinitely. For a while, it seemed like people weren’t even focusing on A, which is why I didn’t want to talk about it. But this concern is no longer applicable; the new models are clearly focused on improving sequential reasoning. However, o1 was terrible at it (imo), with almost no improvement over GPT-4 proper, so I actually found o1 reassuring.
This has now mostly been falsified with DeepSeek and o3. (I know the numbers don’t really tell the story, since the score just went from 1 to 2, but based on which problems they solved and how they argue, DeepSeek was where I went “oh shit, they can actually do legit sequential reasoning now”.) Now I’m expecting most of the other tasks to fall as well, so I won’t make similar updates if the score goes to 5⁄10 or 8⁄10. The hypothesis “A is an insurmountable obstacle” can only be falsified once.
That said, it still matters how fast they improve. How much it matters depends on whether you think better performance on A is progress toward B/C. I’m still not sure about this; I’m changing my views a lot right now, so idk. If they score 10⁄10 in the next year, my p(LLMs scale to AGI) will definitely go above 50%, and probably if they do it within 3 years as well, but that’s about the only thing I’m sure about.
Any chance you can post (or PM me) the three problems AIs have already beaten?
Can you say what types of problems they are?
You could call them logic puzzles. I do think most smart people on LW would get 10⁄10 without too much trouble, given enough time, although I’ve never tested this.
Assuming the problems are verifiable, or there’s an easy way to check whether a solution works, I expect o3 to get at least 2⁄10, if not 3⁄10, under high-compute settings.
What’s the last model you checked with, o1-pro?
Just regular o1; I have the $20/month subscription, not the $200/month one.
Do you have a link to these?