I wish they would tell us what the dark vs light blue means. Specifically, for the FrontierMath benchmark, the dark blue looks like it’s around 8% (rather than the light blue at 25.2%). Which, like, I dunno, maybe this is nitpicking, but 25% on FrontierMath seems like a BIG deal, and I’d like to know how much to be updating my beliefs.
From an apparent author on reddit: “[Frontier Math is composed of] 25% T1 = IMO/undergrad style problems, 50% T2 = grad/qualifying exam style problems, 25% T3 = early researcher problems.”
The comment was responding to a claim that Terence Tao said he could only solve a small percentage of questions, but Terence was only sent the T3 questions.
My random guess is:
The dark blue bar corresponds to the testing conditions under which the previous SOTA was 2%.
The light blue bar doesn’t cheat (e.g. it doesn’t run the model many times and count the problem as solved if any one of those runs gets it right), but it spends more compute than one would realistically spend (e.g. more than it would cost to pay a mathematician to solve the problem), perhaps by running the model 100 to 1000 times and then having the model look at all the runs and try to figure out which run had the most compelling-seeming reasoning.
The FrontierMath answers are numerical-ish (“problems have large numerical answers or complex mathematical objects as solutions”), so you can just check which answer the model wrote most frequently.
Yeah, I agree that that could work. I (weakly) conjecture that they would get better results by doing something more like the thing I described, though.
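To make the two selection strategies discussed above concrete (neither is confirmed by OpenAI; this is just a minimal sketch), here is what they might look like, assuming a hypothetical sample_solution(problem) that returns (chain_of_thought, final_answer) for one run of the model, and a hypothetical judge_best(problem, runs) that asks the model which run’s reasoning looks most compelling:

```python
from collections import Counter

def solve_by_majority_vote(problem, sample_solution, n=100):
    # Run the model n times and submit whichever final answer appears most
    # often. This works because FrontierMath answers are concrete objects
    # (large numbers etc.) that can be compared for equality.
    answers = [sample_solution(problem)[1] for _ in range(n)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

def solve_by_self_selected_run(problem, sample_solution, judge_best, n=100):
    # Run the model n times, then have the model itself pick the run whose
    # reasoning looks most compelling, and submit that run's final answer.
    runs = [sample_solution(problem) for _ in range(n)]
    best_index = judge_best(problem, runs)
    return runs[best_index][1]
```

Either way, only one answer per problem is actually submitted, so the extra test-time compute is spent without any pass@k-style cheating.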
On the livestream, Mark Chen says the 25.2% was achieved “in aggressive test-time settings”. Does that just mean more compute?
It likely means running the AI many times and submitting the most common answer from the AI as the final answer.
Extremely long chain of thought, no?
I guess one thing I want to know is like… how exactly does the scoring work? I can imagine something like: they ran the model a zillion times on each question, and if any one of the answers was right, that got counted in the light blue bar. Something quite that silly probably isn’t what happened, but it could be something similar.
If it actually just submitted one answer to each question and got a quarter of them right, then I think it doesn’t particularly matter to me how much compute it used.
It was one submission per question (pass@1), apparently.
Thanks. Is “pass@1” some kind of lingo? (It seems like an ungoogleable term.)
Pass@k means that at least one of k attempts passes, according to an oracle verifier. Evaluating with pass@k is cheating when k is not 1 (though still interesting to observe); the non-cheating option is best-of-k, where the system needs to pick out the best attempt on its own. So saying pass@1 means you are not cheating in evaluation in this way.
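A minimal sketch of the distinction, with oracle(attempt) standing in for an oracle verifier and select(attempts) for the system’s own selection rule (both hypothetical names, not any particular benchmark’s API):

```python
def pass_at_k(attempts, oracle, k):
    # Pass@k: solved if ANY of the first k attempts passes the oracle
    # verifier -- the verifier does the selecting, hence "cheating" for k > 1.
    return any(oracle(a) for a in attempts[:k])

def best_of_k(attempts, oracle, select, k):
    # Best-of-k: the system commits to ONE attempt using its own rule
    # (majority vote, self-judging, ...), and only that attempt is graded.
    chosen = select(attempts[:k])  # select() never sees the oracle
    return oracle(chosen)
```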
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
For coding, a problem statement won’t have exhaustive formal requirements handed to the solver; only evals and formal proofs can be expected to have adequate oracle verifiers. If you do have an oracle verifier, you can just wrap the system in it and call it pass@1. The affordance to reliably verify helps in training (where the verifier is applied externally), but not in taking the tests (where the system taking the test doesn’t itself have a verifier on hand).
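For illustration, a sketch of that wrapping, with hypothetical generate(problem) and verify(problem, attempt) functions; the wrapped system returns a single answer, so from the outside it scores as pass@1 even though the verifier did the real work:

```python
def verifier_wrapped_solver(problem, generate, verify, max_tries=100):
    # Keep sampling until an attempt passes the oracle verifier, then submit
    # that one attempt. Externally this looks like pass@1; it only works when
    # a reliable verifier is available at test time, which ordinary coding
    # problem statements generally don't provide.
    attempt = None
    for _ in range(max_tries):
        attempt = generate(problem)
        if verify(problem, attempt):
            return attempt
    return attempt  # nothing verified; submit the last attempt anyway
```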