Alex_Altair comments on o3

Alex_Altair 20 Dec 2024 20:50 UTC
4 points
0
I guess one thing I want to know is like… how exactly does the scoring work? I can imagine something like, they ran the model a zillion times on each question, and if any one of the answers was right, that got counted in the light blue bar. Something that plainly silly probably isn’t what happened, but it could be something similar.
If it actually just submitted one answer to each question and got a quarter of them right, then I think it doesn’t particularly matter to me how much compute it used.
- Zach Stein-Perlman 20 Dec 2024 20:52 UTC
  4 points
  0
  Parent
  It was one submission, apparently.
  - Alex_Altair 20 Dec 2024 21:13 UTC
    3 points
    0
    Parent
    Thanks. Is “pass@1” some kind of lingo? (It seems like an ungoogleable term.)
    - Vladimir_Nesov 20 Dec 2024 21:39 UTC
      7 points
      2
      Parent
      Pass@k means that at least one of k attempts passes, according to an oracle verifier. Evaluating with pass@k is cheating when k is not 1 (though still interesting to observe); the non-cheating option is best-of-k, where the system needs to pick out the best attempt on its own. So saying pass@1 means you are not cheating in evaluation in this way.
      - Zach Stein-Perlman 20 Dec 2024 21:51 UTC
        4 points
        −2
        Parent
        pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
        - Vladimir_Nesov 20 Dec 2024 22:45 UTC
          8 points
          2
          Parent
          For coding, a problem statement won’t come with exhaustive formal requirements that are handed to the solver; only evals and formal proofs can be expected to have adequate oracle verifiers. If you do have an oracle verifier, you can just wrap the system in it and call it pass@1. The ability to verify reliably helps in training (where the verifier is applied externally), but not in taking the tests (where the system taking the test doesn’t itself have a verifier on hand).