On the livestream, Mark Chen says the 25.2% was achieved “in aggressive test-time settings”. Does that just mean more compute?
It likely means running the AI many times and submitting the most common answer from the AI as the final answer.
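To make that guess concrete, here is a minimal sketch of majority voting over sampled answers. It only illustrates the idea in the comment above, not a confirmed description of how the 25.2% run was scored; `majority_vote` and the sample answers are made up for illustration.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer among the sampled attempts."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns [(answer, count)]; on a tie, the answer that
    # was seen first wins.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from five independent runs on one question.
samples = ["42", "17", "42", "42", "9"]
final_answer = majority_vote(samples)  # the single answer that gets submitted
```

Note that this still counts as one submission per question: the grader only ever sees `final_answer`, however many runs went into choosing it.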
Extremely long chain of thought, no?
I guess one thing I want to know is like… how exactly does the scoring work? I can imagine something like, they ran the model a zillion times on each question, and if any one of the answers was right, that got counted in the light blue bar. Something that plainly silly probably isn’t what happened, but it could be something similar.
If it actually just submitted one answer to each question and got a quarter of them right, then I think it doesn’t particularly matter to me how much compute it used.
It was one submission, apparently.
Thanks. Is “pass@1” some kind of lingo? (It seems like an ungoogleable term.)
Pass@k means that at least one of k attempts passes, according to an oracle verifier. Evaluating with pass@k is cheating when k is not 1 (though still interesting to observe); the non-cheating option is best-of-k, where the system has to pick out the best attempt on its own. So saying pass@1 means you are not cheating in evaluation in this way.
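A rough sketch of the difference, assuming a hypothetical grader `is_correct` (the oracle) and a hypothetical selection rule `most_common` that the system applies on its own; all names and data here are illustrative only.

```python
def pass_at_k(attempts, is_correct):
    """Oracle scoring: credit if ANY of the k attempts is correct."""
    return any(is_correct(a) for a in attempts)


def best_of_k(attempts, is_correct, select):
    """The system picks one attempt itself; only that pick is graded."""
    return is_correct(select(attempts))


def is_correct(answer):
    # Hypothetical ground truth, known only to the grader.
    return answer == "7"


def most_common(attempts):
    # Hypothetical selection rule: submit the most frequent answer.
    return max(set(attempts), key=attempts.count)


attempts = ["12", "15", "12", "7"]  # k = 4 made-up attempts

oracle_score = pass_at_k(attempts, is_correct)                 # one attempt is right
selected_score = best_of_k(attempts, is_correct, most_common)  # but "12" gets picked
single_score = pass_at_k(attempts[:1], is_correct)             # pass@1: first attempt only
```

Here pass@4 gives credit while best-of-4 and pass@1 do not, which is the gap the comment is pointing at.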
pass@n is not cheating if answers are easy to verify. E.g. if you can cheaply/quickly verify that code works, pass@n is fine for coding.
For coding, a problem statement won’t come with exhaustive formal requirements that get handed to the solver; only evals and formal proofs can be expected to have adequate oracle verifiers. If you do have an oracle verifier, you can just wrap the system in it and call it pass@1. The affordance to reliably verify helps in training (where the verifier is applied externally), but not in taking the tests (where the system taking the test doesn’t itself have a verifier on hand).
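As an illustration of "wrap the system in it", here is a sketch that assumes the solver really does have a reliable `verify` function on hand. `solve_with_verifier`, `generate`, and `verify` are hypothetical names, and the toy problem is made up.

```python
def solve_with_verifier(generate, verify, k):
    """Sample up to k candidates and submit the first one that verifies."""
    candidate = None
    for _ in range(k):
        candidate = generate()
        if verify(candidate):
            return candidate
    # Nothing verified: fall back to the last attempt (None if k == 0).
    return candidate


# Toy problem: find x with x * x == 49. The generator stands in for a model.
candidates = iter([5, 6, 7, 8])


def generate():
    return next(candidates)


def verify(x):
    return x * x == 49


submission = solve_with_verifier(generate, verify, k=4)  # one answer goes to the grader
```

From the outside this is a single submission, so it scores as pass@1; the catch raised above is that on most tests the system has no such `verify` to call.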