IIUC, the contest was scored only on time, not on correctness? Because correctness was verified by some pre-defined automatic tests? If so, how was Codex deployed solo? Did they just sample it many times on the same prompt until it produced something that passed the tests? Or something more sophisticated?
Also:
In all fairness, the competition paradigm was many-to-some — everyone faced the same five problems. So, Codex will have a rich set of differentiated prompts for the same set of problems. It might give the AI a learning edge (in the case of concurrent active learning).
This makes no sense to me. Do you assume solo-Codex exploited the prompts submitted by other competitors? Or that the assistant-Codexes communicated with each other somehow? I kinda doubt either of those happened.
The only correctness filters are the hidden test cases (as is standard in most competitive coding competitions). You can check the leaderboard—the positions correlate with the cumulative time taken to solve problems and the number of Codex assists. If there are any hidden metrics, I wouldn’t know.
If so, how was Codex deployed solo? Did they just sample it many times on the same prompt until it produced something that passed the tests? Or something more sophisticated?
They didn’t reveal this publicly. We can only guess here.
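If I had to guess, the simplest setup would be a sample-and-filter loop: draw many completions for the same problem prompt, run each against the visible example tests, and submit the first one that passes. A rough sketch of that idea (every name and detail here is my own guess, nothing OpenAI has confirmed):

```python
import subprocess
import sys

def run_candidate(source: str, stdin: str) -> str:
    """Run one candidate Python program on one test input and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin, capture_output=True, text=True, timeout=10,
    )
    return result.stdout.strip()

def solve_by_sampling(generate, example_tests, max_attempts=100):
    """Hypothetical sample-and-filter loop: `generate()` returns one fresh Codex
    completion for the fixed problem prompt, and `example_tests` is a list of
    (stdin, expected_stdout) pairs taken from the problem statement."""
    for _ in range(max_attempts):
        candidate = generate()
        if all(run_candidate(candidate, inp) == expected.strip()
               for inp, expected in example_tests):
            return candidate  # first sample that passes every visible example
    return None  # nothing passed within the sampling budget
```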
This makes no sense to me. Do you assume solo-Codex exploited the prompts submitted by other competitors? Or that the assistant-Codexes communicated with each other somehow? I kinda doubt either of those happened.
After I was done, I played around with Codex (from a new account). You could only use Codex in the editors within problems. In one of the problems, I cleared the editor and just put in a simple prompt (unrelated to the problem). I remember that in one of the assists, it actually generated the code for that specific problem. This is why I assumed there was some state saving, or context awareness.
Hmm, I suppose they might be combining the problem statement and the prompt provided by the user into a single prompt somehow, and feeding that to the network? Either that or they’re cheating :)
Yes, that’s what they did! (Emphasis on the “somehow”—details a mystery to me.) Some piece of intro text for the challenge explained that Codex would receive, as input, both the problem statement (which always included a handful of example input/output/explanation triplets) and the user’s current code up to their cursor.
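For what it’s worth, that format is easy to picture as one concatenated prompt. Here is my own rough reconstruction of what the assembly might have looked like (the layout and names are my assumptions, not anything the challenge documented):

```python
def build_codex_prompt(problem_statement, examples, code_before_cursor):
    """Assumed layout: problem text, then the example input/output/explanation
    triplets, then the user's code up to the cursor, left open so the
    completion continues from exactly that point."""
    parts = [problem_statement.strip(), ""]
    for i, (inp, out, why) in enumerate(examples, start=1):
        parts += [
            f"Example {i}:",
            f"Input: {inp}",
            f"Output: {out}",
            f"Explanation: {why}",
            "",
        ]
    parts.append(code_before_cursor)  # ends mid-file; the model picks up here
    return "\n".join(parts)
```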
There is no state saving or learning at test time. The prompts were prepended to the API calls; you could see it in the requests.