I no longer consider agents with superhuman performance in competitive programming to be a ridiculous thing to pursue.
Dan Hendrycks and Steven Basart et al. recently released APPS, a benchmark for measuring the performance of ML models at writing code. One part of the benchmark measures model performance on competitive programming problems. I wrote a Metaculus question on when people expect this benchmark to be solved—operationalized as getting above 80% strict accuracy on the competitive programming section.
Initial results are encouraging. GPT-Neo 2.7B passes nearly 20% of test cases on average for introductory coding problems, when the model is allowed to give 5 attempts (see Table 4 in the paper). A fine-tuned GPT-J-6B is likely to be even better.
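To make the two metrics concrete: the paper's "test case average" is the mean fraction of test cases a generated program passes, while strict accuracy credits a problem only if *every* test case passes—which is why strict accuracy numbers are much lower. A minimal sketch of the difference (the pass/fail results here are made up for illustration):

```python
# Hypothetical per-problem results: which test cases the generated code passed.
# In the real benchmark these come from executing the model's output.
results = [
    [True, True, False],    # problem 1: 2/3 test cases pass
    [True, True, True],     # problem 2: all pass
    [False, False, False],  # problem 3: none pass
]

# Test case average: mean per-problem fraction of passing test cases.
test_case_avg = sum(sum(r) / len(r) for r in results) / len(results)

# Strict accuracy: a problem counts only if all its test cases pass.
strict_acc = sum(all(r) for r in results) / len(results)

print(f"test case average: {test_case_avg:.2f}")  # 0.56
print(f"strict accuracy:   {strict_acc:.2f}")     # 0.33
```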
The APPS repository also provides the fine-tuned weights for GPT-Neo 2.7B and code to run it, though without a GPU it takes roughly forever.
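If you want to poke at it yourself, something along these lines should work with the `transformers` library. The checkpoint path is a placeholder for wherever you unpack the APPS weights (the base model is `EleutherAI/gpt-neo-2.7B` on the Hugging Face hub), and the prompt format is only my rough approximation of the one used for fine-tuning:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the unpacked fine-tuned weights from the
# APPS repository, or use "EleutherAI/gpt-neo-2.7B" for the base model.
checkpoint = "path/to/apps-finetuned-gpt-neo-2.7B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
model.eval()

# Approximate prompt format; the repo's generation script has the exact one.
prompt = "QUESTION:\nGiven an integer n, print the sum 1 + 2 + ... + n.\nANSWER:\n"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():  # slow on CPU, hence "roughly forever" without a GPU
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.8,
        pad_token_id=tokenizer.eos_token_id,
    )

# Print only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:]))
```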
I asked Dan Hendrycks on the EleutherAI Discord about GPT-J-6B's performance on APPS. He didn't say they were definitely going to test it, but my take-away was that it might happen.
I could imagine test-driven automated programming evolving in the next ten to twenty years, where an LM-guided search tries to create functions that match a description and pass all the test cases.
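The crudest version of that loop is easy to sketch: sample candidate programs from a language model, run each against the test suite, and return the first one that passes everything. The `generate_candidate` callable and the `solution` entry-point name below are hypothetical stand-ins, not anything from the APPS codebase:

```python
from typing import Callable, List, Optional, Tuple

def search_for_function(
    generate_candidate: Callable[[str], str],  # hypothetical LM call: description -> source code
    description: str,
    tests: List[Tuple[tuple, object]],         # (args, expected_result) pairs
    max_attempts: int = 100,
) -> Optional[str]:
    """Naive LM-guided search: sample programs until one passes every test."""
    for _ in range(max_attempts):
        source = generate_candidate(description)
        namespace: dict = {}
        try:
            exec(source, namespace)            # define the candidate function
            fn = namespace["solution"]         # assumed entry-point name
            if all(fn(*args) == expected for args, expected in tests):
                return source                  # first candidate passing all tests
        except Exception:
            continue                           # syntax or runtime failure: resample
    return None
```

A real system would sandbox the execution and use something smarter than independent resampling—feeding failing test cases back into the prompt, say—but the skeleton is the same: the test suite turns a fuzzy generation problem into a checkable search problem.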