My reading of Appendix A is that the group did its own judging, i.e., did not submit answers to Codeforces.
They generated lots of human-verified test data, but then human implementers would do something similar.
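For what it's worth, local judging of this kind boils down to running each candidate program against the generated tests and comparing output. A minimal sketch (Python, with hypothetical names; not the paper's actual harness) would be something like:

    import subprocess

    def judge(candidate_source, test_cases, timeout=2.0):
        # test_cases is a list of (stdin_text, expected_stdout) pairs
        for stdin_text, expected in test_cases:
            try:
                result = subprocess.run(
                    ["python3", candidate_source],
                    input=stdin_text, capture_output=True,
                    text=True, timeout=timeout)
            except subprocess.TimeoutExpired:
                return False  # time limit exceeded
            if result.stdout.strip() != expected.strip():
                return False  # wrong answer on this test case
        return True  # passed all generated tests

    # e.g. judge("candidate.py", [("1 2\n", "3\n")])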
They trained on GitHub code, plus solution code from Codeforces. Did they train on Codeforces solutions to any of the problems they evaluated on? Without delving much deeper into the work, I cannot say. They do call out the fact that the generated solutions did not include chunks of copy-pasted code.
To what extent are the successes presented representative of the problems tried? That is, did they attempt lots of problems, with us seeing only the cases that worked well? The fact that they were able to get solutions to some problems was impressive.
The solved problems had short solutions. How well does the technique scale to problems whose solutions require more code? I suspect it doesn't, but then there are applications where solutions are often short.