Progress on competitive coding seems impressive. Otherwise I am having trouble evaluating it; it seems slightly better than GPT-4 at most tasks, and it is multi-modal. Tentatively, this looks like significant progress.
The progress in competitive programming seems to be miscalculated in a way that makes AlphaCode 2 appear better than it is. It:

1. Samples 1e6 solutions
2. Of all the solutions that pass the given test cases, picks the 10 with the best “score”
3. Submits up to 10 of those solutions until one of them passes
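To make the comparison with a human competitor concrete, here is a minimal sketch of that three-step pipeline. Every name in it (`sample_solution`, `passes_given_tests`, `score`, `submit`) is a hypothetical stand-in; the report does not expose any such API.

```python
# Minimal sketch of the sample -> filter -> rank -> submit loop described above.
# All functions are hypothetical placeholders passed in by the caller.

def solve(problem, sample_solution, passes_given_tests, score, submit,
          n_samples=1_000_000, max_submissions=10):
    # Step 1: sample a large pool of candidate programs.
    candidates = [sample_solution(problem) for _ in range(n_samples)]

    # Step 2: keep only candidates that pass the *given* (public) tests,
    # then take the 10 best according to the fine-tuned scoring model.
    passing = [c for c in candidates if passes_given_tests(problem, c)]
    best = sorted(passing, key=score, reverse=True)[:max_submissions]

    # Step 3: submit up to 10 candidates until one passes the hidden tests.
    # A human would accrue wrong-answer penalties for each failed submission.
    for candidate in best:
        if submit(problem, candidate):
            return candidate
    return None
```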
Steps 1 and 2 seem fine, but a human competitor in one of these contests would be penalized for step 3, which AlphaCode 2 appears not to be[1]. Further, training-set contamination, combined with the fact that these are only “easier” Div 2 problems, means that solutions to these problems could very well appear in the training data, in which case AlphaCode 2 is just reconstructing a known solution near verbatim.
In defense of AlphaCode 2, the fine-tuned scoring model that picks the 10 best might be a non-trivial creation. It also seems AC2 is more sample-efficient than AC1, so it is getting better at generating solutions. Assuming nothing funky is happening with the training set, at the limit this means one correct solution per sample.
I could be wrong, but if I am, the paper should have made it more explicit.
It seems to do something similar to Gato, where everything is just serialized into tokens, which is pretty cool.
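A rough sketch of what “everything serialized into tokens” could look like. This is my guess at the general pattern (interleave per-modality token IDs into one flat sequence), not Gemini's actual tokenization; the vocab sizes and tokenizers below are made-up placeholders.

```python
import random

# Hypothetical vocab layout: text IDs in [0, 32_000), image IDs offset above them.
TEXT_VOCAB = 32_000
IMAGE_CODEBOOK = 8_192  # size of an assumed learned visual codebook (VQ-style)

def tokenize_text(s):
    # Stand-in for a real text tokenizer.
    return [hash(w) % TEXT_VOCAB for w in s.split()]

def tokenize_image(img, n_patches=256):
    # Stand-in for a learned image tokenizer: one codebook ID per patch,
    # offset past the text vocab so the two ID spaces don't collide.
    return [TEXT_VOCAB + random.randrange(IMAGE_CODEBOOK) for _ in range(n_patches)]

def serialize(segments):
    # Flatten mixed (kind, payload) segments into one token stream that a
    # single decoder-only transformer can attend over.
    out = []
    for kind, payload in segments:
        out += tokenize_text(payload) if kind == "text" else tokenize_image(payload)
    return out

tokens = serialize([("text", "describe this image:"), ("image", "<pixels>")])
```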
I wonder whether they are just using a standard transformer for everything, or some sort of diffusion model for the images inside the model?
What does it mean for perception to compress a frame of video to 1k tokens? What kind of information gets lost when you do this?
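To get a feel for how aggressive that compression is, a back-of-the-envelope calculation. The 1k-token figure is from above; the frame resolution is my assumption for illustration.

```python
# Rough compression arithmetic; 720p RGB is an assumed frame size.
height, width, channels = 720, 1280, 3
raw_values = height * width * channels      # 2,764,800 scalar values per frame
tokens = 1_000
values_per_token = raw_values / tokens      # ~2,765 raw values folded into each token
print(f"{raw_values:,} raw values -> {tokens} tokens "
      f"(~{values_per_token:,.0f} values per token)")
```

Whatever the exact mechanism, each token has to summarize thousands of raw pixel values, so presumably fine texture and other high-frequency detail is the first thing to go.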