They chuck out larger files from GitHub (1MB+), as well as files with lines longer than 1000 characters, to exclude automatically generated code. I’m guessing the former is because programs that are too long just aren’t useful when your model’s context window is tiny in comparison. They also got rid of duplicates. Plus, they needed to avoid code published after the questions in the dataset were made, to avoid leaking the answers. A rough sketch of that kind of filtering is below.
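For concreteness, here’s a minimal sketch of what filtering like that might look like. The 1MB and 1000-character thresholds are the ones mentioned above; the function shape, the hashing-based dedup, and the `cutoff` date handling are my own illustrative assumptions, not AlphaCode’s actual pipeline.

```python
from datetime import datetime

# Sketch of the filtering described above; not the paper's actual pipeline.
MAX_FILE_BYTES = 1_000_000   # drop files over ~1MB
MAX_LINE_CHARS = 1000        # drop files with very long lines (likely auto-generated)

def keep_file(source: str, published: datetime, cutoff: datetime, seen_hashes: set) -> bool:
    """Return True if a file passes the size, line-length, date, and dedup filters."""
    if len(source.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    if any(len(line) > MAX_LINE_CHARS for line in source.splitlines()):
        return False
    if published >= cutoff:          # skip code written after the benchmark problems
        return False
    h = hash(source)                 # crude exact-duplicate check (assumption)
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    return True
```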
As to natural language training, I suppose I’d agree that it is some evidence against the claim. But it would be strong evidence if they had also trained AlphaCode on MassiveText and found little to no performance increase. I wouldn’t be surprised if it didn’t do much, though.
Edit: Oh, I just saw Table 7. Yeah, that’s pretty strong evidence against the claim that natural language corpora are anywhere near as useful as code corpora for this kind of stuff. But I am a little surprised that natural language pre-training added as much as it did.