This would definitely teach something, but I’m not sold that it actually teaches useful skills for detecting weaknesses in papers. Failures in real research are drawn from their own special distribution, which is very different from the sampling-from-GPT distribution.
Part of this can be blamed on GPT-3 not being smart enough: it doesn’t understand (e.g.) magnetic permeability, and so in trying to write about it, it will inevitably make mistakes that no human academic would. But even if our language model were smart enough to talk convincingly about magnetic permeability, the journal club members would still be stuck looking for tells that are inhuman, unless you’ve somehow assembled a dataset of genuinely bad papers to train a classifier on.
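To make the distribution-mismatch worry concrete, here’s a minimal sketch (Python, with toy stand-in abstracts I made up for illustration, not real data) of what a detector trained on model output actually ends up learning:

```python
# Minimal sketch (toy, hypothetical data): a classifier trained to separate
# human-written from model-generated abstracts learns "inhuman tells",
# not the features of genuinely flawed science.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins; a real exercise would use full paper texts.
human_abstracts = [
    "We measure the magnetic permeability of the alloy from 4 K to 300 K.",
    "Our small sample size (n=12) limits the statistical power of the result.",
]
generated_abstracts = [
    "The permeability of magnetism is measured by the magnet itself.",
    "Results confirm that the results are confirmed by the results.",
]

texts = human_abstracts + generated_abstracts
labels = [0] * len(human_abstracts) + [1] * len(generated_abstracts)

detector = make_pipeline(TfidfVectorizer(), LogisticRegression())
detector.fit(texts, labels)

# The decision boundary here tracks generation artifacts (repetition,
# incoherent phrasing), which is a different distribution from the subtle
# methodological flaws that sink real papers.
```

The point isn’t that this pipeline is any good; it’s that whatever signal it (or a grad student playing the same game) picks up is signal about the generator, not about bad science.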
I think doing this with real papers (ones that have failed to stand the test of time, though grad students probably won’t know that) is a lot better, because their mistakes are drawn from the distribution you actually need to learn. It also provides a richer supervised signal: you can learn not only that a paper was wrong, but also what process led to it having the contents it did, given that those contents didn’t reflect reality.
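As a sketch of what that richer label might look like (field names are purely illustrative, not from any existing dataset):

```python
# Illustrative record for a real-paper teaching example: the label is not
# just "wrong", but also the process that produced the wrong contents and
# the tell a careful reader could have caught.
from dataclasses import dataclass

@dataclass
class TeachingExample:
    citation: str          # the real paper, as published
    verdict: str           # e.g. "failed to replicate", "retracted"
    failure_process: str   # what led the authors astray
    tell: str              # what a careful reader could have noticed

example = TeachingExample(
    citation="(a since-retracted result; details omitted here)",
    verdict="failed to replicate",
    failure_process="post-hoc subgroup analysis after a null main result",
    tell="the headline effect appears only in an unplanned subgroup",
)
```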
A database of such teaching examples, submitted by professors, would be interesting but would probably get very contentious.