Automated plagiarism detection software is common. But cases like the recent incident with Harvard's president, Claudine Gay, have shown that egregious cases of plagiarism are still being uncovered. Why would this be the case? Is it really so hard to run plagiarism checks on every paper on Sci-Hub? Has anyone tried?
I am curious because I am currently upskilling for technical alignment research, and this seems like an interesting project to pursue.
“The total size of Sci-Hub database is about 100 TB.”
i.e. $1000-$2000 in drive space, or $20 / day to store on Backblaze if you don’t anticipate needing it for more than a couple of months tops.
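For what it’s worth, the back-of-envelope math behind those figures (assuming ~$10-20/TB for bulk hard drives and B2-style object storage at ~$6/TB/month, both of which may be a bit off):

100 TB × $10-20/TB ≈ $1,000-$2,000 in drives
100 TB × $6/TB/month = $600/month ≈ $20/day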
You’re correct that simply storing the entire database isn’t infeasible. But as I understand it, it’s large enough that training a model on it would be too expensive for most hobbyists to do just for kicks.
Depends on how big of a model you’re trying to train, and how you’re trying to train it.
I was imagining something along the lines of: download the full 100 TB torrent, which includes 88M articles; extract the text of each article (“extract text from a given PDF” isn’t super reliable, but it should be largely doable), which should leave you somewhere in the ballpark of 4 TB of uncompressed plain text; if you’re using a BPE tokenizer, that works out to roughly 1T tokens.
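To make that concrete, here’s a minimal sketch of the extraction step. It assumes PyMuPDF for the PDF parsing and a hypothetical scihub_dump/ directory of already-unpacked PDFs; the ~4 bytes per token figure is the usual rule of thumb for English text under a BPE tokenizer, not something measured on this corpus.

```python
# Rough sketch of the extraction step, assuming PyMuPDF (pip install pymupdf)
# and a hypothetical scihub_dump/ directory of already-unpacked PDFs.
from pathlib import Path

import fitz  # PyMuPDF


def extract_plain_text(pdf_path: Path) -> str:
    """Concatenate the text layer of every page; returns '' if extraction fails."""
    try:
        with fitz.open(pdf_path) as doc:
            return "\n".join(page.get_text() for page in doc)
    except Exception:
        # Extraction fails on a nontrivial fraction of real-world PDFs; just skip those.
        return ""


corpus_dir = Path("scihub_dump/")
total_bytes = 0
for pdf in corpus_dir.rglob("*.pdf"):
    text = extract_plain_text(pdf)
    total_bytes += len(text.encode("utf-8"))
    pdf.with_suffix(".txt").write_text(text, encoding="utf-8")

# Rule of thumb: ~4 bytes of English text per BPE token,
# so ~4 TB of plain text is on the order of 1T tokens.
print(f"~{total_bytes / 4:.2e} tokens extracted so far")
```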
If you’re trying to do the Chinchilla-optimality thing, I fully agree that there’s no way you’re going to be able to do that with the compute budget available to mere mortals. If you’re instead trying to do the “generate embeddings for every paragraph of every paper, do similarity searches, and then on matches calculate edit distance to see if it was literally copy-pasted” approach, I think that’d be entirely doable on a hobbyist budget.
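Here’s a sketch of that second, cheaper pipeline, assuming sentence-transformers for the embeddings, FAISS for the nearest-neighbour search, and stdlib difflib for the edit-distance check. The model name and both cutoffs are placeholder assumptions, not tuned values.

```python
# Sketch of "embed every paragraph, nearest-neighbour search, then edit distance
# on the matches". pip install sentence-transformers faiss-cpu
import difflib

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice


def build_index(corpus_paragraphs):
    """Embed the corpus once and put the vectors in a flat inner-product index."""
    emb = model.encode(corpus_paragraphs, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine after normalization
    index.add(np.asarray(emb, dtype=np.float32))
    return index


def find_suspect_matches(index, corpus_paragraphs, query_paragraphs,
                         k=5, cosine_cutoff=0.85, edit_cutoff=0.8):
    """Return (query, corpus, cosine, edit_ratio) pairs that look copied or near-copied."""
    q = model.encode(query_paragraphs, normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype=np.float32), k)
    hits = []
    for qi, (row_scores, row_ids) in enumerate(zip(scores, ids)):
        for score, pid in zip(row_scores, row_ids):
            if score < cosine_cutoff:
                continue
            # Cheap "was this literally copy-pasted?" check, only on the shortlisted pairs.
            ratio = difflib.SequenceMatcher(
                None, query_paragraphs[qi], corpus_paragraphs[pid]).ratio()
            if ratio >= edit_cutoff:
                hits.append((query_paragraphs[qi], corpus_paragraphs[pid],
                             float(score), ratio))
    return hits
```

At 88M papers you’d presumably want an approximate index (IVF or HNSW) rather than a flat one, and you’d shard the work, but the shape of the pipeline stays the same.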
I personally think it’d be a great learning project.
I think there are two reasons it’s not more common to retroactively analyze papers and publications for copied or closely-paraphrased segments.
First, it’s not actually easy to automate. Current solutions are RIFE with false positives and still require human judgement to reach a final conclusion.
Second, and perhaps more importantly, nobody really cares outside of graded work, where the institution is basing your credentials on you doing original work (and usually not even that, just a semi-original presentation of others’ work).
It would probably be a minor scandal if any significant paper were discovered to be based on uncredited, un-footnoted work by someone else, but unless it were egregious (in which case it probably would have been noticed already), it’s just not that big a deal.
Distinguishing between a properly cited paraphrase and passing someone’s work off as your own without sufficient attribution is not trivial even for people. There’s a lot of grey area in how closely you can mimic the original before it becomes problematic, and it comes down to a judgement call in the edge cases. This is largely what I’ve seen Rufo trying to hang the Harvard president with: paraphrases that kept a lot of the original wording but were nonetheless clearly cited, which at least to me looks like bad practice but not actually plagiarism in the sense the word is generally meant.
A professor I know fell afoul of an automated plagiarism detector because it flagged passages from her own previous papers on the same subject, and the journal refused to reconsider. It felt very silly: they were effectively asking her to go through and arbitrarily change wording she thought was best, simply because she had used it before and the computer said so. I think she ultimately submitted to a different journal and it was accepted there.