The other day, during an after-symposium discussion on detecting BS AI/ML papers, one of my colleagues suggested that a text search for “random split” makes a good first-pass test.
A paper I’m doing mech interp on used a random split even though its dataset ships with a non-random canonical split. They also validated on their test data (the dataset has a three-way split) and used the original BERT architecture (sinusoidal embeddings added straight to the input, post-norming, no MuP) in a paper that came out in 2024. The training batch size is so small it could be quadrupled and still fit on my 16GB GPU. People trying to get into ML from the science end have no idea what they’re doing. It was published in Bioinformatics.
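For anyone unfamiliar with why this is damning: canonical splits in bioinformatics datasets are often constructed to keep similar sequences out of the test set, so randomly re-splitting reintroduces exactly the leakage the split was designed to prevent. Here’s a minimal sketch of the anti-pattern versus the correct usage (the arrays and split labels are hypothetical stand-ins, not the paper’s actual data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset that ships with a canonical three-way split,
# e.g. built so no test example is homologous to anything in train.
rng = np.random.default_rng(0)
X = rng.random((1000, 32))
y = rng.integers(0, 2, size=1000)
canonical_split = rng.choice(["train", "val", "test"], size=1000)

# Anti-pattern: discard the canonical split and re-split randomly.
# Homologous pairs now straddle train/test, silently inflating scores.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Correct: respect the canonical split; tune hyperparameters on val
# only, and touch test exactly once, at the very end.
X_train, y_train = X[canonical_split == "train"], y[canonical_split == "train"]
X_val,   y_val   = X[canonical_split == "val"],   y[canonical_split == "val"]
X_test,  y_test  = X[canonical_split == "test"],  y[canonical_split == "test"]
```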