J Bostock comments on Three Subtle Examples of Data Leakage

J Bostock 3 Oct 2024 10:29 UTC
16 points
4
sellers auction several very similar lots in quick succession and then never auction again
This is also extremely common in biochem datasets. You’ll get results in groups of very similar molecules, and families of very similar protein structures. If you do a random train/test split, close relatives of every test example end up in the training set, so your model will look very good while actually just picking up on coarse features of each group.
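A minimal sketch of the alternative to a random split, assuming each sample carries a group label (a molecular scaffold, a protein family, or a seller ID in the auction case). The function name `group_split` and the family labels are hypothetical, invented for illustration. Whole groups go to either train or test, never both, so the test set only contains groups the model has not seen.

```python
import random
from collections import defaultdict


def group_split(groups, test_fraction=0.2, seed=0):
    """Split sample indices so that no group appears in both train and test.

    `groups` holds one group label per sample. The test set is filled one
    whole group at a time, so its size only approximates `test_fraction`.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)

    # Shuffle the group labels, not the individual samples.
    group_ids = sorted(by_group)
    random.Random(seed).shuffle(group_ids)

    n_test_target = test_fraction * len(groups)
    train, test = [], []
    for g in group_ids:
        # Keep adding whole groups to test until the target size is reached.
        if len(test) < n_test_target:
            test.extend(by_group[g])
        else:
            train.extend(by_group[g])
    return train, test


# Hypothetical family labels, one per molecule.
families = (
    ["kinase_A"] * 5
    + ["kinase_B"] * 3
    + ["protease_C"] * 4
    + ["gpcr_D"] * 2
    + ["lipase_E"] * 6
)

train_idx, test_idx = group_split(families, test_fraction=0.25, seed=0)
train_families = {families[i] for i in train_idx}
test_families = {families[i] for i in test_idx}

# No family is shared between the two sides of the split.
print(train_families & test_families)
```

In practice scikit-learn’s `GroupShuffleSplit` or `GroupKFold` do the same job. The hard part is usually choosing the group labels, since "very similar" molecules or structures often have to be clustered first.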
- Ponder Stibbons 3 Oct 2024 12:47 UTC
  10 points
  5
  Parent
  The other day, during an after-symposium discussion on detecting BS AI/ML papers, one of my colleagues suggested doing a text search for “random split” as a good test.
  - J Bostock 4 Oct 2024 21:58 UTC
    6 points
    0
    Parent
    A paper I’m doing mech interp on used a random split when the dataset they used already has a non-random canonical split. They also validated on their test data, even though the dataset has a three-way split, and used the original BERT architecture (sinusoidal embeddings which are added to feedforward, post-norm, no μP) in a paper that came out in 2024. The training batch size is so small that it can be quadrupled and still fit on my 16GB GPU. People trying to get into ML from the science end have got no idea what they’re doing. It was published in Bioinformatics.