This sounds like valid criticism, but isn’t the task of understanding which proteins/ligands are similar enough to each other to bind in the same way non-trivial in itself? If so, wouldn’t exploiting such similarities require the model to do something substantially more sophisticated than just memorizing?
(ligand = drug-like molecule, for anyone else reading)
Right, I didn’t mean exact, bit-for-bit memorization.
The dataset is redundant(ish), simply as an artifact of how it’s constructed:
For example, if people know that X binds A, and X ≈ Y, and A ≈ B, they’ll try to add X+B, Y+A, and Y+B to the dataset as well.
And this makes similarity-based predictions look much more useful than they actually are, because in the “real world” you need to make predictions about molecules that are dissimilar to anything in your existing collection.
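To make the effect concrete, here is a minimal sketch (entirely synthetic fingerprints and a made-up labelling rule, not real binding data): a 1-nearest-neighbour “model” that only exploits Tanimoto similarity scores well on a test set full of near-duplicates of the training data, and drops toward chance on dissimilar molecules.

```python
# Toy illustration of test-set redundancy inflating similarity-based accuracy.
# All data is synthetic; the "binding" label is an arbitrary hidden rule.
import random

random.seed(0)

def tanimoto(a, b):
    # Tanimoto similarity between two bit-vector "fingerprints".
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

def random_fp(n=64, density=0.3):
    return [1 if random.random() < density else 0 for _ in range(n)]

def mutate(fp, flips=3):
    # A near-duplicate: flip a few bits (the analogue of "X ≈ Y").
    fp = fp[:]
    for i in random.sample(range(len(fp)), flips):
        fp[i] ^= 1
    return fp

def label(fp):
    # Hidden ground-truth rule the model is supposed to learn.
    return int(sum(fp[:16]) > sum(fp[16:32]))

train = [random_fp() for _ in range(200)]
similar_test = [mutate(fp) for fp in random.sample(train, 50)]
dissimilar_test = [random_fp() for _ in range(50)]

def knn_predict(fp):
    # "Memorization-ish" model: copy the label of the most similar
    # training example.
    nearest = max(train, key=lambda t: tanimoto(fp, t))
    return label(nearest)

def accuracy(test):
    return sum(knn_predict(fp) == label(fp) for fp in test) / len(test)

print(f"near-duplicate test set: {accuracy(similar_test):.2f}")
print(f"dissimilar test set:     {accuracy(dissimilar_test):.2f}")
```

The gap between the two numbers is the point: the same trivial model looks strong when the benchmark contains analogues of its training set, and much weaker when it doesn’t.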
I hope this makes sense.