(ligand = drug-like molecule, for anyone else reading)
Right, I didn’t mean exact bitwise memory comparisons.
The dataset is redundant(ish), simply as an artifact of how it’s constructed:
For example, if people know that X binds A, and X ≈ Y, and A ≈ B, they’ll try to add X+B, Y+A and Y+B to the dataset also.
And this makes similarity-based predictions look artificially much more useful than they actually are, because in the “real world”, you will need to make predictions about dissimilar molecules from some collection.
(ligand = drug-like molecule, for anyone else reading)
Right, I didn’t mean exact bitwise memory comparisons.
The dataset is redundant(ish), simply as an artifact of how it’s constructed:
For example, if people know that X binds A, and X ≈ Y, and A ≈ B, they’ll try to add X+B, Y+A and Y+B to the dataset also.
And this makes similarity-based predictions look artificially much more useful than they actually are, because in the “real world”, you will need to make predictions about dissimilar molecules from some collection.
I hope this makes sense.