But if it’s a poor split, wouldn’t it also favour the baseline (your software)? Yet they still beat the baseline. So even if your concern is correct, they did outperform the baseline; they just didn’t realistically measure generalisation to radically different structures.
So it’s not fair to say ‘it’s only memorisation’. It seems fairer to say ‘it doesn’t generalise well enough to serve as docking software, and this isn’t obvious at first because of a poor choice of train/test split’.