How about this: I train on all available data, but only report performance for the lots predicted to be <$1000?
This still feels squishy to me (even after your footnote about separately tracking how many lots were predicted <$1000). You’re giving the model partial control over how it gets tested.
The only concrete abuse I can immediately come up with: maybe it cheats the way you predicted, submitting artificially high estimates for hard-to-estimate lots, and you miss it because it also cheats in the other direction, rounding down its estimates for easy-to-predict lots whose estimates would otherwise land just slightly over $1000.
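To make that concrete for myself, here’s a toy simulation (every number in it is invented, and the “gamed” model is deliberately cartoonish). The point is just that the error reported on the “predicted <$1000” subset can look great even when the model got strictly worse overall:

```python
# Toy illustration, all numbers invented: a "gamed" model pushes its
# hard-to-estimate lots above the $1000 cutoff (so they drop out of the
# reported subset) and nudges borderline easy lots just under the cutoff
# (so they stay in). The reported error improves; the overall error doesn't.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true_value = rng.uniform(100, 2000, size=n)
hard = rng.random(n) < 0.3                      # 30% of lots are hard to estimate

# Honest model: noisy but unbiased, much noisier on the hard lots.
noise = rng.normal(0, np.where(hard, 400, 50))
honest_pred = true_value + noise

# Gamed model: inflate hard lots above $1000 and round borderline
# easy lots down to just below $1000.
gamed_pred = honest_pred.copy()
gamed_pred[hard] = np.maximum(gamed_pred[hard], 1001)
borderline = (~hard) & (gamed_pred > 1000) & (gamed_pred < 1100)
gamed_pred[borderline] = 999

def reported_mae(pred, truth, cutoff=1000):
    """MAE computed only on lots the model itself predicts to be < cutoff."""
    mask = pred < cutoff
    return np.abs(pred[mask] - truth[mask]).mean(), mask.mean()

for name, pred in [("honest", honest_pred), ("gamed", gamed_pred)]:
    mae, frac = reported_mae(pred, true_value)
    overall = np.abs(pred - true_value).mean()
    print(f"{name:>6}: reported MAE={mae:7.1f} on {frac:.0%} of lots, "
          f"overall MAE={overall:7.1f}")
```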
But just like you say that it’s easier to notice leakage than to say exactly how (or how much) it’ll matter, I feel like we should be able to say “you’re giving the model partial control over which problems it’s evaluated on; that seems bad” without necessarily predicting how it will matter.
My instinct would be to try to move the grading closer to the model’s ultimate impact on the client’s interests. For example, if you can determine what each lot in your data set was “actually worth (to you)”, then perhaps you could calculate how much money would be made or lost if you’d submitted a given bid (taking into account whether that bid would’ve won), and then train the model to find a bidding strategy with the highest expected payout.
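Concretely, and with some big assumptions baked in (that you know each historical lot’s worth-to-you and the price it actually sold for, and that bidding above the historical winning price counts as winning at your bid), I’m picturing something like this hypothetical backtest. All the names here are made up:

```python
# Minimal sketch of "grade the model by money, not by price error".
# Assumptions (none of which may hold for your data): you know each
# historical lot's worth to you and its winning bid, and you treat
# "my bid > historical winning bid" as "I would have won at my bid price".
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Lot:
    features: dict          # whatever the model sees before bidding
    worth_to_us: float      # hypothetical: what the lot was actually worth to you
    winning_bid: float      # what it historically sold for

def backtest_payout(lots: Iterable[Lot],
                    bid_strategy: Callable[[dict], float]) -> float:
    """Total profit/loss if we had followed bid_strategy on these lots.

    Ignores the opportunity cost of losing bids and assumes our bid would
    not have changed anyone else's behaviour (both shaky simplifications).
    """
    total = 0.0
    for lot in lots:
        our_bid = bid_strategy(lot.features)
        if our_bid > lot.winning_bid:          # we'd have won at our bid price
            total += lot.worth_to_us - our_bid
        # losing bid: payout of zero under these assumptions
    return total

# Example strategy: bid some fraction of the model's value estimate.
def make_strategy(value_model, margin: float = 0.8):
    return lambda features: margin * value_model(features)
```

You’d then pick the bidding strategy (or train the underlying value model) to maximize backtest_payout on held-out lots, rather than to minimize price-prediction error.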
But I can imagine a lot of reasons you might not actually be able to do that: maybe you don’t know the “actual worth” in your training set, maybe unsuccessful bids have a hard-to-measure opportunity cost, maybe you want the model to do something simpler so that it’s more likely to remain useful if your circumstances change.
Also, you sound like you do this for a living, so I put about 30% probability on you telling me that my concerns are wrong-headed for some well-studied reason I’ve never heard of.