Three Subtle Examples of Data Leakage
This is a description of my work on some data science projects, lightly obfuscated and fictionalized to protect the confidentiality of the organizations I handled them for (and also to make it flow better). I focus on the high-level epistemic/mathematical issues, and the lived experience of working on intellectual problems, but gloss over the timelines and implementation details.
Data Leakage (n.): The use of information during Training and/or Evaluation which wouldn’t be available in Deployment.
The Upper Bound
One time, I was working for a company which wanted to win some first-price sealed-bid auctions in a market they were thinking of joining, and asked me to model the price-to-beat in those auctions. There was a twist: they were aiming for the low end of the market, and didn’t care about lots being sold for more than $1000.
“Okay,” I told them. “I’ll filter out everything with a price above $1000 before building any models or calculating any performance metrics!”
They approved of this, and told me it’d take a day or so to get the data ready. While I waited, I let my thoughts wander.
“Wait,” I told them the next morning. “That thing I said was blatantly insane and you’re stupid for thinking it made sense[1]. We wouldn’t know whether the price of a given lot would be >$1000 ahead of time, because predicting price is the entire point of this project. I can’t tell you off the top of my head what would go wrong or how wrong it would go, but it’s Leakage, there has to be a cost somewhere. How about this: I train on all available data, but only report performance for the lots predicted to be <$1000?”
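(For concreteness, the difference between the rule I’d first proposed and the one I was now proposing looks something like the sketch below; `y_true` and `y_pred` are illustrative names, not the project’s actual code.)

```python
import numpy as np

CEILING = 1000

def leaky_report_mask(y_true, ceiling=CEILING):
    """What I first proposed: keep rows whose *actual* price is under the
    ceiling -- information we would not have before the auction."""
    return np.asarray(y_true) < ceiling

def deployable_report_mask(y_pred, ceiling=CEILING):
    """What I proposed instead: keep rows the model itself *predicts* to be
    under the ceiling, which is available pre-auction; performance metrics
    get reported over this mask only."""
    return np.asarray(y_pred) < ceiling
```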
They, to their great credit, agreed[2]. They then provided me with the dataset, alongside the predictive model they’d had some Very Prestigious Contractors make, which they wanted me to try and improve upon. After a quick look through their documentation, I found that the Very Prestigious Contractors had made the same mistake I had, and hadn’t managed to extricate themselves from it; among other things, this meant I got to see firsthand exactly how this Leakage damaged model performance.
If you make a model predicting a response from other factors, but feed it a dataset excluding responses over a certain ceiling, it’ll tend to underestimate, especially near the cutoff point; however, if you then test it on a dataset excluding the same rows, it’ll look like it’s overestimating, since it’s missing the rows it would underestimate. The end result of this was the Very Prestigious Contractors putting forth a frantic effort to make the Actual-vs-Predicted graphs line up (i.e. actively pushing things in the wrong direction), and despairing when no possible configuration of extra epicycles let them fit ‘correctly’ to their distorted dataset while keeping model complexity below agreed limits; their final report concluded with a sincere apology for not managing to screw up more than they did.
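(A toy simulation, on synthetic data that has nothing to do with the real dataset, reproduces both halves of that pattern: a plain linear model fit to ceiling-truncated data sits below the true prices it would face in deployment at the high end, yet looks like it’s overestimating when scored only against equally truncated test rows.)

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
CEILING = 1000

def make_lots(n):
    x = rng.uniform(0, 1500, size=(n, 1))        # one pre-auction feature
    y = x[:, 0] + rng.normal(0, 150, size=n)     # true price = feature + noise
    return x, y

x_train, y_train = make_lots(20_000)
x_test, y_test = make_lots(20_000)

# The leaky setup: drop every row priced above the ceiling *before* modelling.
kept = y_train < CEILING
model = LinearRegression().fit(x_train[kept], y_train[kept])

# Deployment view: lots with high feature values, scored against their *true* prices.
high = (x_test[:, 0] > 900) & (x_test[:, 0] < 1100)
bias_vs_truth = (model.predict(x_test[high]) - y_test[high]).mean()
print(f"mean error vs truth near the ceiling: {bias_vs_truth:.0f}")   # negative: underestimation

# Leaky evaluation view: the same lots, but with actuals above the ceiling dropped.
kept_test = high & (y_test < CEILING)
bias_on_truncated = (model.predict(x_test[kept_test]) - y_test[kept_test]).mean()
print(f"apparent mean error on truncated test set: {bias_on_truncated:.0f}")  # positive: looks like overestimation
```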
But I didn’t need to know things would break that exact way. I just needed to be able to detect Leakage.
The Time-Travelling Convention
Another time, I was working for another company which wanted to know how aggressively to bid in some first-price sealed-bid auctions, and asked me to model how much they were likely to make from each lot. There was no twist: they had a smallish but very clean dataset of things they’d won previously, various details about each lot which were available pre-auction, and how much money they’d made from them. Everything was normal and sensible.
“Okay,” I told them. “I’ll random-sample the data into training and testing sets, decide hyperparameters (and all the other model choices) by doing cross-validation inside the training set, then get final performance metrics by testing once on the testing set.”
They approved of this, and told me to get started. I took a little time to plan the project out.
“Wait,” I told them. “That thing I said was blatantly insane and you’re stupid for thinking it made sense[1]. If I get my training and testing sets by random-sampling, then I’ll be testing performance of a model trained (in part) on February 20XX data on a dataset consisting (in part) of January 20XX data: that’s time travel! I can’t tell you off the top of my head what would go wrong or how wrong it would go, but it’s Leakage, there has to be a cost somewhere. We should be doing a strict chronological split: train on January data, validate and optimize on February data, final-test on March data.”
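(For concreteness, the two splitting schemes look something like the sketch below; the schema, with its `auction_date` column and placeholder dates standing in for the obfuscated months, is illustrative rather than the company’s.)

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"auction_date": pd.date_range("2020-01-01", "2020-03-31", freq="D")})
df["revenue"] = rng.normal(500, 100, len(df))    # toy target

# Conventional scheme: random split, with cross-validation inside the training set.
train_rand, test_rand = train_test_split(df, test_size=0.2, random_state=0)

# Strict arrow of time: train on January, tune on February, final-test on March.
train_chr = df[df["auction_date"] < "2020-02-01"]
valid_chr = df[(df["auction_date"] >= "2020-02-01") & (df["auction_date"] < "2020-03-01")]
test_chr = df[df["auction_date"] >= "2020-03-01"]
```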
The company responded with tolerant skepticism, mentioning that random splits were the convention both for them and the wider industry; I replied that this was probably because everyone else was wrong and I was right[1]. They sensibly asked me to prove this assertion by demonstrating a meaningful difference between the approach they wanted and the one I planned.
I looked into it and found . . . that the conventional approach worked fine. Context drift between training and deployment was small enough to be negligible, the ideal hyperparameters were the same regardless of what I did, and maintaining a strict arrow of time wasn’t worth the trouble of changing the company’s processes or the inconvenience of not being able to use conventional cross-validation. I was chastened by this result . . .
. . . until I looked into performance of chronological vs random splits on their how-much-will-this-lot-cost-us datasets, and found that chronological splits were meaningfully better there. It was several months after I proved this that I figured out why, and the mechanism—sellers auction several very similar lots in quick succession and then never auction again; random splits put some of those ‘clone’ lots in train and some in validation/test, incentivizing overfit; meanwhile, chronological splits kept everything in a given batch of clones on one side of the split—wasn’t anything I’d been expecting.
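(A toy version of that mechanism, on synthetic data with a hypothetical seller grouping rather than the company’s dataset: when each seller contributes a batch of near-identical lots plus a seller-level quirk the features can’t explain, a shuffled K-fold lets the model effectively look up its own test rows, while a split that keeps each seller’s batch together, as the chronological split did, does not.)

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

rng = np.random.default_rng(0)
n_sellers, clones = 300, 5

# Each seller auctions several near-clone lots, then never auctions again.
seller_features = rng.normal(0, 1, (n_sellers, 5))
X = np.repeat(seller_features, clones, axis=0) + rng.normal(0, 0.05, (n_sellers * clones, 5))
seller_quirk = np.repeat(rng.normal(0, 1, n_sellers), clones)   # unlearnable for genuinely new sellers
y = X[:, 0] + seller_quirk + rng.normal(0, 0.3, len(X))
groups = np.repeat(np.arange(n_sellers), clones)                # which seller each lot came from

model = RandomForestRegressor(n_estimators=100, random_state=0)
r2_random = cross_val_score(model, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean()
r2_grouped = cross_val_score(model, X, y, cv=GroupKFold(5), groups=groups).mean()
print(f"random split R^2:  {r2_random:.2f}")    # flattered: clones straddle the split
print(f"grouped split R^2: {r2_grouped:.2f}")   # closer to what deployment would look like
```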
But I didn’t need to know things would break that exact way. I just needed to be able to detect Leakage (. . . and test whether it mattered).
The Tobit Problem
A third time, I was working for a third company which was winning some first-price sealed-bid auctions, and wanted to win more . . . actually, I’ve already written the story out here. Tl;dr: there was some Leakage but I spotted it (this time managing to describe pre hoc what damage it would do), came up with a fix which I thought wasn’t Leakage (but I thought it prudent to check what it did to model performance, and subsequently figured out where and how I’d been wrong), and then scrambled around frantically building an actually Leakage-proof solution.
My Takeaways
There is always a price for Leakage.
Often, the price is tolerably small, or already paid; if so, it’s entirely possible some Leakage is the least of the available evils. But it’s still (usually) worth checking.
Just because Leakage is tolerable in one context, that doesn’t mean it’s tolerable in a similar context.
“It’s what everyone does” and “It’s what we always do” are meaningful evidence that a given Leakage is more likely to be the bearable kind, but they don’t make something not Leakage, and they don’t provide any guarantees.
It’s usually easier to notice Leakage than to fully describe or quantify the damage it might do.
It’s sometimes possible to find Leakage by looking for damage done.
(Comparisons to bad-reasoning-in-general are left as an exercise for the reader.)
[1] I did not use these exact words.
[2] They also asked that I report [# of lots predicted as <$1000] alongside my other performance metrics. This struck me as sensible paranoia: if they hadn’t added that stipulation, I could have just cheated my way to success by predicting which lots would be hard to predict and marking them as costing $9999.