Towards whatever-you-call-the-thing-I-got-from-reading-LW ism:
Thinking about all information systems as fundamentally performing the computation “Result <== Prior x Evidence” has been responsible for 5 of the 7 biggest successes in my career thus far. The other 2 had nothing to do with math/information/probability. All of the 5 were me noticing, where many better educated “more qualified” individuals did not, that some part of the actual information system’s implementation was broken w.r.t. “Result <== Prior x Evidence” and figuring out how to phrase the brokenness in some other way (’cause inferential distance), resulting in institutional pressure to fix it.
Are there any of them you could explain? It would be interesting to hear how that cashes out in real life.
A piece of a certain large corporation’s spelling/grammar checker was at its heart Result <== Prior x Evidence. Due to legacy code, decaying institutional knowledge, etc., no one knew this. The code/math was strewn about many files. Folks had tweaked the code over the years, allowed parameters to vary, fit those parameters to data.
I read the code and realized that fundamentally it had to be “about” determining a prior, determining evidence, and computing a posterior. I reconstructed the actual math being performed and discovered that two exponents were different from each other, each “best fit to the data”, and I couldn’t think of any good reason they should be different. Brain set off all sorts of warning bells.
I examined how we trained on the data to determine the values we’d use for these exponents. Turns out, the process was completely unjustifiable, and only seemed to give better results because our test set was subtly part of the training set. Now that’s something everyone understands immediately; you don’t train on your test set. So we fixed our evaluation process, stopped allowing those particular parameters to float, and improved overall performance quite a bit.
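For anyone who wants to check their own setup for the same problem, here is a minimal sketch of the crudest possible leakage check. The function name and the toy (typed word, intended word) pairs are hypothetical, not anything from the actual system, and a leak that is only "subtly" there (near-duplicates, shared source documents) would need a looser notion of "same example" than exact equality.

```python
def leaked_examples(train, test):
    """Return the test examples that also appear verbatim in the training data."""
    seen = set(train)
    return [example for example in test if example in seen]


# Hypothetical toy data: (typed word, intended word) pairs.
train = [("teh", "the"), ("recieve", "receive"), ("adress", "address")]
test = [("teh", "the"), ("definately", "definitely")]

overlap = leaked_examples(train, test)
print(f"{len(overlap)} of {len(test)} test examples also appear in training")
```

If that count is not zero, any parameter you fit on the training data gets graded partly on examples it has already seen, so a "better fit" tells you very little.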
Note, math. Because information and Bayes and correlation and such are unfortunately not simple, it’s entirely possible that some type of data is better served by e^(a*ln(P(v|w)) - b*ln(P(v|~w))) where a != b and neither equals 1. I dunno. But if you see someone obviously only introducing a and b and then fitting them to data because :shrug: why not, that’s when your red flags go up and you realize someone’s put this thing together without paying attention. And in this case, after fixing the evaluation, we did end up leaving a == b != 1, which is just Naive Bayes, basically. a != b was the really disconcerting bit.
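To make the a != b worry concrete, here is a small sketch of that expression. The function names and the probabilities are made up for illustration, and I am reading v as the observed evidence and w as the hypothesis, which is my assumption about the notation. The point it shows: with a == b the score depends only on the likelihood ratio P(v|w)/P(v|~w), while with a != b two cases with the same ratio can score differently.

```python
import math


def score(p_v_given_w, p_v_given_not_w, a=1.0, b=1.0):
    """e^(a*ln(P(v|w)) - b*ln(P(v|~w))), i.e. P(v|w)**a / P(v|~w)**b."""
    return math.exp(a * math.log(p_v_given_w) - b * math.log(p_v_given_not_w))


# Two hypothetical cases with the same likelihood ratio of 2.
case_1 = (0.2, 0.1)
case_2 = (0.8, 0.4)

# a == b == 1: the plain likelihood ratio.
plain_1 = score(*case_1)
plain_2 = score(*case_2)

# a == b != 1: the likelihood ratio raised to a power, still equal for both cases.
tied_1 = score(*case_1, a=0.5, b=0.5)
tied_2 = score(*case_2, a=0.5, b=0.5)

# a != b: the two cases no longer agree, even though their ratios are identical.
split_1 = score(*case_1, a=1.0, b=0.5)
split_2 = score(*case_2, a=1.0, b=0.5)

print(plain_1, plain_2)
print(tied_1, tied_2)
print(split_1, split_2)
```

With a == b you are just tempering the evidence, which is at least a coherent thing to do. With a != b the score starts caring about the absolute sizes of the two probabilities, and you should want a reason for that beyond "it fit better".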