A piece of a certain large corporation’s spelling/grammar checker was, at its heart, Result <== Prior x Evidence. Due to legacy code, decaying institutional knowledge, etc., no one knew this. The code/math was strewn about many files. Folks had tweaked the code over the years, allowed parameters to vary, and fit those parameters to data.
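For concreteness, here’s a minimal sketch of what “Result <== Prior x Evidence” looks like when it’s written down in one place rather than strewn across files. The names are mine, not the corporation’s actual code:

```python
import math

def correction_score(prior_w, evidence_v_given_w):
    """Posterior-style score for a candidate correction w given observed text v.

    prior_w:            P(w)   -- how plausible the intended word/phrase is a priori
    evidence_v_given_w: P(v|w) -- how likely the observed (possibly misspelled) text
                                  is, assuming the writer meant w
    Work in log space so many small probabilities don't underflow.
    """
    return math.log(prior_w) + math.log(evidence_v_given_w)

# Pick the candidate correction with the highest score, e.g.:
# best = max(candidates, key=lambda w: correction_score(prior[w], evidence[w]))
```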
I read the code, realized that fundamentally it had to be “about” determining a prior, determining evidence, and computing a posterior, and reconstructed the actual math being performed. Two exponents turned out to be different from each other, “best fit to the data”, and I couldn’t think of any good reason they should differ. My brain threw up all sorts of warning bells.
I examined how we trained on the data to determine the values we’d use for these exponents. Turns out, the process was completely unjustifiable, and only seemed to give better results because our test set was subtly part of the training set. Now that’s something everyone understands immediately; you don’t train on your test set. So we fixed our evaluation process, stopped allowing those particular parameters to float, and improved overall performance quite a bit.
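A sketch of the evaluation fix, under my assumptions about the setup (the data structures and tuning/evaluation steps below are hypothetical): fit any free parameters only on a training split, and report results only on a disjoint held-out set.

```python
import random

def split_corpus(examples, test_fraction=0.2, seed=0):
    """Disjoint train/test split. The whole point is that any parameter
    fitting (e.g. choosing exponents) must never see the test examples."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# train, test = split_corpus(all_labeled_corrections)
# a, b = fit_exponents(train)       # hypothetical tuning step: touches train only
# report_accuracy(test, a, b)       # evaluation: touches test only
```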
A note on the math: because information and Bayes and correlation and such are unfortunately not simple, it’s entirely possible that some type of data is genuinely better served by e^(a*ln(P(v|w)) - b*ln(P(v|~w))) where a != b and neither equals 1. I dunno. But if you see someone introduce a and b and then fit them to data for no reason beyond :shrug: why not, that’s when your red flags go up and you realize someone put this thing together without paying attention. And in this case, after fixing the evaluation, we did end up leaving a == b != 1, which is just Naive Bayes, basically. a != b was the really disconcerting bit.
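To make the algebra concrete: e^(a*ln(P(v|w)) - b*ln(P(v|~w))) is just P(v|w)^a / P(v|~w)^b, so a == b puts a single shared exponent on a likelihood-ratio-style term, while a != b weights the two likelihoods independently. A small sketch (the function name and comments are mine):

```python
import math

def weighted_evidence(p_v_given_w, p_v_given_not_w, a, b):
    """exp(a*ln P(v|w) - b*ln P(v|~w))  ==  P(v|w)**a / P(v|~w)**b"""
    return math.exp(a * math.log(p_v_given_w) - b * math.log(p_v_given_not_w))

# a == b == 1: the plain likelihood ratio P(v|w) / P(v|~w).
# a == b != 1: a single shared exponent on that ratio -- what we were left
#              with after the fix, i.e. "just Naive Bayes, basically".
# a != b:      the two likelihoods get independent weights, which needs a
#              justification beyond "it fit the data better".
```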