Suppose we have a bunch of earthquake sensors spread over an area. They are not perfectly reliable (in terms of either false positives or false negatives), but some are more reliable than others. How can we aggregate the sensor data to detect earthquakes?
A “naive” seismologist without any statistics background might try assigning different numerical scores to each sensor, roughly indicating how reliable their positive and negative results are, just based on the seismologist’s intuition. Sensor i gets a score s+i when it’s going off, and s−i when it’s not. Then, the seismologist can add up the s+i scores for all sensors going off at a given time, plus the s−i scores for sensors not going off, to get an aggregate “earthquake score”. Assuming the seismologist has decent intuitions for the sensors, this will probably work just fine.
It turns out that this procedure is equivalent to a Naive Bayes model.
Naive Bayes is a causal model in which there is some parameter θ in the environment which we want to know about—i.e. whether or not there’s an earthquake happening. We can’t observe θ directly, but we can measure it indirectly via some data {xi} - i.e. outputs from the earthquake sensors. The measurements may not be perfectly accurate, but their failures are at least independent—one sensor isn’t any more or less likely to be wrong when another sensor is wrong.
We can represent this picture with a causal diagram:
From the diagram, we can read off the model’s equation: P[θ,{xi}]=P[θ]∏iP[xi|θ]. We’re interested mainly in the posterior probability P[θ|{xi}]=1ZP[θ]∏iP[xi|θ] or, in log odds form,
L[θ|{xi}]=lnP[θ]P[∼θ]+∑ilnP[xi|θ]P[xi|∼θ]
Stare at that equation, and it’s not hard to see how the seismologist’s procedure turns into a Naive Bayes model: the seismologist’s intuitive scores for each sensor correspond to the “evidence” from the sensor lnP[xi|θ]P[xi|∼θ]. The “earthquake score” then corresponds to the posterior log odds of an earthquake. The seismologist has unwittingly adopted a statistical model. Note that this is still true regardless of whether the scores used are well-calibrated or whether the assumptions of the model hold—the seismologist is implicitly using this model, and whether the model is correct is an entirely separate question.
The Embedded Naive Bayes Equation
Let’s formalize this a bit.
We have some system which takes in data x, computes some stuff, and spits out some f(x). We want to know whether a Naive Bayes model is embedded in f(x). Conceptually, we imagine that f(x) parameterizes a probability distribution over some unobserved parameter θ - we’ll write P[θ;f(x)], where the “;” is read as “parameterized by”. For instance, we could imagine a normal distribution over θ, in which case f(x) might be the mean and variance (or any encoding thereof) computed from our input data. In our earthquake example, θ is a binary variable, so f(x) is just some encoding of the probability that θ=True.
Now let’s write the actual equation defining an embedded Naive Bayes model. We assert that P[θ;f(x)] is the same as P[θ|x] under the model, i.e.
P[θ;f(x)]=P[θ|x]=1ZP[θ]∏iP[xi|θ]
We can transform to log odds form to get rid of the Z:
L[θ;f(x)]=lnP[θ]P[∼θ]+∑ilnP[xi|θ]P[xi|∼θ]
Let’s pause for a moment and go through that equation. We know the function f(x), and we want the equation to hold for all values of x. θ is some hypothetical thing out in the environment—we don’t know what it corresponds to, we just hypothesize that the system is modelling something it can’t directly observe. As with x, we want the equation to hold for all values of θ. The unknowns in the equation are the probability functions P[θ;f(x)], P[θ] and P[xi|θ]. To make it clear what’s going on, let’s remove the probability notation for a moment, and just use functions G and {gi}, with θ written as a subscript:
∀θ,x:Gθ(f(x))=cθ+∑igθi(xi)
This is a functional equation: for each value of θ, we want to find functions G, {gi}, and a constant c such that the equation holds for all possible x values. The solutions G and {gi} can then be decoded to give our probability functions P[θ;f(x)] and P[xi|θ], while c can be decoded to give our prior P[θ]. Each possible θ-value corresponds to a different set of solutions Gθ, {gθi}, cθ.
This particular functional equation is a variant of Pexider’s equation; you can read all about it in Aczel’s Functional Equations and Their Applications, chapter 3. For our purposes, the most important point is: depending on the function f, the equation may or may not have a solution. In other words, there is a meaningful sense in which some functions f(x)do embed a Naive Bayes model, and others do not. Our seismologist’s procedure does embed a Naive Bayes model: let G be the identity function, c be zero, and gi(xi)=sxii, and we have a solution to the embedding equation with f(x) given by our seismologist’s add-all-the-scores calculation (although this is not the only solution). On the other hand, a procedure computing f(x)=xxx321 for real-valued inputs x1, x2, x3 would not embed a Naive Bayes model: with this f(x), the embedding equation would not have any solutions.
Embedded Naive Bayes
Suppose we have a bunch of earthquake sensors spread over an area. They are not perfectly reliable (in terms of either false positives or false negatives), but some are more reliable than others. How can we aggregate the sensor data to detect earthquakes?
A “naive” seismologist without any statistics background might try assigning different numerical scores to each sensor, roughly indicating how reliable their positive and negative results are, just based on the seismologist’s intuition. Sensor i gets a score s+i when it’s going off, and s−i when it’s not. Then, the seismologist can add up the s+i scores for all sensors going off at a given time, plus the s−i scores for sensors not going off, to get an aggregate “earthquake score”. Assuming the seismologist has decent intuitions for the sensors, this will probably work just fine.
It turns out that this procedure is equivalent to a Naive Bayes model.
Naive Bayes is a causal model in which there is some parameter θ in the environment which we want to know about—i.e. whether or not there’s an earthquake happening. We can’t observe θ directly, but we can measure it indirectly via some data {xi} - i.e. outputs from the earthquake sensors. The measurements may not be perfectly accurate, but their failures are at least independent—one sensor isn’t any more or less likely to be wrong when another sensor is wrong.
We can represent this picture with a causal diagram:
From the diagram, we can read off the model’s equation: P[θ,{xi}]=P[θ]∏iP[xi|θ]. We’re interested mainly in the posterior probability P[θ|{xi}]=1ZP[θ]∏iP[xi|θ] or, in log odds form,
L[θ|{xi}]=lnP[θ]P[∼θ]+∑ilnP[xi|θ]P[xi|∼θ]
Stare at that equation, and it’s not hard to see how the seismologist’s procedure turns into a Naive Bayes model: the seismologist’s intuitive scores for each sensor correspond to the “evidence” from the sensor lnP[xi|θ]P[xi|∼θ]. The “earthquake score” then corresponds to the posterior log odds of an earthquake. The seismologist has unwittingly adopted a statistical model. Note that this is still true regardless of whether the scores used are well-calibrated or whether the assumptions of the model hold—the seismologist is implicitly using this model, and whether the model is correct is an entirely separate question.
The Embedded Naive Bayes Equation
Let’s formalize this a bit.
We have some system which takes in data x, computes some stuff, and spits out some f(x). We want to know whether a Naive Bayes model is embedded in f(x). Conceptually, we imagine that f(x) parameterizes a probability distribution over some unobserved parameter θ - we’ll write P[θ;f(x)], where the “;” is read as “parameterized by”. For instance, we could imagine a normal distribution over θ, in which case f(x) might be the mean and variance (or any encoding thereof) computed from our input data. In our earthquake example, θ is a binary variable, so f(x) is just some encoding of the probability that θ=True.
Now let’s write the actual equation defining an embedded Naive Bayes model. We assert that P[θ;f(x)] is the same as P[θ|x] under the model, i.e.
P[θ;f(x)]=P[θ|x]=1ZP[θ]∏iP[xi|θ]
We can transform to log odds form to get rid of the Z:
L[θ;f(x)]=lnP[θ]P[∼θ]+∑ilnP[xi|θ]P[xi|∼θ]
Let’s pause for a moment and go through that equation. We know the function f(x), and we want the equation to hold for all values of x. θ is some hypothetical thing out in the environment—we don’t know what it corresponds to, we just hypothesize that the system is modelling something it can’t directly observe. As with x, we want the equation to hold for all values of θ. The unknowns in the equation are the probability functions P[θ;f(x)], P[θ] and P[xi|θ]. To make it clear what’s going on, let’s remove the probability notation for a moment, and just use functions G and {gi}, with θ written as a subscript:
∀θ,x:Gθ(f(x))=cθ+∑igθi(xi)
This is a functional equation: for each value of θ, we want to find functions G, {gi}, and a constant c such that the equation holds for all possible x values. The solutions G and {gi} can then be decoded to give our probability functions P[θ;f(x)] and P[xi|θ], while c can be decoded to give our prior P[θ]. Each possible θ-value corresponds to a different set of solutions Gθ, {gθi}, cθ.
This particular functional equation is a variant of Pexider’s equation; you can read all about it in Aczel’s Functional Equations and Their Applications, chapter 3. For our purposes, the most important point is: depending on the function f, the equation may or may not have a solution. In other words, there is a meaningful sense in which some functions f(x) do embed a Naive Bayes model, and others do not. Our seismologist’s procedure does embed a Naive Bayes model: let G be the identity function, c be zero, and gi(xi)=sxii, and we have a solution to the embedding equation with f(x) given by our seismologist’s add-all-the-scores calculation (although this is not the only solution). On the other hand, a procedure computing f(x)=xxx321 for real-valued inputs x1, x2, x3 would not embed a Naive Bayes model: with this f(x), the embedding equation would not have any solutions.