Spoiler alert, using a lot of tedious math (that we honestly do not fully follow)
I had trouble with this too. I asked on math.stackexchange about part of the derivation of the product rule. Apparently I understood the answer at the time, but that was long ago.
the most extreme is the ‘A implies B’, which many people consider equivalent to ‘B implies A’, where the correct logical rule is ‘not B implies not A’. Jaynes argues this is an example of Bayesian reasoning. For example, if all dogs are mammals (deduction), it also means that some mammals are dogs; concretely, if only 20% of animals are mammals, this information increases the likelihood of dogs by a factor of 5. In this example, A implies B means that B increases the likelihood of A by a factor of 5.
Had trouble following this because “animals” are mentioned only once, so to elaborate: we’re taking the background info to be X = “this thing is an animal”, along with A = “this thing is a dog” and B = “this thing is a mammal”. Then A⇒B is P(B|AX)=1, and if P(B|X)=0.2 then
$$P(A \mid BX) = \frac{P(B \mid AX)\,P(A \mid X)}{P(B \mid X)} = \frac{1 \cdot P(A \mid X)}{0.2} = 5\,P(A \mid X)$$
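To check the arithmetic, here is a minimal sketch of that Bayes' rule update in Python. The function name `posterior` and the 1% prior for "dog" are my own placeholders, not anything from the post or from Jaynes; the factor of 5 does not depend on the prior.

```python
from fractions import Fraction


def posterior(prior_a, p_b_given_a, p_b):
    """Bayes' rule: P(A|BX) = P(B|AX) * P(A|X) / P(B|X)."""
    return p_b_given_a * prior_a / p_b


p_b_given_a = Fraction(1)     # A implies B: every dog is a mammal
p_b = Fraction(1, 5)          # 20% of animals are mammals (from the quoted example)
p_a = Fraction(1, 100)        # assumed prior for "dog", purely illustrative

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
update_factor = p_a_given_b / p_a

print(p_a_given_b)    # 1/20
print(update_factor)  # 5
```

Changing `p_a` to any other value no larger than `p_b` leaves `update_factor` at 5, since the prior cancels out of the ratio.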
Jaynes’ very physicist-oriented example concerns the emission of particles from a radioactive source and a sensor that can detect some portion of these particles. The radioactive source emits on average s particles per second drawn from a Poisson distribution, and the sensor detects each particle with a Bernoulli probability of ϕ. So if ϕ is 0.1, the sensor picks up 10% of particles over the long run, but if you have just n=10 particles, there is no guarantee that it will detect exactly 1 particle. Similarly, even though we might have s=100 particles per second, any given second might have more or fewer than 100 particles emitted. This is where it gets complicated. If you use the MLE estimate, you will always get ϕs particles as your estimate for each second of counts, because MLE ‘assumes’ that the 10:1 particle relationship is fixed and thus ignores the Poisson source variability. So let’s say you have a counter with ϕ=0.1 and have observed a count of 15 particles on this sensor for some second. How many particles, n, have originated from the source during this second? MLE will get you 150 particles, as described above. But Jaynes’ robot gives us 105 particles. What? This is a HUGE difference! This example also surprised experimental physicists. The reason the robot gets 105 and not 150 is that the source has lower variability than the detector, so a high count is only weak evidence of an above-average number of particles.
Had trouble following this too. I thought we were trying to estimate s and/or φ. But that’s not it; we know s = 100 and φ = 0.1, and we know we detected 15 particles, and the question is how many were emitted.
And if we do an MLE, we’d say “well, the number-emitted that gives us the highest probability of detecting 15 is 150”, so that’s the estimate. We’re throwing away what we know about the distribution of how-many-emitted.
And I guess we could instead ask “what fraction of emitted particles did we detect?” Presumably then we throw away what we know about the distribution of what-fraction-detected, and we say “well, the fraction-detected that gives us the highest probability of detecting 15 is 0.15”, so that would be the MLE.
Which gives us another way to see that MLE is silly, because “what fraction did we detect” and “how many were emitted” are the same question, given that we detected 15.
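For anyone who wants to see the 150 vs. 105 gap numerically, here is a rough sketch. It assumes my reading of the setup: the emitted count n is Poisson with mean s, and the detected count c given n is Binomial(n, φ). The helper names and the cutoff of 400 on n are my own choices, not anything from Jaynes.

```python
import math


def poisson_pmf(k, mean):
    """P(K = k) for a Poisson distribution, computed in log space."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))


def binomial_pmf(k, n, p):
    """P(K = k) for k successes out of n independent trials."""
    if k > n:
        return 0.0
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


s, phi, c = 100, 0.1, 15   # source rate, detection probability, observed count
n_values = range(c, 400)   # assumed cutoff; the mass above 400 is negligible

# MLE: use only the likelihood P(c | n) and ignore the Poisson prior on n.
likelihood = {n: binomial_pmf(c, n, phi) for n in n_values}
mle_n = max(likelihood, key=likelihood.get)

# Posterior: weight the likelihood by the prior P(n | s), then normalise.
weights = {n: poisson_pmf(n, s) * likelihood[n] for n in n_values}
total = sum(weights.values())
post_mean = sum(n * w for n, w in weights.items()) / total

# The likelihood peaks at c / phi = 150; because n is an integer, 149 ties
# with 150, so floating point may report either value.
print(mle_n)
print(round(post_mean, 3))  # approximately 105
```

The posterior mean matches c + s(1 − φ) = 15 + 90: the 15 detected particles are known for certain, and the undetected ones are still expected to number about 90.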