Another PT:LoS question. In Chapter 8 (“Sufficiency, Ancillarity and all that”), there’s a section on Fisher information. I’m very interested in understanding it, because the concept has come up in important places in my statistics classes without any conceptual discussion of it: it’s in the Cramér-Rao bound and the Jeffreys prior, but it looks so arbitrary to me.
Jaynes’s explanation of it as a difference in the information that different parameter values give you about large samples is really interesting, but there’s one step of the math that I just can’t follow. He does what looks like a second-order Taylor approximation of log p(x|theta), but there’s no first-order term and the second-order term is negative for some reason?! What happened there?
There’s no first-order term because you are expanding around a maximum of the log posterior density. Similarly, the second-order term is negative (well, negative definite) precisely because the posterior density falls off away from the mode. What’s happening, in rough terms, is that each additional piece of data has, in expectation, the effect of making the log posterior curve down more sharply (around the true value of the parameter) by the amount of one copy of the Fisher information matrix (this is all assuming the model is true, etc.). You might also be interested in the concept of “observed information,” which is the negative of the Hessian of the (actual, not expected) log-likelihood at the mode.
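To make that concrete in the single-parameter case (this is my own sketch; the notation $\hat\theta$ for the mode is not Jaynes’s): expanding around the mode $\hat\theta$ of $\log p(x\mid\theta)$,

$$\log p(x\mid\theta) \approx \log p(x\mid\hat\theta) + (\theta-\hat\theta)\left.\frac{\partial \log p(x\mid\theta)}{\partial\theta}\right|_{\hat\theta} + \frac{1}{2}(\theta-\hat\theta)^2\left.\frac{\partial^2 \log p(x\mid\theta)}{\partial\theta^2}\right|_{\hat\theta},$$

where the first derivative vanishes because $\hat\theta$ is an interior maximum, and the observed information $J(\hat\theta) = -\left.\partial^2\log p(x\mid\theta)/\partial\theta^2\right|_{\hat\theta}$ is positive, so the expansion reads $\log p(x\mid\hat\theta) - \tfrac{1}{2}J(\hat\theta)\,(\theta-\hat\theta)^2$. For n i.i.d. observations the log-likelihood is a sum, and $E\!\left[-\partial^2\log p(x_i\mid\theta)/\partial\theta^2\right] = I(\theta)$ is the per-observation Fisher information, which is the “one copy per data point” statement above.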
ah, thank you! It makes me so happy to finally see why that first term disappears.
But now I don’t see why you subtract the second-order terms.
I mean, I do see that since you’re at a maximum, the value of the function has to decrease as you move away from it.
But, in the single-parameter case, Jaynes’s formula becomes
$$\log p(x\mid\theta) = \log p(x\mid\theta_0) - \frac{\partial^2 \log p(x\mid\theta)}{\partial\theta^2}\,(\delta\theta)^2$$

But that second derivative there is negative. And since we’re subtracting it, the function is growing as we move away from the maximum!
Yes, that formula doesn’t make sense (you forgot the 1⁄2, by the way). I believe 8.52/8.53 should not have a minus there, and 8.54 should have the minus that it’s missing. Also, 8.52 should have expected values or big-O-in-probability notation. This is a frequentist calculation, so I’d suggest a more standard reference like Ferguson.
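For what it’s worth, here is my reconstruction of what the corrected single-parameter expansion should look like, with the 1⁄2 and the signs handled consistently (a sketch, not a quote of Jaynes’s 8.52-8.54), assuming $\theta_0$ is the maximizing value:

$$\log p(x\mid\theta_0+\delta\theta) \approx \log p(x\mid\theta_0) + \frac{1}{2}\left.\frac{\partial^2 \log p(x\mid\theta)}{\partial\theta^2}\right|_{\theta_0}(\delta\theta)^2 = \log p(x\mid\theta_0) - \frac{1}{2}\,J(\theta_0)\,(\delta\theta)^2,$$

with $J(\theta_0) = -\left.\partial^2\log p(x\mid\theta)/\partial\theta^2\right|_{\theta_0} > 0$ at the maximum, so the log density does fall off quadratically as you move away from it; replacing $J(\theta_0)$ by its expectation under the model gives the Fisher information.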