Thanks for this sequence, I’ve read each post 3 or 4 times to try to properly get it.
Am I right in thinking that in order to replace dP[θ] with dθ, we require not only a uniform prior but also that θ span a unit volume?
Correct. In general, dP[θ]=p[θ]dθ is the probability density of θ, so if it’s uniform on a unit volume then p[θ]=1.
The main advantage of this notation is that it’s parameterization-independent. For example: in a coin-flipping example, we could have a uniform prior over the frequency of heads p_H, so dP[p_H] = dp_H. But then, we could re-write that frequency in terms of the odds o_H = p_H/(1−p_H), so we’d get p_H = o_H/(1+o_H) and
dP[o_H] = dP[p_H] = dp_H = d(o_H/(1+o_H)) = do_H/(1+o_H)^2
So the probability density p[p_H] = 1 is equivalent to the density p[o_H] = 1/(1+o_H)^2. (That first step, dP[o_H] = dP[p_H], is because these two variables contain exactly the same information in two different forms; that’s the parameterization independence. After that, it’s math: substitute and differentiate.)
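If you want to sanity-check that density numerically, here is a minimal sketch. It draws the frequency of heads uniformly, converts each draw to odds, and compares the empirical CDF of the odds against the CDF implied by the density 1/(1+o)^2, which integrates to o/(1+o). The function names, sample size, and seed are my own choices for illustration, not anything from the posts.

```python
import random

def odds_cdf(o):
    """CDF implied by the density 1/(1+o)^2 on o >= 0, i.e. o/(1+o)."""
    return o / (1.0 + o)

def empirical_odds_cdf(o, n_samples=100_000, seed=0):
    """Fraction of samples with odds <= o when p_H is drawn uniformly on [0, 1)."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_samples):
        p = rng.random()          # uniform prior on the frequency p_H
        if p / (1.0 - p) <= o:    # convert to odds o_H = p_H / (1 - p_H)
            count += 1
    return count / n_samples

# Compare the sampled CDF with the analytic one at a few (arbitrary) odds values.
for o in (0.25, 1.0, 4.0):
    print(o, empirical_odds_cdf(o), odds_cdf(o))
```

The two columns should agree up to sampling noise, which is just the statement dP[o_H] = dP[p_H] checked on intervals.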
(Notice that the uniform prior on p_H is not uniform over o_H. This is one of the main reasons why “use a uniform prior” is not a good general-purpose rule for choosing priors: it depends on what parameters we choose. Cartesian and polar coordinates give different “uniform” priors.)
The moral of the story is that, when dealing with continuous probability densities, the fundamental “thing” is not the density function p[θ] but the density times the differential p[θ]dθ, which we call dP[θ]. This is important mainly when changing coordinates: if we have some coordinate change θ(ϕ), then p[θ(ϕ)]dθ(ϕ)=p[ϕ]dϕ, but p[θ(ϕ)]≠p[ϕ].
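Spelled out for a one-dimensional, monotonic coordinate change (an assumption made here to keep it simple), the rule looks like the following. The subscripts on p_θ and p_ϕ are added notation to make explicit that these are two different functions.

```latex
\[
dP \;=\; p_\theta[\theta]\,d\theta
   \;=\; p_\theta[\theta(\phi)]\,\left|\frac{d\theta}{d\phi}\right| d\phi
   \;=\; p_\phi[\phi]\,d\phi
\qquad\Longrightarrow\qquad
p_\phi[\phi] \;=\; p_\theta[\theta(\phi)]\,\left|\frac{d\theta}{d\phi}\right|
\]
```

In the coin example, θ is the frequency of heads and ϕ is the odds, so dθ/dϕ = 1/(1+ϕ)^2, which is exactly the extra factor that turns the uniform density into 1/(1+ϕ)^2.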
If anybody wants an exercise with this: try transforming ∫_θ e^{P[data|θ]} dP[θ] = ∫_θ e^{P[data|θ]} p[θ] dθ to a different coordinate system. Apply Laplace’s approximation in both systems, and confirm that they yield the same answer. (This should mainly involve applying the chain rule twice to the Hessian; if you get stuck, remember that θ_max is a maximum point and consider what that implies.)
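For a numerical check on your answer (not a substitute for the derivation), here is a rough sketch using the coin example. It assumes the integrand is a coin-flip likelihood integrated against dP, and it uses a Laplace-style estimate that expands the log-likelihood around its maximum and evaluates the prior density there. The data (7 heads, 3 tails) and the helper name `laplace_estimate` are hypothetical.

```python
import math

def laplace_estimate(log_lik, prior_density, x_max, eps=1e-4):
    """Laplace-style estimate of the integral of exp(log_lik(x)) * prior_density(x) dx.

    Assumes x_max maximizes log_lik; the second derivative is a central difference.
    """
    second = (log_lik(x_max + eps) - 2.0 * log_lik(x_max) + log_lik(x_max - eps)) / eps**2
    return math.exp(log_lik(x_max)) * prior_density(x_max) * math.sqrt(2.0 * math.pi / -second)

heads, tails = 7, 3  # hypothetical data

# Frequency coordinate p_H with a uniform prior, p[p_H] = 1.
log_lik_p = lambda p: heads * math.log(p) + tails * math.log(1.0 - p)
est_p = laplace_estimate(log_lik_p, lambda p: 1.0, heads / (heads + tails))

# Odds coordinate o_H = p_H / (1 - p_H), where the same prior has density 1 / (1 + o_H)^2.
log_lik_o = lambda o: log_lik_p(o / (1.0 + o))
est_o = laplace_estimate(log_lik_o, lambda o: 1.0 / (1.0 + o) ** 2, heads / tails)

print(est_p, est_o)
```

The two estimates should match up to finite-difference error. Working out why the Jacobian from the prior cancels against the one from the second derivative is the exercise.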