> Very loosely speaking, regions with a low RLCT have a larger “volume” than regions with high RLCT, and the impact of this fact eventually dominates other relevant factors.
I’m going to make a few comments as I read through this, but first I’d like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn’t have done otherwise.
Regarding the point about volume: it is true that the RLCT can be written as (Theorem 7.1 of Watanabe’s book “Algebraic Geometry and Statistical Learning Theory”)
$$\lambda = \lim_{t \to 0} \frac{\log\big(V(at)/V(t)\big)}{\log a}$$
where $V(t) = \int_{K(w) < t} \varphi(w)\,dw$ is the volume (according to the measure associated to the prior) of the set of parameters $w$ with KL divergence $K(w)$ between the model and truth less than $t$. For small $t$ we have $V(t) \approx c\, t^{\lambda} (-\log t)^{m-1}$, where $m$ is the multiplicity. Thus near critical points $w^*$ with lower RLCT, small changes in the cutoff $t$ near $t \approx 0$ tend to change the volume of the set of almost true parameters more than near critical points with higher RLCTs.
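To make that volume-scaling picture concrete, here is a small Monte Carlo sketch (my own toy illustration, not anything from Watanabe; the uniform prior on $[-1,1]^2$ and the two toy choices of $K$ are assumptions made just for the example). It estimates $\lambda$ from the finite-$t$ ratio $\log(V(at)/V(t))/\log a$ for a regular quadratic and for the singular $K(w) = w_1^2 w_2^2$, whose learning coefficient is $1/2$ with multiplicity $m = 2$.

```python
import numpy as np

rng = np.random.default_rng(0)

def volume(K, t, n_samples=5_000_000):
    """Monte Carlo estimate of V(t) = vol{ w in [-1,1]^2 : K(w) < t }
    under a uniform prior on the square."""
    w = rng.uniform(-1.0, 1.0, size=(n_samples, 2))
    return 4.0 * np.mean(K(w) < t)

def rlct_estimate(K, t=1e-3, a=0.5):
    """Finite-t proxy for lambda = lim_{t->0} log(V(at)/V(t)) / log(a)."""
    return np.log(volume(K, a * t) / volume(K, t)) / np.log(a)

# Toy stand-ins for the KL divergence K(w), with known learning coefficients:
def K_regular(w):
    # Regular quadratic: lambda = d/2 = 1.
    return w[:, 0] ** 2 + w[:, 1] ** 2

def K_singular(w):
    # Degenerate singularity: lambda = 1/2, multiplicity m = 2.
    return (w[:, 0] * w[:, 1]) ** 2

print("regular :", rlct_estimate(K_regular))   # close to 1.0, up to Monte Carlo noise
print("singular:", rlct_estimate(K_singular))  # noticeably smaller; it drifts toward 0.5
                                               # only slowly because of the (-log t)^(m-1) factor
```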
My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space, and so you read the asymptotic formula for the free energy (where $W_\alpha$ is a region of parameter space containing a critical point $w^*_\alpha$)
$$F_n(W_\alpha) \approx n L_n(w^*_\alpha) + \lambda_\alpha \log n - (m-1)\log\log n + O_P(1)$$
as having a $\log n$ term that does little more than prefer critical points $w^*_\alpha$ that tend to dominate large regions of parameter space according to the prior. If that were true, I would agree this would be underwhelming (or at least, precisely as “whelming” as the BIC, and therefore not adding much beyond the classical story).
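(For comparison, the regular case underlying the BIC has $F_n \approx n L_n(\hat w) + \frac{d}{2}\log n$ with $d$ the number of parameters, i.e. it is the special case $\lambda = d/2$, $m = 1$.)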
However, this isn’t what the free energy formula says. Indeed, the volume $\int_{W_\alpha} \varphi(w)\, dw$ contributes only to the constant-order term (this is sketched in Chen et al).
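One quick way to see the order of that contribution (a sketch in my own words, not the precise argument in Chen et al): writing the local free energy as
$$F_n(W_\alpha) = -\log \int_{W_\alpha} e^{-n L_n(w)}\, \varphi(w)\, dw,$$
rescaling the prior mass on $W_\alpha$ by a constant $c$ only shifts $F_n(W_\alpha)$ by $-\log c$, which does not grow with $n$, whereas $\lambda_\alpha$ multiplies $\log n$.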
I claim it’s better to think of the learning coefficient $\lambda$ as a measure of how many bits it takes to specify an almost true parameter with $K(w) < \tfrac{1}{n+1}$ once you know a parameter with $K(w) < \tfrac{1}{n}$, which is a “microscopic” rather than “macroscopic” statement. That is, lower $\lambda$ means that a fixed decrease $\Delta K$ is “cheaper” in terms of entropy generated.
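A back-of-envelope version of this claim (keeping only the leading $t^{\lambda}$ behaviour of $V(t)$ and ignoring the $(-\log t)^{m-1}$ factor): the extra information, in nats, needed to localise within $\{K(w) < \tfrac{1}{n+1}\}$ given that you already know the region $\{K(w) < \tfrac{1}{n}\}$ is
$$-\log \frac{V(\tfrac{1}{n+1})}{V(\tfrac{1}{n})} \approx \lambda \log \frac{n+1}{n} \approx \frac{\lambda}{n},$$
and summing these increments from $n = 1$ up to the sample size recovers the $\lambda \log n$ term in the free energy.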
So the free energy formula isn’t saying “critical points $w^*_\alpha$ dominating large regions tend to dominate the posterior at large $n$” but rather “critical points $w^*_\alpha$ which require fewer bits / less entropy to achieve a fixed $\Delta K$ dominate the posterior for large $n$”. The former statement is both false and uninteresting; the latter is true and interesting (or I think so anyway).
> I’m going to make a few comments as I read through this, but first I’d like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn’t have done otherwise.
Thank you for the detailed responses! I very much enjoy discussing these topics :)
> My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space
My intuitions around the RLCT are very much geometrically informed, and I do think of it as being a kind of flatness measure. However, I don’t think of it as a “macroscopic” quantity, but rather a local quantity.
I think the rest of what you say coheres with my current picture, but I will have to think about it for a bit, and come back later!