Generalized Jeffrey Prior for singular models?
For singular models the Jeffreys prior is not well-behaved, for the simple reason that its density vanishes at minima of the loss function (where the Fisher information matrix degenerates).
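A minimal numerical illustration of this vanishing (a toy example of my own, not from the post): for a unit-variance Gaussian model with mean μ(w) = w², the Fisher information is I(w) = (dμ/dw)² = 4w², so the Jeffreys density √I(w) = 2|w| is exactly zero at w = 0, which is the minimum of the population loss when the true mean is 0.

```python
import numpy as np

def fisher_info(w):
    """Fisher information for N(mu(w), 1) with mu(w) = w**2.
    For a unit-variance Gaussian location family, I(w) = mu'(w)**2."""
    dmu_dw = 2.0 * w
    return dmu_dw ** 2

def jeffreys_density(w):
    """Unnormalized Jeffreys prior density sqrt(I(w)) (1-d case)."""
    return np.sqrt(fisher_info(w))

for w in np.linspace(-1.0, 1.0, 5):
    print(w, jeffreys_density(w))
# density is 2|w|: positive away from 0, exactly 0 at the singular point w = 0
```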
Does this mean the Jeffreys prior is only of interest in regular models? I beg to differ.
Usually the Jeffreys prior is derived as a parameterization-invariant prior. There is another way of thinking about the Jeffreys prior: as arising from an ‘indistinguishability prior’.
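For reference, the standard object under discussion: the Jeffreys density is built from the Fisher information,

```latex
\varphi(w) \;\propto\; \sqrt{\det I(w)},
\qquad
I_{ij}(w) \;=\; \mathbb{E}_{x \sim p(x \mid w)}\!\left[
  \partial_{w_i} \log p(x \mid w)\;
  \partial_{w_j} \log p(x \mid w)
\right].
```

Under a smooth reparameterization v = g(w) the Fisher matrix transforms as I(w) = Jᵀ I(v) J with J the Jacobian, so √det I picks up exactly the |det J| factor needed for the density to be parameterization-invariant. For singular models, det I(w) = 0 along the non-identifiable locus, which is precisely why this density degenerates there.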
The argument is delightfully simple: given two weights w1,w2∈W, if they encode the same distribution, p(x|w1)=p(x|w2), then our prior weights on them should intuitively be the same: ϕ(w1)=ϕ(w2). Two distinct weights encoding the same distribution means the model exhibits non-identifiability, making it non-regular (hence singular). However, even regular models exhibit ‘approximate non-identifiability’.
For a given dataset DN of size N sampled from the true distribution q, and tolerances ϵ1, ϵ2, we get a whole set of weights WN,ϵ⊂W on which the probability that p(x|w1) does more than ϵ1 better than p(x|w2) on the loss over DN is less than ϵ2.
In other words, these are sets of weights that are probably approximately indistinguishable. Intuitively, we should assign an (approximately) uniform prior on these approximately indistinguishable regions. This puts strong constraints on the possible prior.
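This definition can be sketched by Monte Carlo (a toy setup of my own: q is Bernoulli(0.5), the model is Bernoulli(w), and all function names are mine): estimate the probability that w1 beats w2 by more than ϵ1 in mean log-loss on a dataset of size N.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_log_loss(data, w):
    """Mean negative log-likelihood of Bernoulli(w) on binary data."""
    return -np.mean(data * np.log(w) + (1 - data) * np.log(1 - w))

def prob_distinguishable(w1, w2, q=0.5, N=100, eps1=0.01, trials=2000):
    """Estimate P( loss(w2) - loss(w1) > eps1 ) over datasets D_N ~ q."""
    count = 0
    for _ in range(trials):
        data = rng.binomial(1, q, size=N)
        if mean_log_loss(data, w2) - mean_log_loss(data, w1) > eps1:
            count += 1
    return count / trials

# Nearby weights are rarely distinguishable; distant ones almost always are.
print(prob_distinguishable(0.50, 0.51))  # small
print(prob_distinguishable(0.50, 0.90))  # near 1
```

Shrinking eps1 or growing N shrinks the indistinguishable region around any given w, which is the intuition behind the prior concentrating where the Fisher information is large.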
The downside of this is that it requires us to know the true distribution q. Instead of asking whether w1,w2 are approximately indistinguishable when sampling from q, we can ask whether w1 is approximately indistinguishable from w2 when sampling from p(x|w2). For regular models this also leads to the Jeffreys prior; see this paper.
However, the Jeffreys prior is only an approximation of this indistinguishability prior. We could also work out what the exact prior is, to obtain something that might work for singular models.
EDIT: Another approach to generalizing the Jeffreys prior might be to follow an MDL optimal-coding argument; see this paper.
You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, one which incorporates derivatives (jets on the geometric/statistical side, and more of the algorithm behind the model on the logical side).
Is this the jet prior I’ve been hearing about?
I argued above that given two weights w1,w2 with (approximately) the same conditional distribution p(x|y,w1)≅p(x|y,w2), the ‘natural’ or ‘canonical’ prior should assign them equal prior weight: ϕ(w1)=ϕ(w2). A more sophisticated version of this idea is used to argue for the Jeffreys prior as a canonical prior.
Some further thoughts:
Imposing this uniformity condition would actually contradict some versions of Occam’s razor. Indeed, w1 could be algorithmically much more complex (i.e. have a much higher description length) than w2, yet they might still have similar or identical predictions.
The difference between equal-on-the-nose and merely similar might be very material. Two conditional probability distributions might be quite similar [a related issue here is that the KL-divergence is asymmetric, so ‘similarity’ is a somewhat ill-defined concept], yet one might intrinsically require far more computational resources.
A very simple example: take the uniform distribution p_uniform(x)=1/N, and another distribution p′(x) that is a small perturbation of the uniform distribution but whose exact probabilities p′(x) have expansions with very large description length (this can be produced by appending long random strings to the binary expansions).
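This example can be made quantitative (a sketch with magnitudes of my own choosing): perturb a uniform distribution on N outcomes by incompressible noise of size ~1e-6. The KL divergence from uniform comes out tiny, yet writing down p′ exactly requires all the random digits, so its description length is large.

```python
import numpy as np

rng = np.random.default_rng(42)

N = 1024
p_uniform = np.full(N, 1.0 / N)

# Perturb by mean-zero incompressible noise; each entry of p_prime now
# carries ~50 random bits, so its exact table has large description length.
noise = rng.uniform(-1e-6, 1e-6, size=N)
noise -= noise.mean()                # keep the probabilities summing to 1
p_prime = p_uniform + noise

kl = np.sum(p_prime * np.log(p_prime / p_uniform))
print(kl)  # tiny: the two distributions are statistically near-indistinguishable
```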
[caution: CompMech propaganda incoming] More realistic examples do occur, e.g. in finding optimal predictors of dynamical systems at the edge of chaos. See the section on ‘intrinsic computation of the period-doubling cascade’, pp. 27-28 of Calculi of Emergence, for a classical example.
Asking for the prior ϕ to be uniform on weights wi that have equal/similar conditional distributions p(x|y,wi) seems very natural, but it doesn’t specify how the prior should relate weights with different conditional distributions. Say we have two weights w1, w2 with very different conditional probability distributions, and let Wi={w∈W | p(x|y,w)≅p(x|y,wi)}. How should we compare the prior weights ϕ(W1), ϕ(W2)?
Suppose I double the number of w∈W1, i.e. W1↦W′1, where we enlarge W↦W′ such that W′1 has double the volume of W1 and everything else is fixed. Should we have ϕ(W1)=ϕ(W′1), or should the prior weight ϕ(W′1) be larger? In the former case, the prior weight ϕ(w) has to be reweighted depending on how many w′ there are with similar conditional probability distributions; in the latter case it doesn’t. (Note that this is related to, but distinct from, the parameterization-invariance condition of the Jeffreys prior.)
I can see arguments for both:
On the one hand, we could want to impose the condition that quotienting out by the relation w1∼w2 whenever p(x|y,w1)=p(x|y,w2) does not affect the model (and thereby the prior) at all.
On the other hand, one could argue that the Solomonoff prior would not impose ϕ(W1)=ϕ(W′1): if one finds more programs that yield p(x|y,w1), maybe one should put higher a priori credence on p(x|y,w1).
The RLCT λ(w′) of the new elements w′∈W′1−W1 could also behave wildly differently from that of w∈W1. This suggests that the above analysis is not at the right conceptual level, and that one needs a more refined notion of model identity.
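The two conventions above can be stated very concretely (a toy sketch, with all names and numbers mine): partition W into indistinguishability classes W_i and compare a volume-weighted prior ϕ(W_i) ∝ vol(W_i) against a class-uniform prior ϕ(W_i) = 1/k. Doubling vol(W_1) changes the first and leaves the second untouched.

```python
def volume_weighted_prior(vols):
    """phi(W_i) proportional to vol(W_i): duplicated weights gain prior mass."""
    total = sum(vols)
    return [v / total for v in vols]

def class_uniform_prior(vols):
    """phi(W_i) = 1/k regardless of volume: duplication is quotiented out."""
    k = len(vols)
    return [1.0 / k for _ in vols]

vols = [1.0, 1.0, 2.0]            # volumes of three indistinguishability classes
vols_doubled = [2.0, 1.0, 2.0]    # double the volume of W_1, all else fixed

print(volume_weighted_prior(vols))          # [0.25, 0.25, 0.5]
print(volume_weighted_prior(vols_doubled))  # [0.4, 0.2, 0.4]  -- W_1 gains mass
print(class_uniform_prior(vols_doubled))    # [1/3, 1/3, 1/3]  -- unchanged
```

The Solomonoff-style argument favors the first convention; the quotienting argument favors the second.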
Your comment about a more refined kind of model identity using jets sounds intriguing. Here is a related thought.
In the earlier discussion with Joar Skalse there was a lot of debate around whether a priori simplicity (description length, or Kolmogorov complexity according to Joar) is actually captured by the RLCT. It is possible to create examples where the RLCT and the algorithmic complexity diverge.
I haven’t had the chance to think about this very deeply, but my superficial impression is that the RLCT λ(Wa) is best thought of as measuring a relative model complexity between Wa and W, rather than as an absolute measure of the complexity of either.
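One way to see the RLCT as a geometric (rather than algorithmic) complexity measure: for a population loss K(w) = w^(2k) on a neighborhood, the partition function Z_N = ∫ exp(−N K(w)) dw scales as c·N^(−λ) with λ = 1/(2k). A crude numerical fit (a sketch, grid integration of my own) recovers λ for the regular case k = 1 and the singular case k = 2.

```python
import numpy as np

def log_Z(N, k, grid=200_001, half_width=1.0):
    """log of Z_N = integral of exp(-N * w**(2k)) over [-1, 1] (Riemann sum)."""
    w = np.linspace(-half_width, half_width, grid)
    vals = np.exp(-N * w ** (2 * k))
    return np.log(np.sum(vals) * (w[1] - w[0]))

def estimate_rlct(k):
    """Fit the slope of -log Z_N against log N between two large N values."""
    N1, N2 = 1e4, 1e6
    return -(log_Z(N2, k) - log_Z(N1, k)) / (np.log(N2) - np.log(N1))

print(estimate_rlct(1))  # ~0.5  : regular quadratic minimum, lambda = d/2 = 1/2
print(estimate_rlct(2))  # ~0.25 : singular K(w) = w**4, lambda = 1/4 < d/2
```

The exponent only makes sense relative to the ambient integration over W, which is one way to read the “relative, not absolute, complexity” point above.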
(I have more thoughts about relations with MDL, but they’re too scattered; I’m going to post now.)
I think there’s no such thing as parameters, just processes that produce better and better approximations to parameters, and the only “real” measures of complexity have to do with the invariants that determine the costs of those processes, which in statistical learning theory are primarily geometric (somewhat tautologically, since the process of approximation is essentially a process of probing the geometry of the governing potential near the parameter).
From that point of view, trying to conflate parameters w1,w2 such that p(x|w1)≈p(x|w2) is naive, because w1,w2 aren’t real; only processes that produce better approximations to them are real. So the ∂/∂w derivatives of p(x|w1),p(x|w2), which control such processes, are deeply important, and those could be quite different even when p(x|w1) and p(x|w2) are quite similar.
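A concrete instance of this (a toy example of my own): two parameterizations of a unit-variance Gaussian mean, μ_a(w) = w and μ_b(w) = w³, encode the identical distribution at w = 1, but their derivatives there differ, so the Fisher information, which controls the local learning process, differs by a factor of 9.

```python
def fisher_info_mean_model(dmu_dw):
    """For N(mu(w), 1), the Fisher information is I(w) = (dmu/dw)**2."""
    return dmu_dw ** 2

w = 1.0
mu_a, dmu_a = w, 1.0               # mu_a(w) = w    -> mu_a'(1) = 1
mu_b, dmu_b = w ** 3, 3 * w ** 2   # mu_b(w) = w**3 -> mu_b'(1) = 3

assert mu_a == mu_b  # identical distribution at w = 1 ...
print(fisher_info_mean_model(dmu_a))  # 1.0
print(fisher_info_mean_model(dmu_b))  # 9.0  ... very different local geometry
```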
So I view “local geometry matters” and “the real things are processes approximating parameters, not the parameters themselves” as basically synonymous.