You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, one which incorporates derivatives (jets on the geometric/statistical side, and more of the algorithm behind the model on the logical side).
Is this the jet prior I’ve been hearing about?
I argued above that given two weights w1, w2 with (approximately) the same conditional distribution p(x|y,w1) ≅ p(x|y,w2), the ‘natural’ or ‘canonical’ prior should assign them equal prior weight ϕ(w1) = ϕ(w2). A more sophisticated version of this idea is used to argue for the Jeffreys prior as a canonical prior.
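For reference, a minimal statement of what ‘canonical’ means in the Jeffreys case, schematically in this thread’s notation (the expectation over x, y is with respect to the model and whatever input distribution is being assumed):

```latex
% Fisher information of the conditional model p(x|y,w), schematically:
\[
  I(w)_{ij} \;=\; \mathbb{E}_{x,y}\!\left[
     \partial_{w_i} \log p(x \mid y, w)\;\partial_{w_j} \log p(x \mid y, w)
  \right].
\]
% Jeffreys prior: proportional to the square root of det I(w). The
% determinant-of-Fisher form is exactly what makes it invariant under
% reparameterizations w -> u(w).
\[
  \varphi_{\mathrm{Jeffreys}}(w) \;\propto\; \sqrt{\det I(w)}.
\]
```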
Some further thoughts:
Imposing this uniformity condition would actually contradict some versions of Occam’s razor. Indeed, w1 could be algorithmically much more complex (i.e. have a much higher description length) than w2, and yet the two could still make similar or even identical predictions.
The difference between equal-on-the-nose and merely similar might be very material. Two conditional probability distributions might be quite similar [a related issue here is that the KL-divergence is asymmetric, so ‘similarity’ is a somewhat ill-defined concept], yet one may intrinsically require far more computational resources.
A very simple example is the uniform distribution p_uniform(x) = 1/N versus a distribution p′(x) that is a small perturbation of it, but whose exact probabilities have binary expansions with very large description length (such a p′ can be produced by appending long random strings to the binary expansions). A sketch of this construction is below.
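Here is a minimal numeric sketch of that construction (the helper names and the use of compressed size as a crude stand-in for description length are my own choices, not anything from the thread):

```python
import math
import random
import zlib
from fractions import Fraction

N = 8          # support size of the uniform distribution
BITS = 256     # length of the random bit-string appended to each probability

random.seed(0)

def perturbed_uniform(n, bits):
    """Exact rational distribution: uniform plus tiny perturbations whose
    binary expansions are 'bits'-long random strings, then renormalized."""
    probs = [Fraction(1, n) + Fraction(random.getrandbits(bits), 2 ** (bits + 10))
             for _ in range(n)]
    total = sum(probs)
    return [p / total for p in probs]

p_uniform = [Fraction(1, N)] * N
p_prime = perturbed_uniform(N, BITS)

def kl(p, q):
    return sum(float(pi) * math.log(float(pi) / float(qi)) for pi, qi in zip(p, q))

# Statistically the two are almost indistinguishable (and note the asymmetry):
print("KL(p || p') =", kl(p_uniform, p_prime))
print("KL(p' || p) =", kl(p_prime, p_uniform))

# ... but a crude description-length proxy (compressed size of the exact
# probabilities) is far larger for p' than for the uniform distribution.
def description_length_proxy(p):
    return len(zlib.compress(repr(p).encode()))

print("proxy description length, uniform:", description_length_proxy(p_uniform))
print("proxy description length, p'     :", description_length_proxy(p_prime))
```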
[caution: CompMech propaganda incoming] More realistic examples do occur, e.g. in finding optimal predictors of dynamical systems at the edge of chaos. See the section on ‘intrinsic computation of the period-doubling cascade’, pp. 27-28 of Crutchfield’s ‘Calculi of Emergence’, for a classical example.
Asking for the prior ϕ to be uniform across weights wi that have equal/similar conditional distributions p(x|y,wi) seems very natural, but it doesn’t specify how the prior should relate weights with different conditional distributions. Let’s say we have two weights w1, w2 with very different conditional probability distributions, and let Wi = {w ∈ W | p(x|y,w) ≅ p(x|y,wi)}. How should we compare the prior weights ϕ(W1), ϕ(W2)?
Suppose I double the number of w ∈ W1, i.e. W1 ↦ W′1, where we enlarge W ↦ W′ such that W′1 has double the volume of W1 and everything else is fixed. Should we have ϕ(W1) = ϕ(W′1), or should the prior weight ϕ(W′1) be larger? In the former case the prior weight ϕ(w) has to be reweighted depending on how many w′ there are with similar conditional probability distributions; in the latter case it doesn’t. (Note that this is related to, but distinct from, the parameterization-invariance condition of the Jeffreys prior.) One way of making the two options concrete is sketched below.
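Here is one hedged formalization of the two options (this framing is mine, assuming ϕ has a density and writing vol for the volume of a region of parameter space):

```latex
% Option (a): quotient-invariant. Keep the mass of the equivalence class
% fixed by reweighting the density on the enlarged class:
\[
  \varphi'(w) \;=\; \varphi(w)\,\frac{\mathrm{vol}(W_1)}{\mathrm{vol}(W'_1)}
  \quad (w \in W'_1)
  \qquad\Longrightarrow\qquad
  \varphi'(W'_1) \;=\; \varphi(W_1).
\]
% Option (b): multiplicity-counting. Leave the density alone, so that
% (before renormalizing over W') the class gains mass with its volume:
\[
  \varphi(W'_1) \;\approx\; 2\,\varphi(W_1).
\]
```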
I can see arguments for both:
We could want to impose the condition that quotienting out by the relation w1 ∼ w2 when p(x|y,w1) = p(x|y,w2) does not affect the model (and thereby the prior) at all.
On the other hand, one could argue that the Solomonoff prior would not impose ϕ(W1) = ϕ(W′1): if one finds more programs that yield p(x|y,w1), maybe one should put higher a priori credence on p(x|y,w1).
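For comparison, the weighting this intuition appeals to is the standard algorithmic-probability one (stated for a prefix-free universal machine U; reading ‘yield p(x|y,w1)’ as ‘compute that conditional distribution’ is my loose adaptation):

```latex
% Algorithmic probability of an object q: sum over all programs pi that
% compute q on the universal machine U, weighted by 2^{-length(pi)}.
\[
  M(q) \;=\; \sum_{\pi \,:\, U(\pi) = q} 2^{-\ell(\pi)}.
\]
% Every additional program computing q contributes a strictly positive term,
% so finding more programs that yield a given conditional distribution
% raises its prior mass rather than leaving it fixed.
```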
The RLCT λ(w′) of the new elements w′ ∈ W′1 − W1 could behave wildly differently from that of the w ∈ W1. This suggests that the above analysis is not at the right conceptual level and that one needs a more refined notion of model identity.
Your comment about a more refined type of model identity using jets sounds intriguing. Here is a related thought.
In the earlier discussion with Joar Skalse there was a lot of debate around whether a priori simplicity (description length, or Kolmogorov complexity according to Joar) is actually captured by the RLCT. It is possible to create examples where the RLCT and the algorithmic complexity diverge.
I haven’t had the chance to think about this very deeply, but my superficial impression is that the RLCT λ(Wa) is best thought of as measuring a relative model complexity between Wa and W, rather than as an absolute measure of the complexity of either W or Wa.
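For what it’s worth, the expansion in which the RLCT appears is inherently about the pair (truth, model) rather than a region in isolation; restricting the posterior integral to a neighbourhood Wa gives the local version λ(Wa) referred to above (standard Watanabe, stated from memory, regularity conditions assumed):

```latex
% Asymptotic expansion of the Bayes free energy F_n (Watanabe):
% L_n is the empirical negative log-likelihood at an optimal parameter w_0,
% lambda the RLCT and m its multiplicity.
\[
  F_n \;=\; n L_n(w_0) \;+\; \lambda \log n \;-\; (m - 1)\log\log n \;+\; O_p(1).
\]
```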
(I have more thoughts about the relation to MDL, but they’re too scattered; I’m going to post now.)
I think there’s no such thing as parameters, just processes that produce better and better approximations to parameters, and the only “real” measures of complexity have to do with the invariants that determine the costs of those processes, which in statistical learning theory are primarily geometric (somewhat tautologically, since the process of approximation is essentially a process of probing the geometry of the governing potential near the parameter).
From that point of view, trying to conflate parameters w1, w2 such that p(x|w1) ≈ p(x|w2) is naive: w1, w2 aren’t real, only processes that produce better approximations to them are real, and so the ∂/∂w derivatives of p(x|w1), p(x|w2), which control such processes, are deeply important, and those could be quite different even when p(x|w1) ≈ p(x|w2).
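A toy illustration of the point about derivatives (this example is mine, using the standard one-parameter Gaussian models from the SLT literature): two models can realize exactly the same distribution at a parameter while having completely different local geometry there.

```latex
% Two one-parameter models that realize exactly the same distribution
% N(0,1) at w = 0:
\[
  p_1(x \mid w) = \mathcal{N}(x;\, w,\, 1), \qquad
  p_2(x \mid w) = \mathcal{N}(x;\, w^3,\, 1).
\]
% Their KL divergences from the truth N(0,1) behave very differently near 0,
\[
  K_1(w) = \tfrac{1}{2}w^2, \qquad K_2(w) = \tfrac{1}{2}w^6,
\]
% so the local geometry (and e.g. the RLCT: 1/2 versus 1/6) is not determined
% by the realized distribution at w = 0 alone.
```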
So I view “local geometry matters” and “the real things are processes approximating parameters, not the parameters themselves” as basically synonymous.