Generalized Jeffrey Prior for singular models?
For singular models the Jeffreys prior is not well-behaved, for the simple reason that its density vanishes at minima of the loss function (where the Fisher information matrix degenerates).
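A minimal numerical illustration of this vanishing (a toy example of my own, not from the post): for a unit-variance Gaussian model with mean μ(w) = w², the Fisher information is I(w) = (dμ/dw)² = 4w², so the Jeffreys density √I(w) = 2|w| is exactly zero at w = 0, which is the minimum of the population loss when the true mean is 0.

```python
import numpy as np

def fisher_info(w):
    """Fisher information for N(mu(w), 1) with mu(w) = w**2.
    For a unit-variance Gaussian location family, I(w) = mu'(w)**2."""
    dmu_dw = 2.0 * w
    return dmu_dw ** 2

def jeffreys_density(w):
    """Unnormalized Jeffreys prior density sqrt(I(w)) (1-d case)."""
    return np.sqrt(fisher_info(w))

for w in np.linspace(-1.0, 1.0, 5):
    print(w, jeffreys_density(w))
# density is 2|w|: positive away from 0, exactly 0 at the singular point w = 0
```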
Does this mean the Jeffreys prior is only of interest in regular models? I beg to differ.
Usually the Jeffreys prior is derived as a parameterization-invariant prior. There is another way of thinking about the Jeffreys prior: as arising from an ‘indistinguishability prior’.
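For reference, the standard object under discussion: the Jeffreys density is built from the Fisher information,

```latex
\varphi(w) \;\propto\; \sqrt{\det I(w)},
\qquad
I_{ij}(w) \;=\; \mathbb{E}_{x \sim p(x \mid w)}\!\left[
  \partial_{w_i} \log p(x \mid w)\;
  \partial_{w_j} \log p(x \mid w)
\right].
```

Under a smooth reparameterization v = g(w) the Fisher matrix transforms as I(w) = Jᵀ I(v) J with J the Jacobian, so √det I picks up exactly the |det J| factor needed for the density to be parameterization-invariant. For singular models, det I(w) = 0 along the non-identifiable locus, which is precisely why this density degenerates there.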
The argument is delightfully simple: given two weights w1,w2∈W, if they encode the same distribution, p(x|w1)=p(x|w2), then our prior weights on them should intuitively be the same: ϕ(w1)=ϕ(w2). Two distinct weights encoding the same distribution means the model exhibits non-identifiability, making it non-regular (hence singular). However, even regular models exhibit ‘approximate non-identifiability’.
For a given dataset DN of size N sampled from the true distribution q, and tolerances ϵ1, ϵ2, we get a whole set of weights WN,ϵ⊂W on which the probability that p(x|w1) does more than ϵ1 better than p(x|w2) on the loss over DN is less than ϵ2.
In other words, these are sets of weights that are probably approximately indistinguishable. Intuitively, we should assign an (approximately) uniform prior on these approximately indistinguishable regions. This puts strong constraints on the possible prior.
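This definition can be sketched by Monte Carlo (a toy setup of my own: q is Bernoulli(0.5), the model is Bernoulli(w), and all function names are mine): estimate the probability that w1 beats w2 by more than ϵ1 in mean log-loss on a dataset of size N.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_log_loss(data, w):
    """Mean negative log-likelihood of Bernoulli(w) on binary data."""
    return -np.mean(data * np.log(w) + (1 - data) * np.log(1 - w))

def prob_distinguishable(w1, w2, q=0.5, N=100, eps1=0.01, trials=2000):
    """Estimate P( loss(w2) - loss(w1) > eps1 ) over datasets D_N ~ q."""
    count = 0
    for _ in range(trials):
        data = rng.binomial(1, q, size=N)
        if mean_log_loss(data, w2) - mean_log_loss(data, w1) > eps1:
            count += 1
    return count / trials

# Nearby weights are rarely distinguishable; distant ones almost always are.
print(prob_distinguishable(0.50, 0.51))  # small
print(prob_distinguishable(0.50, 0.90))  # near 1
```

Shrinking eps1 or growing N shrinks the indistinguishable region around any given w, which is the intuition behind the prior concentrating where the Fisher information is large.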
The downside of this is that it requires us to know the true distribution q. Instead of asking whether w1,w2 are approximately indistinguishable when sampling from q, we can ask whether w1 is approximately indistinguishable from w2 when sampling from p(x|w2). For regular models this also leads to the Jeffreys prior; see this paper.
However, the Jeffreys prior is only an approximation of this indistinguishability prior. We could also work out what the exact prior is, to obtain something that might work for singular models.
EDIT: Another approach to generalizing the Jeffreys prior might be to follow an MDL optimal-coding argument; see this paper.
You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, one which incorporates derivatives (jets on the geometric/statistical side, and more of the algorithm behind the model on the logical side).
Is this the jet prior I’ve been hearing about?
I argued above that given two weights w1,w2 with (approximately) the same conditional distribution p(x|y,w1)≅p(x|y,w2), the ‘natural’ or ‘canonical’ prior should assign them equal prior weight: ϕ(w1)=ϕ(w2). A more sophisticated version of this idea is used to argue for the Jeffreys prior as a canonical prior.
Some further thoughts:
Imposing this uniformity condition would actually contradict some versions of Occam’s razor. Indeed, w1 could be algorithmically much more complex (i.e. have a much higher description length) than w2, yet they might still have similar or identical predictions.
The difference between equal-on-the-nose and merely similar might be very material. Two conditional probability distributions might be quite similar [a related issue here is that the KL-divergence is asymmetric, so ‘similarity’ is a somewhat ill-defined concept], yet one might intrinsically require far more computational resources.
A very simple example: take the uniform distribution p_uniform(x)=1/N, and another distribution p′(x) that is a small perturbation of the uniform distribution but whose exact probabilities p′(x) have expansions with very large description length (this can be produced by appending long random strings to the binary expansions).
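This example can be made quantitative (a sketch with magnitudes of my own choosing): perturb a uniform distribution on N outcomes by incompressible noise of size ~1e-6. The KL divergence from uniform comes out tiny, yet writing down p′ exactly requires all the random digits, so its description length is large.

```python
import numpy as np

rng = np.random.default_rng(42)

N = 1024
p_uniform = np.full(N, 1.0 / N)

# Perturb by mean-zero incompressible noise; each entry of p_prime now
# carries ~50 random bits, so its exact table has large description length.
noise = rng.uniform(-1e-6, 1e-6, size=N)
noise -= noise.mean()                # keep the probabilities summing to 1
p_prime = p_uniform + noise

kl = np.sum(p_prime * np.log(p_prime / p_uniform))
print(kl)  # tiny: the two distributions are statistically near-indistinguishable
```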
[caution: CompMech propaganda incoming] More realistic examples do occur, e.g. in finding optimal predictors of dynamical systems at the edge of chaos. See the section on ‘intrinsic computation of the period-doubling cascade’, pp. 27-28 of Calculi of Emergence, for a classical example.
Asking for the prior ϕ to be uniform on weights wi that have equal/similar conditional distributions p(x|y,wi) seems very natural, but it doesn’t specify how the prior should relate weights with different conditional distributions. Say we have two weights w1, w2 with very different conditional probability distributions, and let Wi={w∈W | p(x|y,w)≅p(x|y,wi)}. How should we compare the prior weights ϕ(W1), ϕ(W2)?
Suppose I double the number of w∈W1, i.e. W1↦W′1, where we enlarge W↦W′ such that W′1 has double the volume of W1 and everything else is fixed. Should we have ϕ(W1)=ϕ(W′1), or should the prior weight ϕ(W′1) be larger? In the former case, the prior weight ϕ(w) has to be reweighted depending on how many w′ there are with similar conditional probability distributions; in the latter case it doesn’t. (Note that this is related to, but distinct from, the parameterization-invariance condition of the Jeffreys prior.)
I can see arguments for both:
On the one hand, we could want to impose the condition that quotienting out by the relation w1∼w2 whenever p(x|y,w1)=p(x|y,w2) does not affect the model (and thereby the prior) at all.
On the other hand, one could argue that the Solomonoff prior would not impose ϕ(W1)=ϕ(W′1): if one finds more programs that yield p(x|y,w1), maybe one should put higher a priori credence on p(x|y,w1).
The RLCT λ(w′) of the new elements w′∈W′1−W1 could also behave wildly differently from that of w∈W1. This suggests that the above analysis is not at the right conceptual level, and that one needs a more refined notion of model identity.
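The two conventions above can be stated very concretely (a toy sketch, with all names and numbers mine): partition W into indistinguishability classes W_i and compare a volume-weighted prior ϕ(W_i) ∝ vol(W_i) against a class-uniform prior ϕ(W_i) = 1/k. Doubling vol(W_1) changes the first and leaves the second untouched.

```python
def volume_weighted_prior(vols):
    """phi(W_i) proportional to vol(W_i): duplicated weights gain prior mass."""
    total = sum(vols)
    return [v / total for v in vols]

def class_uniform_prior(vols):
    """phi(W_i) = 1/k regardless of volume: duplication is quotiented out."""
    k = len(vols)
    return [1.0 / k for _ in vols]

vols = [1.0, 1.0, 2.0]            # volumes of three indistinguishability classes
vols_doubled = [2.0, 1.0, 2.0]    # double the volume of W_1, all else fixed

print(volume_weighted_prior(vols))          # [0.25, 0.25, 0.5]
print(volume_weighted_prior(vols_doubled))  # [0.4, 0.2, 0.4]  -- W_1 gains mass
print(class_uniform_prior(vols_doubled))    # [1/3, 1/3, 1/3]  -- unchanged
```

The Solomonoff-style argument favors the first convention; the quotienting argument favors the second.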
Your comment about a more refined kind of model identity using jets sounds intriguing. Here is a related thought.
In the earlier discussion with Joar Skalse there was a lot of debate around whether a priori simplicity (description length, or Kolmogorov complexity according to Joar) is actually captured by the RLCT. It is possible to create examples where the RLCT and the algorithmic complexity diverge.
I haven’t had the chance to think about this very deeply, but my superficial impression is that the RLCT λ(Wa) is best thought of as measuring a relative model complexity between Wa and W, rather than as an absolute measure of the complexity of either.
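One way to see the RLCT as a geometric (rather than algorithmic) complexity measure: for a population loss K(w) = w^(2k) on a neighborhood, the partition function Z_N = ∫ exp(−N K(w)) dw scales as c·N^(−λ) with λ = 1/(2k). A crude numerical fit (a sketch, grid integration of my own) recovers λ for the regular case k = 1 and the singular case k = 2.

```python
import numpy as np

def log_Z(N, k, grid=200_001, half_width=1.0):
    """log of Z_N = integral of exp(-N * w**(2k)) over [-1, 1] (Riemann sum)."""
    w = np.linspace(-half_width, half_width, grid)
    vals = np.exp(-N * w ** (2 * k))
    return np.log(np.sum(vals) * (w[1] - w[0]))

def estimate_rlct(k):
    """Fit the slope of -log Z_N against log N between two large N values."""
    N1, N2 = 1e4, 1e6
    return -(log_Z(N2, k) - log_Z(N1, k)) / (np.log(N2) - np.log(N1))

print(estimate_rlct(1))  # ~0.5  : regular quadratic minimum, lambda = d/2 = 1/2
print(estimate_rlct(2))  # ~0.25 : singular K(w) = w**4, lambda = 1/4 < d/2
```

The exponent only makes sense relative to the ambient integration over W, which is one way to read the “relative, not absolute, complexity” point above.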
(I have more thoughts about relations with MDL, but they’re too scattered; I’m going to post now.)
I think there’s no such thing as parameters, just processes that produce better and better approximations to parameters, and the only “real” measures of complexity have to do with the invariants that determine the costs of those processes, which in statistical learning theory are primarily geometric (somewhat tautologically, since the process of approximation is essentially a process of probing the geometry of the governing potential near the parameter).
From that point of view, trying to conflate parameters w1,w2 such that p(x|w1)≈p(x|w2) is naive, because w1,w2 aren’t real; only processes that produce better approximations to them are real. So the ∂/∂w derivatives of p(x|w1),p(x|w2), which control such processes, are deeply important, and those could be quite different even when p(x|w1) and p(x|w2) are quite similar.
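A concrete instance of this (a toy example of my own): two parameterizations of a unit-variance Gaussian mean, μ_a(w) = w and μ_b(w) = w³, encode the identical distribution at w = 1, but their derivatives there differ, so the Fisher information, which controls the local learning process, differs by a factor of 9.

```python
def fisher_info_mean_model(dmu_dw):
    """For N(mu(w), 1), the Fisher information is I(w) = (dmu/dw)**2."""
    return dmu_dw ** 2

w = 1.0
mu_a, dmu_a = w, 1.0               # mu_a(w) = w    -> mu_a'(1) = 1
mu_b, dmu_b = w ** 3, 3 * w ** 2   # mu_b(w) = w**3 -> mu_b'(1) = 3

assert mu_a == mu_b  # identical distribution at w = 1 ...
print(fisher_info_mean_model(dmu_a))  # 1.0
print(fisher_info_mean_model(dmu_b))  # 9.0  ... very different local geometry
```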
So I view “local geometry matters” and “the real things are processes approximating parameters, not the parameters themselves” as basically synonymous.