I suppose the question is how to calculate priors so that they make sense. In particular, how can an AI estimate priors? I’m sure there is a lot of existing work on this. The problem with making statements about priors that lack a formal process for their calculation is that there is no basis for comparing two predictions: in the worst case, by adjusting the prior, the resulting probabilities can be pushed to any value. That makes the approach a formal technique which potentially just hides the unknowns in the priors, and is in effect no more reasonable, because the priors are a guess.
In particular, how can an AI estimate priors? I’m sure there is a lot of existing work on this.
There is. For example, one can use the Jeffreys prior, which has the desirable property of being invariant under different parametrization choices, or one can pick a prior according to the maximum entropy principle, which says to pick the prior with the greatest entropy that satisfies the model constraints. I don’t know if anyone’s come up with a meta-rationale that justifies one of these approaches over all others (or explains when to use different approaches), though.
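To make the Jeffreys prior concrete, here’s a minimal sketch for the Bernoulli case, where it works out to a Beta(1/2, 1/2) distribution. The mathematical form is standard; the scipy code below is just one way to write it down.

```python
from scipy.stats import beta

# For a Bernoulli(theta) likelihood the Fisher information is
# I(theta) = 1 / (theta * (1 - theta)), so the Jeffreys prior
# p(theta), proportional to sqrt(I(theta)), is
# theta^(-1/2) * (1 - theta)^(-1/2): exactly the Beta(1/2, 1/2) density.
jeffreys = beta(0.5, 0.5)

# By Beta-Bernoulli conjugacy, observing k successes in n trials
# updates Beta(1/2, 1/2) to Beta(k + 1/2, n - k + 1/2).
k, n = 3, 10
posterior = beta(k + 0.5, n - k + 0.5)
print(posterior.mean())  # ~0.318, shrunk slightly toward 0.5 from k/n = 0.3
```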
Thank you, this is very interesting. I’m not sure of the etiquette, but I’m reposting a question from an old article that I would really appreciate your thoughts on.
Is it correct to say that the entropy prior is a consequence of creating an internally consistent formalisation of the aesthetic heuristic of preferring simpler structures to complex ones?
If so, I was wondering whether it could be extended to reflect other aesthetics. For example, if an experiment produces a single result that is inconsistent with an existing simple physics theory, the simplest theory that explains the data may be to treat that result as an isolated exception; aesthetically, however, we find it more plausible that the exception is evidence of a larger theory of which this sample is one part.
In contrast, when attempting to understand the rules of a human system (e.g. a bureaucracy), a theory that lacks exceptions seems unlikely (“that’s a little too neat”). Indeed, stated informally, the phrase might go “in my experience, that’s a little too neat”, implying that we formulate priors from patterns learned through experience. In the case of the bureaucracy, this may stem from a probabilistic understanding of the types of system that result from a particular ‘maker’ (i.e. politics).
However, this moves the problem to one of classifying contexts and determining which contexts are relevant. If this classification process is considered part of the theory, it may considerably increase the theory’s complexity, so that theories which ignore context are always preferred. Unless, of course, the theory is complete (incorporating all contexts), in which case the simplest theory may share these contextual models and thus become the universal simplest model. It would therefore not be rational to apply Kolmogorov complexity to a problem in isolation; i.e. probability and reductionism are not compatible.
With the disclaimer that I’m no expert and quite possibly wrong about some of this, here goes.
Is it correct to say that the entropy prior is a consequence of creating an internally consistent formalisation of the aesthetic heuristic of preferring simpler structures to complex ones?
No. Or, at least, that’s not the conscious motivation for the maximum entropy principle (MAXENT). As I see it, the justification for MAXENT is that entropy measures the “uncertainty” the prior represents, and we should choose the prior that represents greatest uncertainty, because that means assuming the least possible additional information about the problem.
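For concreteness, the “uncertainty” here is the Shannon entropy of the candidate prior p:

```latex
H(p) = -\sum_i p_i \log p_i
```

which is largest for the distribution that commits to the least.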
Now, it does sometimes happen that MAXENT tells you to pick a prior with what I’d guess you think of as “simpler structure”. Suppose you’re hiding in your fist a 6-sided die I know nothing about, and you ask me to give you my probability distribution for which side’ll come up when you roll it. As I know nothing about the die, I have no basis for imposing additional constraints on the problem, so the only operative constraint is that P(1) + P(2) + P(3) + P(4) + P(5) + P(6) = 1; given just that constraint, MAXENT says I should assign probability 1⁄6 to each side.
In that particular case, MAXENT gives a nice, smooth, intuitively pleasing result. But if we impose a new constraint, e.g. that the expected value of the die roll is 4.5 (instead of the 3.5 implied by the uniform distribution), MAXENT says the appropriate probability distribution is {0.054, 0.079, 0.114, 0.165, 0.240, 0.348} for sides 1 to 6 respectively (from here), which doesn’t look especially simple to me. So for all but the most basic problems, I expect MAXENT doesn’t conform to the “simpler structures” heuristic.
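If you want to check those numbers, here’s a minimal numerical sketch that maximises the entropy directly, using scipy’s general-purpose optimiser rather than the closed-form exponential-family solution:

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)

def maxent_die(target_mean):
    """Maximise Shannon entropy H(p) = -sum(p * log(p)) over die
    distributions p, subject to sum(p) = 1 and E[roll] = target_mean."""
    constraints = [
        {"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: (faces * p).sum() - target_mean},
    ]
    result = minimize(
        lambda p: np.sum(p * np.log(p)),  # minimising -H(p) maximises H(p)
        x0=np.full(6, 1 / 6),
        bounds=[(1e-9, 1.0)] * 6,
        constraints=constraints,
        method="SLSQP",
    )
    return result.x

print(maxent_die(3.5).round(3))  # mean 3.5: uniform, each side 1/6
print(maxent_die(4.5).round(3))  # mean 4.5: ~[0.054 0.079 0.114 0.165 0.24 0.348]
```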
There is probably some definition of “simple” or “complex” that would make your heuristic equivalent to MAXENT, but I doubt it’d correspond to how we normally think of simplicity/complexity.
Thank you, that’s very interesting, and comforting.
In statistics, I think ‘weakly informative priors’ are becoming more popular. Weakly informative priors are distributions like a t distribution (or a normal) with a very wide standard deviation and low degrees of freedom. This lets us avoid spending all our data merely narrowing down the correct order of magnitude, which can be a problem when using non-informative priors. It’s almost never the case that we literally know nothing prior to the data.
Using a normal with a massive variance is also a standard hack for getting a proper “uninformative” prior on the real line.
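To illustrate both of these tricks (the specific numbers below are arbitrary illustrations, not recommendations):

```python
from scipy.stats import norm, t

# Two weakly informative priors on a real-valued parameter: a Student-t
# with few degrees of freedom and a wide scale, and a normal with a huge
# standard deviation. (Scale 10, df 3 and sd 100 are arbitrary choices.)
weak_t = t(df=3, loc=0, scale=10)
weak_normal = norm(loc=0, scale=100)

# Both are nearly flat over any plausible range, yet remain proper:
# they still assign little mass to absurd orders of magnitude.
for x in [1, 10, 100, 1000]:
    tail_t = 2 * weak_t.sf(x)       # P(|theta| > x) under the t prior
    tail_n = 2 * weak_normal.sf(x)  # P(|theta| > x) under the normal
    print(x, round(tail_t, 4), round(tail_n, 4))
```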