Gelman Against Parsimony
In two posts, Bayesian stats guru Andrew Gelman argues against parsimony, even though parsimony seems to be favored ’round these parts, in particular via Solomonoff induction and BIC as imperfect formalizations of Occam’s Razor.
Gelman says:
I’ve never seen any good general justification for parsimony...
Maybe it’s because I work in social science, but my feeling is: if you can approximate reality with just a few parameters, fine. If you can use more parameters to fold in more information, that’s even better.
In practice, I often use simple models–because they are less effort to fit and, especially, to understand. But I don’t kid myself that they’re better than more complicated efforts!
My favorite quote on this comes from Radford Neal’s book, Bayesian Learning for Neural Networks, pp. 103-104: “Sometimes a simple model will outperform a more complex model . . . Nevertheless, I believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.”
...
...ideas like minimum-description-length, parsimony, and Akaike’s information criterion, are particularly relevant when models are estimated using least squares, maximum likelihood, or some other similar optimization method.
When using hierarchical models, we can avoid overfitting and get good descriptions without using parsimony–the idea is that the many parameters of the model are themselves modeled. See here for some discussion of Radford Neal’s ideas in favor of complex models, and see here for an example from my own applied research.
eh, this just seems like a repeat of arguments against greedy reductionism. Parsimony is good except when it loses information, but if you’re losing information you’re not being parsimonious correctly.
If there were a good way of distinguishing between losing information and losing noise, that would be useful.
So: Hamilton’s rule is not being parsimonious “correctly”?
Probably not. I’m not exactly sure what you mean by this question, since I don’t fully understand Hamilton’s rule, but in general evolutionary stuff only needs to be close enough to correct rather than actually correct.
Losing information isn’t a crime. The virtues of simple models go beyond Occam’s razor. Often, replacing a complex world with a complex model barely counts as progress—since complex models are hard to use and hard to understand.
Gelman wants to throw everything he can into his models—and then use multilevel (a.k.a. hierarchical) models to share information between exchangeable (or conditionally exchangeable) batches of parameters. The key concept: multilevel model structure makes the “effective number of parameters” become a quantity that is itself inferred from the data. So he can afford to take his “against parsimony” stance (which is really a stance against leaving potentially useful predictors out of his models) because his default model choice will induce parsimony just when the data warrant it.
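A minimal sketch of that data-driven parsimony (my own toy numbers, not Gelman’s code): in a two-level normal model with known within-group noise, each group mean gets pulled toward the grand mean by a shrinkage factor the data themselves determine.

```python
import numpy as np

# Toy two-level setup: 8 groups, 5 noisy observations each.
rng = np.random.default_rng(0)
true_means = rng.normal(0.0, 1.0, size=8)            # between-group spread
data = true_means[:, None] + rng.normal(0.0, 2.0, size=(8, 5))

group_means = data.mean(axis=1)
grand_mean = group_means.mean()
sigma2_n = 2.0**2 / 5          # sampling variance of each group mean (known here)

# Crude empirical-Bayes estimate of the between-group variance tau2.
tau2 = max(group_means.var(ddof=1) - sigma2_n, 1e-9)

# Shrinkage factor: near 0 when the groups look alike (the model behaves
# parsimoniously, pooling everything), near 1 when the data support
# genuinely distinct group effects.
w = tau2 / (tau2 + sigma2_n)
partial_pooled = grand_mean + w * (group_means - grand_mean)
```

The “effective number of parameters” tracks `w`: with `w` near zero the eight group means collapse toward one pooled estimate; with `w` near one they stay essentially unpooled.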
I think one of Gelman’s comments in the first link is helpful:
This is a strange statement for a Bayesian to make. Perhaps he means that there is no reason to require absolute parsimony, which is true; sometimes if you have enough data you can justify the use of complex models. But Bayesian methods certainly require relative parsimony, in the sense that the model complexity needs to be small compared to the quantity of information being modeled. Formally, let A be the entropy of the prior distribution, and B be the mutual information between the observed data and the model parameter(s). Then unless A is small compared to B (relative parsimony), Bayesian updates won’t substantially shift belief away from the prior, and the posterior will be just a minor modification of the prior, so the whole process of obtaining data and performing inference will have produced no actual change in belief.
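A toy numerical check of that condition (my own construction, using A and B as defined above): a uniform prior over 1000 barely-distinguishable coin-bias hypotheses has large entropy A, a single flip carries tiny mutual information B, and the posterior barely moves.

```python
import numpy as np

n = 1000
prior = np.full(n, 1.0 / n)              # uniform prior over 1000 hypotheses
theta = np.linspace(0.49, 0.51, n)       # each hypothesis: a coin bias near 0.5

A = -np.sum(prior * np.log2(prior))      # prior entropy, about 10 bits

def h2(p):                               # binary entropy in bits
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

# B = I(X; theta) = H(X) - E_theta[H(X | theta)] for a single flip X
B = h2(np.sum(prior * theta)) - np.sum(prior * h2(theta))

posterior = prior * theta                # Bayes update on observing heads
posterior /= posterior.sum()
shift = np.sum(posterior * np.log2(posterior / prior))   # KL(posterior || prior)
```

With A around ten bits and B a small fraction of a bit, the KL divergence from prior to posterior is correspondingly tiny: the update has produced almost no change in belief.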
The difference between the MDL philosophy and the Bayesian philosophy is actually quite minor. There are some esoteric technical arguments about things like whether one method or the other converges in the limit of infinite data, but at the end of the day the two philosophies say almost exactly the same thing.
Not really. Bayesian methods can model random noise. Then the model is of the same size as the data being modeled.
Parsimony is a prior, not an end goal. At least, that’s how it’s used in Solomonoff induction.
The reason the Solomonoff prior doesn’t apply to social sciences is that knowing the area of applicability gives you more information. Once you take that into account, along with the fact that you don’t have the input data or computational power to recompute the cumulative process that spat humans out (so the simple low-level theories are out of reach), your prior is skewed towards more complex models.
That doesn’t mean it doesn’t apply! “Knowing the area of applicability” is just some information you can update on after starting with a prior.
The Solomonoff prior doesn’t really apply to any kind of science, and it’s not even part of mainstream Bayesian statistics.
While you can argue whether simpler models are inherently better—basically arguing about the “texture” of the universe we live in—simple models definitely generalize better, so if you act based on a simpler model you have better confidence that things will work “as expected”. The flip side is that to have confidence in a complex model you need a lot more data, which is expensive in all kinds of ways.
You could claim that human attraction to simple models is due to their low cost and better generalization rather than because the “texture of the world” is simple, though unification in physics seems to indicate the latter.
Recommended reading: Boyd and Richerson’s Simple Models of Complex Phenomena.
“Everything should be made as simple as possible, but not simpler.”—Albert Einstein.
But yes, Occam’s Razor is not a natural law or anything like that. It’s a heuristic—something that usually points in the right direction but very much not guaranteed to be correct.
It’s arguably a bit more than that, on account of Solomonoff induction. An “Occamian” prior that weights computable hypotheses according to the fraction of computer-program-space occupied by programs that compute their consequences provably performs—in an admittedly somewhat artificial sense—at least as well in the long run as any other prior, provided the observations you see really are generated by something computable.
More practically, there has to be a complexity penalty in the following sense: no matter what probabilities you assign, almost all very complex hypotheses have to be very improbable, because otherwise your total probability would diverge.
Yes and any prior that doesn’t assign things zero probability has this property. Why that one in particular?
Oh yes, so it does. Let me therefore be both more precise and more accurate.
Let p be an Occamian prior in this sense and q any computable prior. Then as cousin_it remarks “a computable human cannot beat Solomonoff in accumulated log scores by more than a constant, even if the universe is uncomputable and loves the human”; in other words, whatever q is—however much information about the world is built into it in advance—it can’t do much better than p, even though p encodes no information about the world (it can’t since what the theorem says is that even if you choose what the world does pessimally-for-p, it still does pretty well). This is not true for arbitrary priors.
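The standard dominance result behind that constant, stated from memory in its usual form: the Solomonoff mixture M multiplicatively dominates every computable distribution q,

```latex
M(x) \;\ge\; 2^{-K(q)}\, q(x) \quad \text{for all finite strings } x,
\qquad\text{hence}\qquad
-\log_2 M(x) \;\le\; -\log_2 q(x) + K(q),
```

where K(q) is roughly the length of the shortest program computing q. So M’s accumulated log-score can trail q’s by at most the constant K(q), no matter how the data are generated.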
Well, since Solomonoff is uncomputable, this isn’t really a fair comparison.
I wasn’t arguing that we should all be actually doing Solomonoff induction. (Clearly we can’t.) I was saying that there is a somewhat-usable sense in which preferring simpler hypotheses seems to be The Right Thing, or at least A Right Thing. Namely, that basing your probabilities miraculously accurately on simplicity leads to good results. The same isn’t true if you put something other than “simplicity” in that statement.
I wonder whether there are any theorems along similar lines that don’t involve any uncomputable priors. (Something handwavily along the following lines: If p,q are two computable priors and p is dramatically enough “closer to Occamian” than q, then an agent with p as prior will “usually” do better than an agent with q as prior. But I have so far not thought of any statement of this kind that’s both credible and interesting.)
My impression is that Solomonoff induction starts by assuming Occam’s Razor.
That’s not a problem—all simple hypotheses can be just as improbable.
Again, I am not saying that Occam’s Razor is not a useful heuristic. It is. But it is not evidence.
Can you restate what you consider the use of Occam’s Razor to be, and what you consider evidence to be for?
Because from my perspective the purpose of evidence is to increase/decrease my confidence in various statements, and it seems to me that Occam’s Razor is useful for doing precisely that. So this distinction doesn’t make a lot of sense to me, and rereading the thread doesn’t clarify matters.
The fact that it buys you something interesting without making that assumption was the whole point of the paragraph you were commenting on.
I don’t believe that is true. Perhaps I’ve been insufficiently clear by trying to be brief (the difficulty being that “very complex” is really shorthand for something involving a limiting process), so let me be less brief.
First: Suppose you have a list of mutually exclusive hypotheses H1, H2, etc., with probabilities p1, p2, etc. List them in increasing order of complexity. Then the sum of all the pj is finite, and therefore pj → 0 as j → ∞. Hence, “very complex hypotheses (in this list) have to be very improbable” in the following sense: for any probability p, however small, there’s a level of complexity C such that every hypothesis from your list whose complexity is at least C has probability smaller than p.
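A counting version of the same point makes the finiteness explicit:

```latex
\sum_{j} p_j \le 1
\;\Longrightarrow\;
\#\{\, j : p_j \ge \varepsilon \,\} \;\le\; \left\lfloor 1/\varepsilon \right\rfloor ,
```

so for every ε > 0 only finitely many hypotheses in the list can have probability at least ε, and past some complexity level all of them fall below ε, whatever ordering by complexity you chose.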
That doesn’t quite mean that very complex hypotheses have to be improbable. Indeed, you can construct very complex high-probability hypotheses as very long disjunctions. And since p and ~p have about the same complexity for any p, it must in some sense be true that about as many very complex propositions have high probabilities as have low probabilities. (So what I said certainly wasn’t quite right.)
However, I bet something along the following lines is true. Suppose you have a probability distribution over propositions (this is for generating them, and isn’t meant to have anything directly to do with the probability that each proposition is true), and suppose we also assign all the propositions probabilities in a way consistent with the laws of probability theory. (I’m assuming here that our class of propositions is closed under the usual logical operations.) And suppose we also assign all the propositions complexities in any reasonable way. Define the essential complexity of a proposition to be the infimum of the complexities of propositions that imply it. (I’m pretty sure it’s always attained.) Then I conjecture that something like this is both true and fairly easy to prove: for any fixed probability level q, as C → ∞, if you generate a proposition at random (according to the “generating” distribution) conditional on its essential complexity being at least C, then Pr(its probability ≥ q) tends to 0.
Sorry, will put this on hold for a bit—it requires some thinking and I don’t have time for it at the moment...
No problem!