Localized theories and conditional complexity
Suppose I hand you a series of data points without providing the context. Consider the theory v = a*t for t<<1, v = b for t>>1. Without knowing anything a priori about the shapes of the curves, you need enough data to confirm that v follows the right lines in both limits, since every bit of complexity has to be justified. Here we have two one-parameter curves, so we need at least two data points to pin down the slope a and the plateau b, plus at least a couple more to check that the data really follow this shape.
This is what I’ll call a completely local theory – see data, fit curve. Dealing with problems at this level does not leave much room for human bias or error, but it also does not allow for improvement by including background knowledge.
Now consider the case where v = velocity of a rocket sled, a = thrust/mass of sled, and b = sqrt(thrust/(1/2*rho*Cd*Af)). If you have a theory explaining rockets and aerodynamics, the equations v = a*t and v = b are just the limiting cases for small and large t. In this case, you only need two data points to find a and b, since you already know the shape (over the full range) from solving the differential equations. If you understand the aerodynamics well enough, and know the thrust, shape and mass of the sled, you don’t even need to do the experiment! The “conditional complexity” is 0, since the curve is directly predicted from what we already know. This is the magic of keeping track of the dependencies between theories.
We can take this a step further and derive a theory of aerodynamics from a theory of air molecules, and so on until we have one massively connected theory of everything (TOE).
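To make “you already know the shape” concrete, here is a minimal sketch. It assumes constant thrust and purely quadratic drag, so the equation of motion is dv/dt = a*(1 - (v/b)^2) with v(0) = 0, whose solution is v(t) = b*tanh(a*t/b). The function name and the numerical values of a and b are illustrative assumptions, not measurements.

```python
import math

def sled_velocity(t, a, b):
    """Closed-form solution of dv/dt = a * (1 - (v / b)**2) with v(0) = 0.

    a: initial acceleration (thrust / mass)
    b: terminal velocity (sqrt(thrust / (0.5 * rho * Cd * Af)))
    """
    return b * math.tanh(a * t / b)

# Illustrative (made-up) parameters.
a, b = 2.0, 5.32

# Small t: the curve is indistinguishable from v = a * t.
print(sled_velocity(1e-5, a, b), a * 1e-5)

# Large t: the curve has flattened out at v = b.
print(sled_velocity(1e3, a, b), b)
```

With the shape fixed by the background theory, one measurement deep in each regime is enough to read off a and b; nothing is left over to spend extra data points on.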
Now step back to the beginning. If all I tell you is that when t = 1e-5, v = 2e-5 and when t = 1e-3, v = 2e-3, you’re going to come up with the equation v = 2*t. If someone, with no further information, suggested that v = 2*t was only a small-t approximation, and that for large t, v = 5.32, you’d think that he’s nutso (and rightfully so), with all that unnecessary complexity.
As wannabe Bayesians, we need to update on all the evidence, so we’re almost never trying to fit data without knowing what it means. We prefer globally simple theories, not theories where each local section is simple but the sections don’t fit together.
I suspect that one of the main reasons people fail to understand or accept Occam’s razor is that they try to apply it to theories locally and then notice that by importing information from a more general theory, they can do better. Of course you do better with more information than you do with a wisely chosen ignorance prior. You need to apply Occam’s razor to the whole bundle. Since all of the background theory is the same across the candidates, you can reduce this to the entropy of the local theories that is left after conditioning on the background theory.
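One way to write this down, as a sketch rather than a formal result: borrow Kolmogorov-complexity notation (an assumption on my part; the argument above only speaks of “entropy”), with B for the shared background theory and L_1, L_2 for two competing local theories. The chain rule holds only up to logarithmic terms.

```latex
% Complexity of the whole bundle splits into background plus conditional part
K(B, L_i) \approx K(B) + K(L_i \mid B)

% The shared background cancels when comparing two local theories
K(B, L_1) - K(B, L_2) \approx K(L_1 \mid B) - K(L_2 \mid B)
```

So a local theory that looks complicated on its own can still win, provided K(L | B) is small, which is the rocket sled case where the conditional complexity is 0.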
When Eliezer says that he doesn’t expect humans to have simple utility functions, it’s not because it is a magical case where Occam’s razor doesn’t apply. It’s that it would take more bits overall to explain evolution creating a simple utility function than it would to explain evolution creating a particular locally complex utility function. This is very different from concluding that Occam’s razor doesn’t apply to real life. If Occam’s razor seems to be giving you bad answers, you’re doing it wrong.
What does this imply for the future? Those with poor memories and/or a poor understanding of history will answer “much like the present”, based on a single point and the locally simplest fit. You can find people one step up from that who notice improvements over time and fit them to a line, which again isn’t a bad guess if that’s all you know (you almost always know more; actually using the rest of your information efficiently is the trick). Another step ahead and you get people who hypothesize exponential growth, based on their understanding of improvements feeding improvements, or at least on a wider-spanning dataset. This is where you’ll find Ray Kurzweil and the ‘accelerating improvement singularity’ folk. The last step I know of is where you’ll find Eliezer Yudkowsky and other ‘hard takeoff’ folks. This is where you say “yes, I know my theory is locally more complex. I know that it isn’t obvious from looking at the curve, quite the opposite. However, my theory is less complex after conditioning on the rest of my knowledge that doesn’t show up on this plot, and for that reason, I believe it to be true”.
This might sound like saying “Emeralds are grue, not green”. But while “X until arbitrary date, then Y” fares worse when Occam’s razor is applied locally, if our theory of color indicated a special ‘turning point’, then we would have to conclude “Emeralds are grue, not green”, and we would conclude this because of Occam’s razor, not in spite of it.
I chose this example because it is important and well known at LW, but not for lack of other examples. In my experience, this is a very common mistake, even for otherwise intelligent individuals. This makes getting it right quite a fun rationality ‘superpower’.
One thing keeps bothering me: while Bayesianism is the best epistemological theory we have, nobody is even remotely close to following Bayes’ rule in its entirety. It just isn’t usable for anything except the most trivial scenarios, and it’s not just a matter of throwing more computational power at it; see Quine’s holism argument for the scale of the problem.
Add any heuristics on top of it, like all sorts of assumed independence, and it’s no longer Bayes’ rule; you might as well abandon it and use something completely different.
I know it’s only vaguely related to the post, but it keeps bothering me that we pretend we’re Bayesians, while we’re not really in any meaningful sense.
Hm, I’m not sure what Quine’s holism argument does for your point. Quine says that only “science” as a whole can be tested, since you can keep making arbitrary assumptions to salvage any theory. A canonical example might be: they didn’t throw out Newton’s Law of Gravitation when Uranus turned out to have an irregular orbit; they assumed a massive object beyond it (which is how Neptune was found). But why not reject the law of gravitation?
But far from being an example of the intractability of Bayes’ Theorem, it’s actually an example of how Bayesianism can resolve such problems: you build a belief network that connects the theories to the predicted observations. As you encounter new data, you update based on a well-defined rule (based on priors and likelihood ratios), which can be equivalently expressed as Jaynes’s Maximum Entropy Method or the KL divergence minimization method.
This process allows you to determine whether a new observation requires you to believe you made an observation error, whether you should expect an additional observation (like the above case with Uranus), or whether you need to make fundamental revisions to the theory.
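As a toy illustration of that update rule, here is a minimal sketch for the irregular-orbit example. The three hypotheses mirror the three outcomes just listed; every prior and likelihood below is a made-up number chosen only to show the mechanics, and the function name is hypothetical.

```python
def bayes_update(priors, likelihoods):
    """Return posteriors: prior times likelihood, renormalized to sum to 1."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}

# Made-up priors over the three explanations of the anomalous orbit.
priors = {
    "observation error": 0.30,
    "unseen massive body": 0.60,
    "law of gravitation is wrong": 0.10,
}

# Made-up likelihoods of the evidence "the anomaly persists across
# many independent observations" under each hypothesis.
likelihoods = {
    "observation error": 0.01,
    "unseen massive body": 0.50,
    "law of gravitation is wrong": 0.50,
}

posteriors = bayes_update(priors, likelihoods)
for hypothesis, p in posteriors.items():
    print(f"{hypothesis}: {p:.3f}")
```

A persistent anomaly mostly rules out observation error here; telling the other two hypotheses apart takes evidence they predict differently, such as actually looking for the extra body.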
Of course, that’s still just approximating an ideal Bayesian, and conditional independence isn’t a required part of it, just a prior you can start from. But Quine’s holism argument doesn’t show any shortcoming of the use of Bayesian inference in science.
Holism means that for virtually every observation you have to update everything in your network, related or not. You cannot have nice networks with a bunch of small elegant compartments—it will be one huge mess.
My point is that actual Bayesianism is so insanely far beyond the capabilities of any imaginable being, and Bayesianism with any assumed independence is no longer correct Bayesianism, that it’s not really proper to pretend we’re Bayesians in any practical sense.
Isn’t that like saying that Newton’s adherents weren’t “actual Newtonians” because they assumed away enough bodies to make their computations tractable?
Agreed. I changed it to “wannabe” for you :p