Localized theories and conditional complexity
Suppose I hand you a series of data points without providing any context. Consider the theory v = a*t for t << 1, v = b for t >> 1. Knowing nothing a priori about the shapes of the curves, you need enough data to confirm that v actually follows the right line in each of the two limits, since that complexity must be justified. Here we have two one-parameter curves, so we need at least two data points to pin down the slope and the offset, plus at least a couple more to make sure the data follow the right shapes.
This is what I’ll call a completely local theory – see data, fit curve. Dealing with problems at this level does not leave much room for human bias or error, but it also does not allow for improvement by including background knowledge.
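To make the "see data, fit curve" step concrete, here is a minimal sketch of purely local fitting. The data and the regime cutoffs are made up for illustration; only numpy is assumed:

```python
import numpy as np

# Hypothetical noisy measurements (t, v); units arbitrary.
t = np.array([0.01, 0.03, 0.05, 3.0, 5.0, 8.0])
v = np.array([0.021, 0.059, 0.102, 1.41, 1.39, 1.42])

small = t < 0.1   # points treated as the t << 1 regime
large = t > 1.0   # points treated as the t >> 1 regime

# One-parameter fit in each regime: a line through the origin, and a constant.
a = np.sum(t[small] * v[small]) / np.sum(t[small] ** 2)  # least-squares slope for v = a*t
b = v[large].mean()                                      # least-squares level for v = b

print(f"a ≈ {a:.2f}, b ≈ {b:.2f}")  # the local theory: v ≈ a*t for small t, v ≈ b for large t
```

Nothing here knows what t and v mean; the curve shapes and the regime boundaries have to be bought with data.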
Now consider the case where v is the velocity of a rocket sled, a = thrust/mass of the sled, and b = sqrt(thrust/(1/2*rho*Cd*Af)). If you have a theory explaining rockets and aerodynamics, the equations v = a*t and v = b are just the limiting cases for small and large t. In this case, you only need two data points to find a and b, since you already know the shape over the full range from solving the differential equation. If you understand the aerodynamics well enough, and know the shape and mass of the sled, you don’t even need to do the experiment! The “conditional complexity” is 0, since the result is directly predicted from what we already know. This is the magic of keeping track of the dependencies between theories.
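As a sanity check on those limits, here is a sketch assuming constant thrust, constant mass, and quadratic drag, so that m*dv/dt = thrust - 1/2*rho*Cd*Af*v^2 with v(0) = 0, which has the closed-form solution v(t) = b*tanh(a*t/b). All parameter values are invented for illustration:

```python
import numpy as np

# Made-up sled parameters (SI units), assuming constant thrust and mass.
thrust, m = 5000.0, 200.0            # thrust (N), sled mass (kg)
rho, Cd, Af = 1.2, 0.8, 1.0          # air density, drag coefficient, frontal area

a = thrust / m                                 # small-t limit: v ≈ a*t
b = np.sqrt(thrust / (0.5 * rho * Cd * Af))    # large-t limit (terminal velocity): v ≈ b

# Closed-form solution of m*dv/dt = thrust - 1/2*rho*Cd*Af*v^2, v(0) = 0.
v = lambda t: b * np.tanh(a * t / b)

for t in (0.01, 60.0):
    print(f"t={t:>5}: v={v(t):7.2f}   a*t={a*t:8.2f}   b={b:7.2f}")
```

The full curve interpolates between the two limits, which is exactly why, once you trust the background theory, a couple of points (or none) suffice.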
We can take this a step further and derive the theory of aerodynamics from a theory of air molecules, and so on, until we have one massively connected TOE (theory of everything).
Now step back to the beginning. If all I tell you is that when t = 1e-5, v = 2e-5, and when t = 1e-3, v = 2e-3, you’re going to come up with the equation v = 2*t. If someone, with no further information, suggested that v = 2*t was only a small-t approximation, and that for large t, v = 5.32, you’d think he was nutso (and rightfully so), with all that unnecessary complexity.
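For concreteness, a quick check (numpy again, using just the two points from the text) that the locally simplest fit really is v = 2*t with no offset:

```python
import numpy as np

# The two points given above: (t, v) = (1e-5, 2e-5) and (1e-3, 2e-3).
t = np.array([1e-5, 1e-3])
v = np.array([2e-5, 2e-3])

slope, intercept = np.polyfit(t, v, 1)
print(slope, intercept)  # -> 2.0 and ~0.0: the line v = 2*t, nothing more
```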
As a wannabe Bayesian, you need to update on all the evidence, so we’re almost never trying to fit data without knowing what it means. We prefer globally simple theories, not theories where each local section is simple but the sections don’t fit together.
I suspect that one of the main reasons people fail to understand or accept Occam’s razor is that they apply it to theories locally and then notice that, by importing information from a more general theory, they can do better. Of course you do better with more information than with a wisely chosen ignorance prior. You need to apply Occam’s razor to the whole bundle. Since all of the background theory is shared, this reduces to comparing the complexity of the local theories that remains after conditioning on the background theory.
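One way to make that precise, sketched in minimum-description-length terms (B and T1, T2 are my labels, not part of the original post): if B is the shared background theory and T1, T2 are competing local theories, then the total cost of a bundle is roughly K(B and T1) ≈ K(B) + K(T1 | B), and likewise for T2. The K(B) term is identical for every candidate, so ranking bundles by total complexity is the same as ranking local theories by the conditional term K(T | B).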
When Eliezer says that he doesn’t expect humans to have simple utility functions, it’s not because this is a magical case where Occam’s razor doesn’t apply. It’s that it would take more bits overall to explain evolution creating a simple utility function than to explain evolution creating a particular locally complex utility function. That is very different from concluding that Occam’s razor doesn’t apply to real life. If Occam’s razor seems to be giving you bad answers, you’re doing it wrong.
What does this imply for the future? Those with poor memories and/or a poor understanding of history will answer “much like the present,” based on a single point and the locally simplest fit. One step up from that, you find people who notice improvement over time and fit a line to it, which again isn’t a bad guess if that’s all you know (you almost always know more; actually using the rest of your information efficiently is the trick). Another step up, you get people who hypothesize exponential growth, based on their understanding of improvements feeding improvements, or at least on a wider-spanning dataset. This is where you’ll find Ray Kurzweil and the ‘accelerating improvement’ singularity folk. The last step I know of is where you’ll find Eliezer Yudkowsky and other ‘hard takeoff’ folks. This is where you say, “Yes, I know my theory is locally more complex; I know that it isn’t obvious from looking at the curve, quite the opposite. However, my theory is less complex after conditioning on the rest of my knowledge that doesn’t show up on this plot, and for that reason I believe it to be true.”
This might sound like saying “emeralds are grue, not green,” and while “X until some arbitrary date, then Y” fares worse when applying Occam’s razor locally, if our theory of color indicated a special ‘turning point’, then we would have to conclude that emeralds are grue, not green, and we would conclude this because of Occam’s razor, not in spite of it.
I chose this example because it is important and well known at LW, not for lack of other examples. In my experience, this is a very common mistake, even among otherwise intelligent people, which makes getting it right quite a fun rationality ‘superpower’.