The complexity of “A&B&C&D” is roughly equal to the complexity of “A||B||C||D”, but we know for certain that the former hypothesis can never be more probable than the latter,
I think you are trying to make formal arguments based on informal definitions, leading to confusion. KC refers to program lengths, but if A,B,C are programs, then what is A||B||C? More careful definitions would lead to a more traditional conclusion.
For one way to formalize the A/B/C/D point, imagine you are trying to solve the UCI Census Income problem, where one attempts to predict whether a person’s income is greater than 50k based on various other factors. Let A/B/C/D refer to binary predictors such as (is-married?), (is-religious?), (has-college-degree?), and (is-military?). The goal is to find a combination of these that predicts income well. Now it is of course true that (A||B||C||D) is more likely than (A&B&C&D) to hold for any given person. But there is no prior reason to believe that the OR-hypothesis has more predictive power than the AND-hypothesis, just as a KC-style analysis would indicate.
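Here is a toy sketch of what I mean (synthetic data standing in for the real UCI table, with a completely made-up generative story; the attribute names are just the four above). It prints, for each rule, how many people the rule selects and how accurately the rule predicts income; the much larger coverage of the OR-rule does not by itself make it the better predictor.

```python
import random

random.seed(0)

def make_person():
    # Invented generative story, purely for illustration (not the real census data).
    a = random.random() < 0.5    # is-married?
    b = random.random() < 0.5    # is-religious?
    c = random.random() < 0.3    # has-college-degree?
    d = random.random() < 0.1    # is-military?
    # Suppose income depends mainly on the degree, plus noise.
    over_50k = random.random() < (0.6 if c else 0.15)
    return a, b, c, d, over_50k

people = [make_person() for _ in range(10_000)]

def report(name, rule):
    selected = sum(rule(a, b, c, d) for a, b, c, d, _ in people)
    correct = sum(rule(a, b, c, d) == over_50k for a, b, c, d, over_50k in people)
    print(f"{name}: selects {selected} people, predictive accuracy {correct / len(people):.3f}")

report("A&B&C&D",    lambda a, b, c, d: a and b and c and d)
report("A||B||C||D", lambda a, b, c, d: a or b or c or d)
```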
I think you are trying to make formal arguments based on informal definitions, leading to confusion. KC refers to program lengths, but if A,B,C are programs, then what is A||B||C?
Whaa? I found that paragraph more confused than anything I wrote :-) Kolmogorov complexity is defined for arbitrary bit strings, which need not be programs but can be, say, statements about the world expressed in some formal language. Yeah, the definition of KC quantifies over programs that generate those bit strings; so what.
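To spell out the definition I have in mind: fix a universal machine U; then for any finite bit string x, K_U(x) = min{ ℓ(p) : U(p) = x }, where p ranges over programs for U and ℓ(p) is the length of p in bits. The string x that gets assigned a complexity can be anything; only the p’s inside the minimization are programs.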
Cannot parse your second paragraph because I don’t understand why “predictive power” should be related to prior probability of a hypothesis.
why “predictive power” should be related to prior probability of a hypothesis.
To solve the learning problem as described, you must ask: what is the prior probability that a given rule/hypothesis will correctly predict income? It is trivially true that the OR-hypothesis selects a larger number of people, but there is no reason to believe it will more accurately predict income.
Since you don’t buy the KC idea, do you also refuse to accept the more general idea of capacity control/regularization/MDL as a (the) way to prevent overfitting and achieve generalization? In the standard setting of the learning problem, it seems inevitable that some method of penalizing complexity is necessary for generalization.
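A minimal sketch of what I mean by penalizing complexity, in the MDL spirit (all the rule names, bit counts, and data here are made up, and the two-part code is the crudest possible one): score every candidate rule by the bits needed to describe the rule plus the bits needed to describe its mistakes on the training data, and keep the cheapest.

```python
import math

def description_cost(rule_bits, n_errors, n_examples):
    # Crude two-part code: bits to write the rule down, plus bits to say how
    # many training examples it misclassifies and which ones they are.
    return (rule_bits
            + math.log2(n_examples + 1)
            + math.log2(math.comb(n_examples, n_errors)))

def mdl_select(candidates, data):
    # candidates: list of (name, rule_bits, predict_fn); data: (features, label) pairs.
    scored = []
    for name, rule_bits, predict in candidates:
        errors = sum(predict(x) != y for x, y in data)
        scored.append((description_cost(rule_bits, errors, len(data)), name))
    return min(scored)

# Made-up usage: toy data whose label simply equals the first feature.
data = [((i % 2, i % 5), i % 2) for i in range(100)]
candidates = [
    ("always-predict-0",   2,   lambda x: 0),     # very short rule, ~50% errors
    ("copy-first-feature", 10,  lambda x: x[0]),  # short rule, no errors
    ("huge-lookup-table",  500, lambda x: x[0]),  # same predictions, much longer description
]
print(mdl_select(candidates, data))  # the short, accurate rule wins
```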
I thought about it some more and it seems you’re right. In learning problems we need some weighting of hypotheses to prevent overfitting; description length has no obvious downsides, so we can just use it and be happy. Now I just need to shoehorn Islam into the “learning problem” framework, to understand why our prior for it should be low...
This isn’t about the prior, though.