I think a sensible step would be to treat a human as a black-box statistical learning machine which can produce patterns (systems, theories, etc.) that simplify the data and can be used to make predictions. A human has no special quality that distinguishes it from automated approaches such as a support vector machine. These patterns can be seen as necessary approximations, since making predictions directly from the full space of possible hypotheses may be impractical (particularly as the priors are unknown).
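As a minimal sketch of this ‘black box’ view (the classifier and the toy data are my own illustration, not anything from the discussion): a support vector machine is just a function from observed data to a pattern that can then be used to predict.

```python
# Sketch: a learner as a black box that turns data into a predictive pattern.
# The dataset and model choice are illustrative assumptions only.
from sklearn.svm import SVC

# Observed data: inputs and labels (toy example).
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]

black_box = SVC(kernel="rbf")   # could equally be a human, in this framing
black_box.fit(X, y)             # compress the data into a pattern

print(black_box.predict([[0.9, 0.8]]))  # use the pattern to predict a new case
```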
One way of gaining confidence in a particular output from these black boxes is to use some additional prior over the likelihood of different theories (their elegance, if you like), but I’m not sure to what extent such a prior can rationally be determined, i.e. what the pattern of likely theories is, of which simplicity is one factor.
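As one hedged illustration of what such a prior might look like in practice (a toy example of my own, not anything proposed in the comment above): compare candidate theories of different complexity on the same data, with an explicit penalty term standing in for the ‘elegance’ prior, e.g. a BIC-style score over polynomial models.

```python
# Toy illustration: an explicit complexity penalty acting as a prior over theories.
# The data, the candidate models (polynomial degrees), and the BIC penalty are assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2 * x + 0.1 * rng.standard_normal(30)     # data secretly generated by a simple theory

n = len(x)
for degree in (1, 5, 9):                       # candidate "theories" of growing complexity
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    rss = np.sum(residuals ** 2)
    k = degree + 1                             # number of parameters
    bic = n * np.log(rss / n) + k * np.log(n)  # lower is better; second term is the penalty
    print(degree, round(bic, 1))
```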
Another approach is the scientific method: model with a subset of the data and then validate with additional data (a common AI approach to minimise overfitting). I am not sufficiently knowledgeable in statistical learning theory to know how (or whether) such approaches can be shown to provably improve predictive accuracy, but I think this book covers some of it (other Less Wrong readers are likely to know more).
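A minimal sketch of that model-then-validate loop; the dataset, model, and split ratio are arbitrary choices of mine, and the point is only the held-out validation step.

```python
# Sketch: fit on one subset, judge predictive accuracy on data the model never saw.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = SVC().fit(X_train, y_train)   # "theorise" from a subset
print(model.score(X_test, y_test))    # "validate" against fresh observations
```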
Culturally, we also apply a prior to the black box itself: when Einstein proposed theories, people rationally assumed they were more likely to be correct because, given the limited data, his black box seemed to suffer from less overfitting. Of course, we have few samples and don’t know the variance, so he could just have been lucky.
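To put a rough number on the ‘could just have been lucky’ worry (all figures below are made up purely to show the shape of the calculation):

```python
# How surprising is a short successful track record under the "pure luck" hypothesis?
# The base rate and the number of successes are invented for illustration.
base_rate = 0.1      # assumed chance a bold theoretical prediction pans out by luck
successes = 3        # assumed number of confirmed bold predictions

p_luck = base_rate ** successes
print(p_luck)        # 0.001 -- unlikely, but with so few samples, not impossible
```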
Another perspective: if we cannot obtain any more information on a subject, is it still valuable to try to model it? In effect, the data is the answer, and predictive power is irrelevant, as no more data can be obtained.
“but I’m not sure to what extent such a prior can rationally be determined, i.e. the pattern of likely theories, of which simplicity is a factor.”

A theory that takes one more bit to state is less than half as likely. Either that, or all finite theories have infinitesimal likelihoods. I can’t tell you how much less than half, and I can’t tell you what compression algorithm you’re using. Trying to program the compression algorithm only means that the language you just wrote it in is the algorithm.
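To spell out the step from ‘one more bit’ to ‘infinitesimal likelihoods’ (my own reconstruction; the assumption that the prior depends only on a theory’s length in bits is mine, not the commenter’s):

```latex
% Sketch of the normalisation argument, assuming p(theory of length l) = p_0 r^l.
% There are roughly 2^l distinct theories of length l, so the total prior mass is
\[
  \sum_{\ell=1}^{\infty} 2^{\ell}\, p_0\, r^{\ell}
  \;=\; p_0 \sum_{\ell=1}^{\infty} (2r)^{\ell},
\]
% which converges only if 2r < 1, i.e. r < 1/2: each extra bit must cost more than a
% factor of two. If instead r >= 1/2 the sum diverges, and normalising it forces every
% individual finite theory down to zero (infinitesimal) probability.
```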
Technically, the extra-bit claim only holds exactly as the amount of data goes to infinity, but that’s equivalent to the point about the compression algorithm.
I also assign zero probability to anything involving infinities, because otherwise the paradoxes of infinity would break probability and ethics.
That’s the extent to which it can be done.
Interesting.
Is it correct to say that the bit-based prior is a consequence of creating an internally consistent formalisation of the aesthetic heuristic of preferring simpler structures to complex ones?
If so, I was wondering whether it could be extended to reflect other aesthetics. For example, if an experiment produces a single result that is inconsistent with an existing simple physics theory, the simplest theory that explains the data may be to treat this result as an isolated exception; aesthetically, however, we find it more plausible that the exception is evidence of a larger theory of which this result is one instance.
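A toy way to make that trade-off concrete in the bit-counting frame (every number below is an assumption of mine, just to show the shape of the comparison): the ‘theory plus exception’ hypothesis pays a per-anomaly cost, while the ‘larger theory’ pays a one-off cost for its extra structure.

```python
# Toy description-length comparison; all bit costs are invented for illustration.
simple_theory_bits = 1000   # assumed cost of the existing simple theory
exception_bits = 40         # assumed cost of encoding one anomalous result verbatim
larger_theory_bits = 1200   # assumed cost of a richer theory that predicts the anomaly

def exceptions_hypothesis(n_anomalies):
    return simple_theory_bits + n_anomalies * exception_bits

for n in (1, 3, 10):
    print(n, exceptions_hypothesis(n), larger_theory_bits)
# With one anomaly the exception list is shorter (1040 < 1200); as anomalies
# accumulate, the larger theory eventually wins, matching the aesthetic intuition above.
```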
In contrast, when attempting to understand the rules of a human system (e.g. a bureaucracy), a theory that lacks exceptions seems unlikely (“that’s a little too neat”). Indeed, stated informally, the phrase might go “in my experience, that’s a little too neat”, implying that we formulate such priors from patterns learned through experience. In the case of the bureaucracy, this may stem from a probabilistic understanding of the types of system that result from a particular ‘maker’ (i.e. politics).
However, this moves the problem to one of classifying contexts and determining which contexts are relevant. If that classification process is considered part of the theory, it may considerably increase the theory’s complexity, so that theories which ignore context are always preferred. Unless, of course, the theory is complete (incorporating all contexts), in which case the simplest theory may share these contextual models and thus become the universal simplest model. It would therefore not be rational to apply Kolmogorov complexity to a problem in isolation, i.e. probability and reductionism are not compatible.
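One way to state that ‘shared contextual models’ point in the Kolmogorov frame (my own gloss, using standard plain and conditional complexity notation; the inequality holds up to additive logarithmic terms):

```latex
% Describing problems A and B jointly via a shared context C is never much worse than
% describing C once plus each problem relative to it (up to additive logarithmic terms):
\[
  K(A, B) \;\lesssim\; K(C) + K(A \mid C) + K(B \mid C),
\]
% whereas describing A and B in isolation pays for the shared context twice, roughly
% K(A) + K(B). So the "simplest theory" for a problem taken alone can differ from the
% simplest theory once the shared contextual model is counted, which is the point above.
```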