So one of the first thoughts I had when reading this was whether you can model any Radical Probabilist as a Bayesian agent that has some probability mass on “my assumptions are wrong”, and will have that probability mass increase so that it questions its assumptions over a “reasonable timeframe”, for some definition of reasonable.
For the case of coin flips, there is a clear assumption in the naive model that the coin flips are independent of each other, which can be fairly simply expressed as $P(\text{flip}_i = H \mid \text{flip}_j = H) = P(\text{flip}_i = H \mid \text{flip}_j = T) \;\forall j < i$. In the case of the coin that flips 1 heads, 5 tails, 25 heads, 125 tails, just evaluating $j = i-1$ through the 31st flip gives P(H | last flip heads) = 24⁄25 and P(H | last flip tails) = 1⁄5, which is unlikely under independence at p ≈ 1e-4. That is approximately the difference in Bayesian weight between the hypothesis H1: the coin flips heads 26⁄31 of the time (P(E|H1) ≈ 1e-6) and H0: the coin flips heads unpredictably, 1/2 the time (P(E|H0) ≈ 4e-10), with H0 being the better hypothesis in the long run until you expand your hypothesis space.
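For concreteness, here's a minimal sketch reproducing those numbers. It's my own reconstruction and assumes the first 31 flips are exactly 1 heads, then 5 tails, then 25 heads:

```python
# Reproduce the conditional frequencies and the two likelihoods quoted above,
# assuming the first 31 flips are 1 heads, 5 tails, 25 heads.
flips = "H" + "T" * 5 + "H" * 25   # 31 flips

# Conditional frequencies, evaluating only j = i - 1
pairs = list(zip(flips, flips[1:]))
h_after_h = sum(prev == "H" and cur == "H" for prev, cur in pairs)
n_after_h = sum(prev == "H" for prev, _ in pairs)
h_after_t = sum(prev == "T" and cur == "H" for prev, cur in pairs)
n_after_t = sum(prev == "T" for prev, _ in pairs)
print(f"P(H | last flip H) = {h_after_h}/{n_after_h}")   # 24/25
print(f"P(H | last flip T) = {h_after_t}/{n_after_t}")   # 1/5

# Likelihood of the full sequence under the two i.i.d. hypotheses
n_heads, n_tails = flips.count("H"), flips.count("T")
p_e_given_h1 = (26 / 31) ** n_heads * (5 / 31) ** n_tails   # ~1e-6
p_e_given_h0 = 0.5 ** len(flips)                            # ~4e-10
print(f"P(E|H1) ~ {p_e_given_h1:.1e}, P(E|H0) ~ {p_e_given_h0:.1e}")
```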
So in this case, the “I don’t have the hypothesis in my space” hypothesis actually wins out right around the 30th-32nd flip, possibly about the same time a human would be identifying the alternate hypothesis. That seems helpful!
However, this relies on the fact that this specific hypothesis has a single very clear assumption, and that there is a single very clear calculation that can be done to test that assumption. Even in this case, though, the “independence of all coin flips” assumption makes a bunch more predictions, like that coin flips two apart are independent, and so on. Calculating all of these may be theoretically possible, but it's arduous in practice, and it would give rise to far too much false evidence. For example, in real life there are often distributions that look a lot like normal distributions in the general sense that over half the data is within one standard deviation of the mean and 90% of the data is within two standard deviations, but where an actual hypothesis test of whether the data is normally distributed will point out ways in which it isn't exactly normal (only 62% of the data is in this region, not 68%!, and so on).
It seems like the idea of having a specific hypothesis in your space labeled “I don't have the right hypothesis in my space” can work okay under the following conditions (see the sketch after this list):
1. You have a clearly stated assumption which defines your current hypothesis space.
2. You have a clear statistical test which shows when data doesn't match your hypothesis space.
3. You know how much data needs to be present for that test to be valid—both in terms of the minimum for it to distinguish itself so you don’t follow conspiracy theories, and something like a maximum (maybe this will naturally emerge from tracking the probability of the data given the null hypothesis, maybe not).
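For the coin example, one way these three conditions could be operationalized (this is purely my own illustration; the threshold, the minimum-sample cutoff, and the choice of Fisher's exact test are arbitrary) is to re-run the independence test as data accumulates and only let the “wrong hypothesis space” alarm fire once the test has enough data to mean anything:

```python
# Hypothetical online monitor: condition 1 is the independence assumption,
# condition 2 is an exact test on the transition table, condition 3 is the
# MIN_TRANSITIONS cutoff. All numbers are arbitrary choices for illustration.
from scipy.stats import fisher_exact

flips = "H" + "T" * 5 + "H" * 25 + "T" * 125
MIN_TRANSITIONS = 10   # condition 3: don't trust the test on too little data
ALARM_P = 1e-3         # condition 2: when to say "my hypothesis space is wrong"

counts = {("H", "H"): 0, ("H", "T"): 0, ("T", "H"): 0, ("T", "T"): 0}
for i in range(1, len(flips)):
    counts[(flips[i - 1], flips[i])] += 1
    if i < MIN_TRANSITIONS:
        continue
    table = [[counts[("H", "H")], counts[("H", "T")]],
             [counts[("T", "H")], counts[("T", "T")]]]
    _, p = fisher_exact(table)
    if p < ALARM_P:
        print(f"after flip {i + 1}: independence looks wrong (p ~ {p:.1e})")
        break
```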
I have no idea whether these conditions are reasonable “in practice”, whatever that means, so I'm not really clear whether this framework is useful, but it's what I thought of, and I want to share even negative results in case other people had the same thoughts.
Yeah, I don’t think this can be generalized to model a radical probabilist in general, but it does seem like a relevant example of “extra-bayesian” (but not totally non-bayesian) calculations which can be performed to supplement Bayesian updates in practice.
It seems like you don’t need statistical tests, and can instead include a special “Socratic” hypothesis (which just says “I don’t know”) in your hypothesis space. This hypothesis can assign some fixed or time-varying probability to any observation (e.g. yielding an unnormalized probability distribution by saying P(X=r) = epsilon for any real number r, assuming all observations X are real-valued). I wonder if that has been explored.
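As a rough sketch of what that might look like (my own construction, not something from the discussion; the particular epsilon and the competing hypotheses are arbitrary), the “I don't know” hypothesis just enters the mixture as one more likelihood function, it simply isn't normalized:

```python
# Drop a "Socratic" hypothesis that assigns a fixed epsilon to every observation
# into an ordinary Bayesian mixture next to two normalized coin hypotheses.
# Whether it ever gains weight depends entirely on the chosen epsilon.
flips = "H" + "T" * 5 + "H" * 25           # the 31-flip example from above

EPSILON = 0.3                              # P(any single observation) under "I don't know"
hypotheses = {
    "fair coin":    lambda flip: 0.5,
    "biased 26/31": lambda flip: 26 / 31 if flip == "H" else 5 / 31,
    "socratic":     lambda flip: EPSILON,  # unnormalized by design
}
posterior = {name: 1 / len(hypotheses) for name in hypotheses}

for flip in flips:
    # Standard Bayesian update: weight each hypothesis by its likelihood for
    # this flip, then renormalize across the (partly unnormalized) mixture.
    for name, likelihood in hypotheses.items():
        posterior[name] *= likelihood(flip)
    total = sum(posterior.values())
    posterior = {name: weight / total for name, weight in posterior.items()}

print(posterior)   # how much mass "I don't know" ends up with after 31 flips
```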
If you don't have statistical tests, then I don't see how you have a principled way to update away from your structured hypotheses, since the structured space will always give strictly better predictions than the Socratic hypothesis.
> the structured space will always give strictly better predictions than the Socratic hypothesis.
I don't think so… suppose in the H/T example, the Socratic hypothesis says that P(H) = P(T) = 3. Then it will always do better than any hypothesis that has to be normalized.
I’m not sure what you mean by “structured hypotheses” here though...
In the case where you get 1 heads, 5 tails, 25 heads, etc., and you are working with the assumption that the flips are independent, the Bayesian hypothesis will never converge, but it will actually give better predictions than the Socratic hypothesis most of the time. In particular, when it's halfway through one of the powers of five, it will assign P > 0.5 to the correct prediction every time. And if you expand that to allow the flip to depend on the previous flip, it will get to a hypothesis (the next flip will be the same as the last one) that actually performs VERY well, and does converge.
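A small simulation along these lines (my own construction; the 0.9/0.1 softening of the “repeat the last flip” hypothesis and the epsilon value are arbitrary choices) would compare total log scores of the three predictors on the 1, 5, 25, 125 sequence:

```python
# Compare sequential predictions on the 1 H, 5 T, 25 H, 125 T sequence.
# "laplace" = independent flips with unknown bias (rule of succession),
# "repeat"  = the flip-depends-on-the-previous-flip hypothesis,
# "socratic" = a fixed epsilon for every observation.
import math

flips = "H" + "T" * 5 + "H" * 25 + "T" * 125
EPSILON = 0.5   # what the Socratic hypothesis assigns to any single observation

log_score = {"laplace": 0.0, "repeat": 0.0, "socratic": 0.0}
heads_seen = 0
for i, flip in enumerate(flips):
    # Independent flips with unknown bias: P(H) = (heads so far + 1) / (flips so far + 2)
    p_heads = (heads_seen + 1) / (i + 2)
    p_laplace = p_heads if flip == "H" else 1 - p_heads

    # "The next flip equals the last flip", softened so it is never dogmatic
    p_repeat = 0.5 if i == 0 else (0.9 if flip == flips[i - 1] else 0.1)

    log_score["laplace"] += math.log(p_laplace)
    log_score["repeat"] += math.log(p_repeat)
    log_score["socratic"] += math.log(EPSILON)
    heads_seen += flip == "H"

# Higher (less negative) total log score means better sequential predictions;
# the dependent-flips "repeat" hypothesis should come out well ahead here.
for name, score in sorted(log_score.items(), key=lambda kv: -kv[1]):
    print(f"{name:9s} total log score: {score:8.1f}")
```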
By “structured” I mean that I have a principled way of determining P(Evidence|Hypothesis); with the Socratic hypothesis I only have unprincipled ways of determining it.
I’m not sure what you mean by normalized, unless you mean that the Socratic hypothesis always gives probability 1 to the observed evidence, in which case it will dominate even the correct hypothesis if there is uncertainty.
It seems like you could get pretty far with this approach, and it starts to look pretty Bayesian to me if I update epsilon based on how predictable the world seems to have been, in general, so far.
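One hedged way to make that concrete (purely my own guess at what “update epsilon” could mean): set epsilon to something like the average per-observation probability the agent's mixture has actually achieved so far, e.g. the geometric mean of its past predictive probabilities:

```python
import math

def updated_epsilon(past_predictive_probs, default=0.5):
    """Set epsilon to the geometric mean of the probabilities the agent's
    mixture assigned to past observations, i.e. how predictable the world
    has looked so far. This is one possible rule, not a standard one."""
    if not past_predictive_probs:
        return default
    log_mean = sum(math.log(p) for p in past_predictive_probs) / len(past_predictive_probs)
    return math.exp(log_mean)

# e.g. a world that has been fairly predictable so far
print(updated_epsilon([0.9, 0.8, 0.95, 0.7]))   # ~0.83
```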