Yeah, the right column should obviously be all 20s. There must be a bug in my code[1] :/
I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.
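Here is a minimal sketch of that maximization (not the buggy code referenced above; the 3x3 grid, the row/column hypotheses, and their numbers are placeholders I am making up): pick the full distribution Q on W that maximizes the weighted sum of P_i(E) log Q(E) over each hypothesis's cells, equivalently minimizes the weighted sum of KL divergences from each hypothesis to the correspondingly coarsened Q, and then only read off probabilities of events in the components' sigma algebras.

```python
# A re-derivation sketch (not the buggy code referenced above): maximize
#   sum_i w_i * sum_{E in partition_i} P_i(E) * log Q(E)
# over full distributions Q on W, i.e. minimize
#   sum_i w_i * KL(P_i || Q coarsened to hypothesis i's partition).
# The 3x3 grid and the two hypotheses' numbers are placeholders, chosen so the
# right column should come out near 20% per cell and the middle column near 0.
import numpy as np
from scipy.optimize import minimize

W = [(r, c) for r in range(3) for c in range(3)]              # 3x3 grid of worlds

rows = [frozenset((r, c) for c in range(3)) for r in range(3)]
cols = [frozenset((r, c) for r in range(3)) for c in range(3)]
h1 = (0.5, list(zip(rows, [1 / 3, 1 / 3, 1 / 3])))            # coarse over rows
h2 = (0.5, list(zip(cols, [0.4, 0.0, 0.6])))                  # coarse over columns

def neg_objective(z, hypotheses):
    q = np.exp(z - z.max()); q /= q.sum()                     # softmax parametrization of Q
    total = 0.0
    for weight, cells in hypotheses:
        for cell, p in cells:
            if p > 0:                                         # 0 * log(.) = 0 convention
                total += weight * p * np.log(sum(q[W.index(w)] for w in cell))
    return -total

res = minimize(neg_objective, np.zeros(len(W)), args=([h1, h2],))
q = np.exp(res.x - res.x.max()); q /= q.sum()
print(q.reshape(3, 3).round(3))   # right column near 0.2 per cell, middle column near 0
```

Restricting this Q to events that appear in (and get positive probability from) some component's sigma algebra gives the partial distribution described above.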
Take the following hypothesis h3:
If I add this into P with weight 10^{-9}, then the middle column is still nearly zero. But I can now ask for the probability of the event in h3 corresponding to the center square, and I get back an answer very close to zero. Where did this confidence come from?
I guess I’m basically wondering what this procedure is aspiring to be. Some candidates I have in mind:
1. Extension to the coarse case of regular hypothesis mixing (where we go from P(w) and Q(w) to aP(w) + (1−a)Q(w))
2. Extension of some kind of Bayesian update-flavored thing where we go to P(w)Q(w) then renormalize
   ETA: P(w)^a Q(w)^{1−a} seems more plausible than P(w)Q(w) (both mixing rules are spelled out right after this list)
3. Some kind of “aggregation of experts who we trust a lot unless they contradict each other”, which isn’t cleanly analogous to either of the above
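To spell out candidates 1 and 2 side by side (just restating the formulas above; a is the mixing weight, and the second rule includes the renormalization):

$$\text{(1)}\;\; M(w) = a\,P(w) + (1-a)\,Q(w) \qquad\qquad \text{(2)}\;\; M(w) = \frac{P(w)^{a}\,Q(w)^{1-a}}{\sum_{w'} P(w')^{a}\,Q(w')^{1-a}}$$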
Even in case 3, the near-zeros are really weird. The only cases I can think of where it makes sense are things like “The events are outcomes of a quantum process. Physics technique 1 creates hypothesis 1, and technique 2 creates hypothesis 2. Both techniques are very accurate, and the uncertainty they express is due to fundamental unknowability. Since we know both tables are correct, we can confidently rule out the middle column, and thus rule out certain events in hypothesis 3.”
But more typically, the uncertainty is in the maps of the respective hypotheses, not in the territory, in which case the middle zeros seem unfounded. And to be clear, the reason it seems like a real issue[2] is that when you add in hypothesis 3 you have events in the middle which you can query, but the values can stay arbitrarily close to zero if you add in hypothesis 3 with low weight.
This maps the credences, but I would imagine that the confidence would not be evenly spread across the boxes. With confidence literally 0, it does not make sense for any credence to stand taller than another, since 1 and 0 would make equal sense. With a minuscule confidence, the foggy hunch does at least point in some direction.
Without h3 it is consistent to have confidence 0 in the middle square. With positive plausibility of h3, the middle square is not “glossed over”: we have some confidence that it might matter. But because h3 is totally useless for credences, those come from the structures of h1 and h2. Thus h1 and h2 are effectively voting for zero despite not caring about it.
Contrast what would happen with an even more trivial hypothesis: a single square covering everything with 100%, or a 9x9 equiprobable hypothesis.
You could also have a “micro detail hypothesis” (actually a 3x3): a 9x9 grid where each 3x3 block is zero everywhere except its bottom-right corner, with the chosen “small square location” being the same in every “big square”. The “big scale” hypotheses do not really mind the “small scale” dragging of the credence around. Thus the small bottom-right square is quite sensitive to the value of the corresponding big square, and the other small squares are relatively insensitive. Mixing two 3x3 resolutions that are orthogonal results in a 9x9 resolution which is sparse (because it is separable). The John Vervaeke meme of “stereoscopic vision” seems to apply. The two 2x2 perspectives are not entirely orthogonal, so the sparsity is not easy to catch.
The point I was trying to make with the partial functions was something like “Yeah, there are 0s, yeah it is bad, but at least we can never assign low probability to any event that any of the hypotheses actually cares about.” I guess I could have made that argument more clearly if I had instead just pointed out that any event in the sigma algebra of any of the hypotheses will have probability at least equal to the probability of that hypothesis times the probability of that event in that hypothesis. Thus the 0s (and the 10^{-9}s) are really coming from the fact that (almost) nobody cares about those events.
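In symbols, that last claim (with w_i the weight/probability on hypothesis h_i, P_i its coarse distribution, and Q the output distribution) is:

$$Q(E) \;\ge\; w_i\,P_i(E) \quad \text{for every event } E \text{ in the sigma algebra of } h_i.$$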
I agree with all your intuition here. The thing about the partial functions is unsatisfactory, because it is discontinuous.
It is trying to be #1, but a little more ambitious. I want the distribution on distributions to be a new type of epistemic state, and the geometric maximization to be the mechanism for converting the new epistemic state to a traditional probability distribution. I think that any decent notion of an embedded epistemic state needs to be closed under both mixing and coarsening, and this is trying to satisfy that as naturally as possible.
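Concretely (restating the formula under discussion, with w_i the weight on hypothesis h_i, P_i its coarse distribution, the inner product/sum running over the atoms E of h_i's sigma algebra, and Q|_i the coarsening of Q to that sigma algebra):

$$Q^* \;=\; \operatorname*{argmax}_{Q \in \Delta W}\; \prod_i \Big(\prod_{E} Q(E)^{P_i(E)}\Big)^{w_i} \;=\; \operatorname*{argmin}_{Q \in \Delta W}\; \sum_i w_i\, D_{KL}\big(P_i \,\|\, Q|_i\big).$$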
I think that the 0s are pretty bad, but I think they are the edge case of the only reasonable thing to do here. I think the reason it feels like the only reasonable thing to do for me is something like credit assignment/hypothesis autonomy. If a world gets probability mass, that should be because some hypothesis or collection of hypotheses insisted on putting probability mass there. You gave an edge case example where this didn’t happen. Maybe everything is edge cases. I am not sure.
It might be that the 0s are not as bad as they seem. 0s seem bad because we have cached that “0 means you can’t update”, but maybe you aren’t supposed to be updating in the output distribution anyway; you are supposed to do your updating in the more general epistemic state input object.
I actually prefer a different proposal for the type of “epistemic state that is closed under coarsening and mixture” that is more general than the thing I gesture at in the post:
A generalized epistemic state is a (quasi-?)convex function ΔW→R. A standard probability distribution is converted to an epistemic state through P ↦ (Q ↦ D_KL(P||Q)). A generalized epistemic state is converted to a (convex set of) probability distribution(s) by taking an argmin. Mixture is mixture as functions, and coarsening is the obvious thing (given a function W→V, we can convert a generalized epistemic state over V to a generalized epistemic state over W by precomposing with the obvious function from ΔW to ΔV).
The above proposal comes together into the formula we have been talking about, but you can also imagine having generalized epistemic states that didn’t come from mixtures of coarse distributions.
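A minimal sketch of that more general proposal, with my own placeholder encodings (worlds are indexed 0..n−1, a coarsening W→V is a list f with f[w] giving the index of w's cell, and the argmin is approximated numerically rather than taken exactly):

```python
# A sketch of the "generalized epistemic state" proposal above, with placeholder
# encodings: worlds are 0..n-1, a state is a function Delta(W) -> R represented
# as a Python callable on probability vectors, a coarsening W -> V is a list f
# with f[w] = index of w's cell, and the argmin is taken numerically.
import numpy as np
from scipy.optimize import minimize

def from_distribution(p):
    """Embed a standard distribution P as the state Q |-> KL(P || Q)."""
    p = np.asarray(p, dtype=float)
    mask = p > 0                                   # 0 * log 0 = 0 convention
    return lambda q: float(np.sum(p[mask] * np.log(p[mask] / np.asarray(q)[mask])))

def mix(states, weights):
    """Mixture of generalized epistemic states is pointwise mixture of functions."""
    return lambda q: sum(w * s(q) for w, s in zip(weights, states))

def coarsen(state_on_V, f):
    """Pull a state on Delta(V) back to Delta(W) by precomposing with the pushforward of f."""
    f = np.asarray(f)
    def pushforward(q):
        v = np.zeros(f.max() + 1)
        np.add.at(v, f, q)                         # sum q over each cell of the partition
        return v
    return lambda q: state_on_V(pushforward(q))

def cash_out(state, n_worlds):
    """Convert a state back to a probability distribution by (approximate) argmin."""
    def objective(z):                              # softmax keeps Q inside Delta(W)
        q = np.exp(z - z.max()); q /= q.sum()
        return state(q)
    res = minimize(objective, np.zeros(n_worlds))
    q = np.exp(res.x - res.x.max())
    return q / q.sum()

# The row/column example again: two coarse hypotheses over a 3x3 grid of 9 worlds.
row_of = [w // 3 for w in range(9)]
col_of = [w % 3 for w in range(9)]
s1 = coarsen(from_distribution([1 / 3, 1 / 3, 1 / 3]), row_of)
s2 = coarsen(from_distribution([0.4, 0.0, 0.6]), col_of)
print(cash_out(mix([s1, s2], [0.5, 0.5]), 9).reshape(3, 3).round(3))
```

Mixing and coarsening never leave the space of such states (mixtures and precompositions with linear maps preserve convexity), which is the closure property mentioned above, and from_distribution followed by cash_out recovers the original distribution, since D_KL(P||Q) is minimized at Q = P.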
[1] ETA: Found the bug, it was fixable by substituting a single character.
[2] Rather than “if a zero falls in the forest and no hypothesis is around to hear it, does it really make a sound?”