A tractable, interpretable formulation of approximate conditioning for pairwise-specified probability distributions over truth values
These results from my conversations with Charlie Steiner at the May 29-31 MIRI Workshop on Logical Uncertainty will primarily be of interest to people who’ve read section 2.4 of Paul Christiano’s Non-Omniscience paper.
If we write a reasoner that keeps track of the probabilities of a collection of sentences φ1,…,φn (a collection that grows and shrinks as the reasoner explores), we need some way of tracking known relationships between the sentences. One way of doing this is to store the pairwise probability distributions, i.e. not only Pr(φi) for all i but also Pr(φi∧φj) for all i,j.
If we do this, a natural question to ask is: how can we update this data structure if we learn that, e.g., φ1 is true?
We’ll refer to the updated probabilities as Pr(⋅|φ1).
It’s fairly reasonable for us to want to set Pr(φi|φ1):=Pr(φi∧φ1)/Pr(φ1); however, it’s less clear what values to assign to Pr(φi∧φj|φ1), because we haven’t stored Pr(φi∧φj∧φ1).
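As a concrete illustration, here is a minimal sketch of this data structure and the exact update of the marginals (the class and method names are mine, not from the paper):

```python
import numpy as np

# A minimal sketch of the data structure described above: store Pr(phi_i) and
# Pr(phi_i AND phi_j), and update the marginals exactly when some phi_k is
# learned to be true.
class PairwiseTracker:
    def __init__(self, marginals, pairwise):
        # marginals[i]   = Pr(phi_i)
        # pairwise[i][j] = Pr(phi_i AND phi_j); the diagonal repeats the marginals
        self.p = np.asarray(marginals, dtype=float)
        self.pp = np.asarray(pairwise, dtype=float)

    def condition_marginals(self, k):
        # Pr(phi_i | phi_k) = Pr(phi_i AND phi_k) / Pr(phi_k)
        return self.pp[:, k] / self.p[k]

# Two sentences with Pr(phi_0) = Pr(phi_1) = 0.5 and Pr(phi_0 AND phi_1) = 0.4:
t = PairwiseTracker([0.5, 0.5], [[0.5, 0.4], [0.4, 0.5]])
print(t.condition_marginals(0))  # Pr(. | phi_0) = [1.0, 0.8]
```

Note that this only updates the marginals; the question in the text is precisely what to do with the pairwise entries, which this sketch leaves untouched.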
One option would be to find the maximum entropy distribution over truth assignments to φ1,…,φn under the constraint that the stored pairwise distributions are correct. This seems intractable for large n; however, in the spirit of locality, we could restrict our attention to the joint truth value distribution of φ1,φi,φj. Maximizing its entropy is simple (it boils down to either convex optimization or solving a cubic), and yields a plausible candidate for Pr(φi∧φj∧φ1) that we can derive Pr(φi∧φj|φ1) from. I’m not sure what global properties this has, for example whether it yields a positive semidefinite matrix (Pr(φi∧φj))i,j.
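To see why the triple-wise problem is easy, note that once the three marginals and three pairwise probabilities are fixed, the joint distribution of the triple has a single free parameter t = Pr(φ1∧φi∧φj), and entropy is concave in t. The sketch below (my own illustration, not code from the paper) finds the maximum-entropy t by a one-dimensional search:

```python
import numpy as np

# Local max-entropy completion of a triple: given Pr(phi_1), Pr(phi_i),
# Pr(phi_j) and the three pairwise values, the joint distribution of the
# triple has one free parameter t = Pr(phi_1 AND phi_i AND phi_j).
# Entropy is concave in t, so a one-dimensional search suffices.
def maxent_triple(p1, pi, pj, p1i, p1j, pij):
    def atoms(t):
        # probabilities of the 8 truth assignments (b1, bi, bj) as functions of t
        return np.array([
            t,                                       # (1, 1, 1)
            p1i - t,                                 # (1, 1, 0)
            p1j - t,                                 # (1, 0, 1)
            pij - t,                                 # (0, 1, 1)
            p1 - p1i - p1j + t,                      # (1, 0, 0)
            pi - p1i - pij + t,                      # (0, 1, 0)
            pj - p1j - pij + t,                      # (0, 0, 1)
            1 - p1 - pi - pj + p1i + p1j + pij - t,  # (0, 0, 0)
        ])
    # feasible range of t: every atom must be nonnegative
    lo = max(0.0, p1i + p1j - p1, p1i + pij - pi, p1j + pij - pj)
    hi = min(p1i, p1j, pij, 1 - p1 - pi - pj + p1i + p1j + pij)
    ts = np.linspace(lo, hi, 20001)
    a = atoms(ts)                                    # shape (8, 20001)
    entropy = -np.sum(
        np.where(a > 0, a * np.log(np.clip(a, 1e-300, None)), 0.0), axis=0)
    return ts[np.argmax(entropy)]

# Three independent fair coins: the max-entropy triple probability is 1/8.
print(round(maxent_triple(0.5, 0.5, 0.5, 0.25, 0.25, 0.25), 3))  # -> 0.125
```

A grid search is used here for simplicity; since the objective is concave in t, solving the first-order condition (the cubic mentioned above) would give the same answer exactly.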
A different option, as noted in section 2.4.2, is to observe that the matrix (Pr(φi∧φj))i,j must be positive semidefinite under any joint distribution for the truth values. This means we can consider a zero-mean multivariate normal distribution with this matrix as its covariance; then there’s a closed-form expression for the Kullback-Leibler divergence of two such distributions, and this can be used to define a sort of conditional distribution, as is done in section 2.4.3.
However, as the paper remarks, this isn’t a very familiar way of defining these updated probabilities. For example, it lacks the desirable property that Pr(φi|φ1)=Pr(φi∧φ1)/Pr(φ1).
Fortunately, there is a natural construction that combines these ideas: if we consider the maximum-entropy distribution for the truth-assignment vector (1φ1,…,1φn) with the given second moments E(1φi1φj), but relax the requirement that the values lie in {0,1}, then we find the multivariate normal distribution N((Pr(φi))i, (Pr(φi∧φj)−Pr(φi)Pr(φj))i,j). If we update this distribution after observing φ1 by finding the distribution over (1φ1,…,1φn) of minimal relative entropy to it, subject to Pr(1φ1=1)=1, as proposed in the paper, then we get the usual multivariate normal conditional distribution N((Pr(φ1∧φi)/Pr(φ1))i, (Pr(φi∧φj) − Pr(φi)Pr(φj) − (Pr(φ1∧φi) − Pr(φ1)Pr(φi))(Pr(φ1∧φj) − Pr(φ1)Pr(φj))/(Pr(φ1) − Pr(φ1)²))i,j).
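This construction is easy to compute. The sketch below (my own, with made-up toy numbers) builds the moment-matched normal from the pairwise table and conditions on φ1 = 1 using the standard Gaussian conditioning formulas; the resulting mean matches exact conditioning, as claimed:

```python
import numpy as np

# Build the moment-matched Gaussian N(mean, cov) from the stored pairwise
# table, then condition on phi_k = 1 with the standard multivariate normal
# conditioning formulas.
def gaussian_condition(p, pp, k=0):
    # p[i] = Pr(phi_i), pp[i, j] = Pr(phi_i AND phi_j)
    mean = np.asarray(p, dtype=float)
    pp = np.asarray(pp, dtype=float)
    cov = pp - np.outer(mean, mean)
    resid = 1.0 - mean[k]                 # observed value minus prior mean
    cond_mean = mean + cov[:, k] * resid / cov[k, k]
    cond_cov = cov - np.outer(cov[:, k], cov[:, k]) / cov[k, k]
    return cond_mean, cond_cov

# Toy numbers: Pr(phi_0) = Pr(phi_1) = 0.5, Pr(phi_0 AND phi_1) = 0.4.
m, C = gaussian_condition([0.5, 0.5], [[0.5, 0.4], [0.4, 0.5]], k=0)
# The conditional mean agrees with exact conditioning,
# Pr(phi_i | phi_0) = Pr(phi_0 AND phi_i) / Pr(phi_0):
print(m)  # -> [1.  0.8]
```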
Note that in general this has Var(1φi|φ1) ≠ E(1φi|φ1)(1 − E(1φi|φ1)), so the conditional moments no longer match those of any {0,1}-valued variable. This is related to the fact that a conditional variance in a multivariate normal is never higher than the marginal variance, an undesirable feature for a distribution over truth values: learning one sentence’s truth value should sometimes be able to increase our uncertainty about another.
This is also related to other undesirable features; for example, if we condition on more than one sentence, we can arrive at conditional probabilities outside of [0,1]. (For example, if 3 sentences have Pr(φ1)=Pr(φ2)=Pr(φ3)=1/3 and Pr(φ1∧φ2)=Pr(φ1∧φ3)=Pr(φ2∧φ3)=ε, then this yields Pr(φ3|φ1,φ2)=(−1+15ε)/(1+9ε)≈−1; this makes sense because this prior is very confident that 1φ1+1φ2+1φ3≈1, with standard deviation √(6ε).)
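The example above can be checked numerically: conditioning the moment-matched Gaussian on φ1 and φ2 both being true drives the "probability" of φ3 well below zero, matching the closed form.

```python
import numpy as np

# Numerical check of the three-sentence example: marginals 1/3, all pairwise
# probabilities eps.
eps = 1e-3
p = np.full(3, 1/3)
pp = np.full((3, 3), eps)
np.fill_diagonal(pp, 1/3)
cov = pp - np.outer(p, p)

# standard Gaussian conditioning of x_3 on (x_1, x_2) = (1, 1)
S11 = cov[:2, :2]
s12 = cov[2, :2]
cond_mean = p[2] + s12 @ np.linalg.solve(S11, np.ones(2) - p[:2])

print(cond_mean)                        # approx -0.976
print((-1 + 15 * eps) / (1 + 9 * eps))  # closed form agrees
```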
Intermediate relaxations that avoid these particular shortcomings are possible, such as those that restrict the relaxed values x1,…,xn of 1φ1,…,1φn to the sphere ∑i(2xi−1)²=n or the ball ∑i(2xi−1)²≤n. The maximum-entropy distribution then has quadratic log-density, as a multivariate normal does, though the Hessian of the quadratic may have nonnegative eigenvalues (which cannot happen in the normal case). In the spherical case, this is known as a Fisher–Bingham distribution.
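To give a feel for why these distributions are awkward to normalize: a Fisher–Bingham-style density on the sphere has log-density xᵀAx + bᵀx with no elementary closed-form normalizing constant, so even that constant must be estimated, e.g. by Monte Carlo over uniform samples on the sphere. A sketch (the A and b below are arbitrary choices for illustration):

```python
import numpy as np

# Monte Carlo estimate of the normalizing constant of an unnormalized
# density exp(x^T A x + b^T x) on the unit sphere, relative to the uniform
# measure. A and b are arbitrary; this only illustrates the procedure.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                               # symmetric quadratic term
b = rng.standard_normal(n)

x = rng.standard_normal((100_000, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)   # uniform points on the sphere
log_dens = np.einsum('ki,ij,kj->k', x, A, x) + x @ b
Z = np.exp(log_dens).mean()                     # normalizer w.r.t. uniform measure
print(Z)
```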
Both of these relaxations seem difficult to work with, e.g. when computing normalizing constants; furthermore, I don’t think the analogous updating process will share the desirable property that Pr(φi|φ1)=Pr(φi∧φ1)/Pr(φ1). However, the fact that these distributions allow updating by relaxed conditioning, keep (fully conditioned) truth values between 0 and 1, and have reasonable (at least, possibly increasing) behavior for conditional variances makes them seem potentially appealing.