The Two-Update Problem, Part 3
Relevant posts: Part 1, Part 2, No good conditional, Monotonicity.
In the previous two parts of this sequence, I described the two different forms of the two-update problem and the attempt to solve them both at the same time by starting with a prior that knows nothing about logic and updating it on logical dependencies. The problem at the end of part 2 was that neither the simple coin-flip distribution nor the Solomonoff distribution remains well-defined when we condition on logic. A variant of the Solomonoff distribution with oracle machines can be updated in this way, but the result is not approximable.
Here, I discuss approaches to making such a thing approximable.
The reason that the Solomonoff distribution (powerful as it may be) cannot be updated on logic is that it outputs each bit sequentially, and there is no Turing machine which outputs the tautologies in order from smallest to largest (while never outputting a contradiction); if there were such a machine, logical consequence would be decidable rather than merely recursively enumerable.
One way to try to fix this is to ask the machines to enumerate logical sentences rather than decide them. We can use the same UTM (fed random bits) as in the Solomonoff distribution, but rather than letting the nth bit determine the truth value of the sentence with Goedel number n, we interpret the output as describing a sequence of sentences in arbitrary order. For example, n 1s in a row might mean “interpret the next n bits as the Goedel number of a sentence”. After those n bits are interpreted, we again look for the length of the next sequence of 1s, and so on. We interpret the machine as asserting each sentence it enumerates in this manner. (This encoding isn’t particularly efficient, but it doesn’t have to be—the Turing machine can represent more efficient encodings.)
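To make the encoding concrete, here is a minimal Python sketch of decoding such an output stream into a list of asserted Goedel numbers. The specific conventions (a 0 terminating each run of 1s, reading the code as a binary numeral) are illustrative assumptions, not a fixed choice made above.

```python
# A minimal sketch of decoding an output stream into asserted Goedel numbers:
# a run of n 1s (terminated by a 0) means "read the next n bits as the binary
# Goedel number of a sentence", then repeat.  The 0-terminator and the binary
# reading are illustrative assumptions.

def decode_enumeration(bits, max_sentences=10):
    """Decode an iterable of bits into a list of asserted Goedel numbers."""
    bits = iter(bits)
    numbers = []
    try:
        while len(numbers) < max_sentences:
            n = 0
            while next(bits) == 1:   # count the run of 1s
                n += 1
            if n == 0:
                continue             # empty run; keep scanning
            code = 0
            for _ in range(n):       # read n bits as a binary numeral
                code = (code << 1) | next(bits)
            numbers.append(code)
    except StopIteration:
        pass                         # ran out of bits; stop enumerating
    return numbers

# "110 10" asserts Goedel number 2; "10 1" asserts Goedel number 1.
print(decode_enumeration([1, 1, 0, 1, 0, 1, 0, 1]))   # [2, 1]
```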
This includes the truths of logic as a hypothesis in the space, since there exists a Turing machine which takes the axioms and inference rules of logic and outputs each sentence which can be proved from those. Similarly, there exist modified machines with more axioms, so we have all the interesting logical hypotheses we could want.
How do we “update on logic”? As in the Demski prior, we can sample theories, but with a bounded-time consistency check running in the background, eliminating a sample if it is shown to be inconsistent. If we keep increasing the consistency-check time while we’re sampling, we will eventually converge.
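To illustrate the sampling scheme, here is a toy sketch in Python. Real consistency checking would be a growing, bounded-time proof search over first-order sentences; in this cartoon, sentences are just signed integers with -x standing in for ¬x, so a contradiction is simply asserting both x and -x. All names and parameters are illustrative, not part of the construction.

```python
import random

# Toy version of "sample theories, run a bounded consistency check, and throw
# out samples caught asserting a contradiction".  Sentences are signed ints;
# -x plays the role of "not x".  In the real construction the check is a
# bounded-time proof search whose budget grows as sampling continues.

def run_enumerator(seed, steps):
    """Stand-in for a UTM fed random bits: emit `steps` toy sentences."""
    rng = random.Random(seed)
    return {rng.choice([-3, -2, -1, 1, 2, 3]) for _ in range(steps)}

def estimate_probability(sentence, samples=5000, steps=3):
    kept = hits = 0
    for seed in range(samples):
        asserted = run_enumerator(seed, steps)
        if any(-s in asserted for s in asserted):
            continue                 # caught a contradiction: discard sample
        kept += 1
        hits += sentence in asserted
    # The probability is relative to the surviving (not-yet-refuted) samples;
    # increasing the run length and check budget refines the estimate.
    return hits / kept if kept else 0.0

print(estimate_probability(1))
```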
Unfortunately, although this results in a probability distribution on sentences, it is not a coherent logical prior. We are eliminating machines which produce contradictions, but we are not eliminating machines which fail to produce one of X or ¬X. The probabilities of X and ¬X will sum to less than one.
We can fiddle with this to make things more palatable. One important improvement is to augment the sentences enumerated by the machine with everything which can be proven from them. This still does not make the measures of X and ¬X sum to one, but it does make it so that A⊢B gives us P(A)≤P(B). This amounts to Dempster-Shafer theory with a universal prior attached.
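Spelling out why deductive closure buys this monotonicity: if A⊢B and every machine's assertion set is closed under provability, then any machine asserting A also asserts B, so the A-event is contained in the B-event. In the display below, M ranges over sampled machines and P(X) is read as the total prior weight of machines asserting X; this notation is a gloss on the construction above rather than anything fixed so far.

```latex
% If A \vdash B and each machine's assertions are closed under provability, then
\{\, M : M \text{ asserts } A \,\}
  \;\subseteq\;
\{\, M : M \text{ asserts } B \,\}
\quad\Longrightarrow\quad
P(A) \;\le\; P(B),
% since P(X) is the total prior weight of the machines asserting X.
```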
We can make the probabilities sum to one by repeatedly drawing sentence-enumerators from this prior, requiring each new draw to be consistent with those drawn so far. This is the idea which led to the modified Demski prior.
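In the same toy setting as before (signed integers as sentences, -x for the negation of x), the repeated-draw idea looks roughly like this; the real construction draws indefinitely and uses genuine, semi-decidable consistency, so treat this purely as a cartoon.

```python
import random

# Toy sketch of the repeated-draw construction: draw sentence-enumerators one
# after another, and accept each new draw only if its assertions are (so far
# as a bounded check can tell) consistent with everything accepted already.

def draw_toy_enumerator(rng, steps=3):
    return {rng.choice([-3, -2, -1, 1, 2, 3]) for _ in range(steps)}

def sample_theory(seed, draws=50):
    rng = random.Random(seed)
    theory = set()
    for _ in range(draws):
        candidate = draw_toy_enumerator(rng)
        combined = theory | candidate
        if not any(-s in combined for s in combined):   # bounded consistency check
            theory = combined                           # accept this draw
    return theory

# Each accepted draw extends the theory; with real consistency checking and
# unboundedly many draws, the probabilities of X and ¬X come to sum to one.
print(sorted(sample_theory(seed=0)))
```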
Unfortunately, none of these variations really addresses the two-update problem. Because the modified Demski prior uses repeated draws to make a complete distribution rather than updating on logic, adding X as an axiom is still quite different from updating on X. The Dempster-Shafer version also has a kind of two-update problem: updating on X being an output of the process is a different thing from updating on X in the Dempster-Shafer style (eliminating outcomes inconsistent with X and then adding the logical consequences of X).
Furthermore, asking machines to enumerate sentences rather than make predictions for all sentences seems to have bad side-effects. If a machine can pick and choose when to speak and when to stay silent, it isn’t forced to stick its neck out and make a judgement; as a result, it does not learn well.
The next idea (which was due either to Scott or Benja) was something we called backup machines.
A backup machine is a generalized Turing machine which is allowed to modify its output tape, but only boundedly often, in a way that ensures it converges to a single output (although we cannot tell, in general, whether it has converged yet). The most basic kind of backup machine works as follows:
Each normal spot on the tape is accompanied by an extra spot for an “edit mark”. The edit marks are initially 0. A backup machine is allowed to go back and change an output so long as the edit mark for that output is currently 0. When it does so, the edit mark for the edited output is set to 1, and all subsequent edit marks are reset to 0. At any given time, then, we can edit exactly those outputs which have not been edited since the most recent edit of any strictly earlier output. This means the machine can change its mind about the nth position only finitely many times, although the number of possible changes grows with n, since each edit of an earlier output resets the nth edit mark.
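Below is a minimal Python sketch of this tape discipline; it treats the initial write of a cell like any other edit, which is a simplifying assumption rather than part of the definition above.

```python
# A minimal sketch of the basic backup-machine tape: each output cell carries
# an edit mark, a cell can be changed only while its mark is 0, and editing it
# sets its mark to 1 while clearing the marks of every later cell.

class BackupTape:
    def __init__(self, length):
        self.out = [0] * length    # output bits
        self.mark = [0] * length   # edit marks, initially all 0

    def edit(self, i, value):
        """Change output i, if its edit mark allows it."""
        if self.mark[i] != 0:
            raise ValueError(f"cell {i} can no longer be edited")
        self.out[i] = value
        self.mark[i] = 1                      # this cell is now locked...
        for j in range(i + 1, len(self.mark)):
            self.mark[j] = 0                  # ...but every later cell reopens

# A short demo: cell 2 can be rewritten again after an earlier cell is edited.
tape = BackupTape(4)
tape.edit(2, 1)   # write cell 2; its mark becomes 1
tape.edit(0, 1)   # editing an earlier cell resets the marks of cells 1..3
tape.edit(2, 0)   # so cell 2 can change its mind once more
print(tape.out)   # [1, 0, 0, 0]
```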
Now we want to apply these machines to logic. Interpret the nth place on the output tape as giving the truth or falsehood of the nth sentence by Goedel number.
Backup machines are sufficient to simulate the sentence-enumeration prior I’ve been discussing: we can output all zeros initially, while running an arbitrary sentence-enumerator in the background. When a sentence is enumerated, we back up to that sentence’s location and set it to 1. This means that we assign nonzero probability to logic, and therefore can update on logic directly (which is our goal). Unlike the sentence-enumeration case, though, we’re assigning a truth value to every sentence, so the probabilities of X and ¬X will sum to 1.
What’s not clear yet is whether updating on logic will give us an approximable distribution. Backup machines themselves are approximable, but it does not follow that the result of constraining them to logically consistent outputs is approximable. We haven’t spent a lot of time thinking about this. If the result is approximable, this approach does seem to solve the two-update problem. It also seems unlikely to have the learning problem which the Demski and modified Demski priors have.
If desired, it’s possible to make increasingly powerful kinds of backup machines using ordinal notations. Fix an ordinal notation, and allow the edit-mark spots to hold entries in that notation. We now mark outputs with ordinals rather than 0 or 1. When an output bit is first written, we choose an ordinal, which must get smaller with each edit. When the ordinal reaches zero, edits are no longer allowed. When we do edit an output, we get an opportunity to re-choose the ordinals for subsequent outputs.
Examples: if we allow natural numbers, then the machine is choosing how many times it can re-write a given spot. If we allow natural numbers plus ω, then the machine can put off choosing a natural number until its first edit of a location; then, it must decrease the ordinal from ω to a finite number of its choice. If we allow ω+n for natural n, then the machine gets to choose how long it defers the choice of a natural number. And so on.
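Here is a toy sketch of the ordinal variant, with ordinals below ω·2 encoded as pairs (w, n) meaning w·ω + n and compared lexicographically. Re-choosing the ordinals of later outputs after an edit is left out for brevity, and the encoding itself is just an illustrative choice.

```python
# Toy sketch of ordinal edit marks.  Ordinals below omega*2 are encoded as
# pairs (w, n) meaning w*omega + n; Python's lexicographic tuple comparison
# then agrees with the ordinal order.  Re-choosing the ordinals of later
# outputs after an edit is omitted here for brevity.

class OrdinalCell:
    def __init__(self, value, ordinal):
        self.value = value
        self.ordinal = ordinal            # e.g. (1, 0) is omega, (0, 5) is 5

    def edit(self, value, new_ordinal):
        if self.ordinal == (0, 0):
            raise ValueError("ordinal exhausted; no further edits allowed")
        if not new_ordinal < self.ordinal:
            raise ValueError("the new ordinal must be strictly smaller")
        self.value, self.ordinal = value, new_ordinal

# Starting at omega, the machine defers its choice: the first edit drops the
# mark to a finite number, and later edits count that number down as usual.
cell = OrdinalCell(0, (1, 0))    # mark = omega
cell.edit(1, (0, 3))             # first edit: from omega down to 3
cell.edit(0, (0, 1))             # keep decreasing
print(cell.value, cell.ordinal)  # 0 (0, 1)
```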