I’m not sure the counterexample given here works as stated. What follows is my attempt to go through the details. (I still think SIA is doing something wrong and is unlikely to work, but it is important to get the details right to be sure.)
As I understood it, the problem was as follows. (Most of this is just re-wording of things you said in the post.)
We wanted to design SIA such that if there is an optimal heuristic for a quickly computable sub-sequence of E, then it would learn to apply that heuristic to those problems. In particular, if the Benford sentences are embedded in a larger sequence such as L, it should use Benford probabilities on that subset.
SIA fails to achieve this sub-sequence optimality because the “objective function” is not decoupled: Bayesian updating “incentivizes” a high joint score, not a high score on individual questions. In particular, each program “wants” to condition on having gotten all previous questions in the sequence correct.
As we’ve discussed extensively in person, this incentive gives an advantage to programs which answer with 1s and 0s deterministically rather than probabilistically. (The stochastic Turing machines will learn to act like deterministic TMs.) The programs badly want to be able to remember their previous answers, so that they can update on them. They can do this by using extra bits in their description length to “memorize” answers, rather than generating answers stochastically. This is worthwhile for the programs even as we make program-description bits more expensive (using RTM(3) rather than RTM(1)), because a memorized logical fact can be used to get correct answers on so many future logical facts. In effect, giving 0s and 1s rather than stochastic answers is such a good strategy that we cannot incentivize programs out of this behavior (without totally washing out the effect of learning).
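To put a rough number on that last point (the symbols here are just for illustration, and I’m not tracking the exact RTM(3) pricing): suppose memorizing an answer costs $c$ extra description bits at a per-bit price of $k$ (say $k=3$ rather than $k=1$), for a one-time prior penalty of $2^{-kc}$, while a stochastic program that instead assigns probability $p<1$ to the fact pays a factor of $p$ each of the $m$ times the fact (or something it entails) gets scored. Memorization wins whenever
$$p^m < 2^{-kc}, \quad\text{i.e.}\quad m > \frac{kc}{\log_2(1/p)},$$
which holds for any fixed $k$ and $c$ once $m$ is large enough, so making description bits more expensive only delays the switch to memorization.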
Rather than gaining traction by giving answers with Benford probabilities, programs gain traction by using appropriate description languages in memory such that the prior on programs will assign Benford probabilities to the different extensions of a program, purely as a matter of program length. This allows SIA to give good probabilities even though the programs in its mixture distributions are learning not to do so.
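Here is a minimal sketch of that mechanism (the program lengths and the base-length constant are made up purely for illustration, and I’m using a stand-in numeric value for the Benford probability): a mixture of purely deterministic “memorizer” programs, weighted by $2^{-\text{length}}$, predicts the Benford probability as long as the extensions that memorize “true” carry the right share of the prior weight.

```python
import math

B = 1 / math.log(10)  # stand-in value for the Benford probability (illustrative only)

# Toy mixture over deterministic "memorizer" programs. Each program simply has the
# answer for the next Benford sentence hard-coded, and its prior weight is
# 2**(-description_length). The lengths are contrived so that extending a common
# 10-bit prefix with "true" costs -log2(B) extra bits and "false" costs -log2(1-B).
programs = [
    {"answer": True,  "length": 10.0 - math.log2(B)},
    {"answer": False, "length": 10.0 - math.log2(1 - B)},
]

total_weight = sum(2 ** -p["length"] for p in programs)
p_true = sum(2 ** -p["length"] for p in programs if p["answer"]) / total_weight

# The mixture assigns probability B to "true" even though neither program randomizes.
print(p_true, B)
```

The point is just that the stochasticity lives in the prior over extensions, not in any individual program.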
Having understood this part of the problem, let’s discuss the example you give in the post.
You consider three sentences: $\phi_{s_n}$, $\phi_{s_{A(n)}}$, and $t_n := \phi_{s_n} \leftrightarrow \phi_{s_{A(n)}}$. We assume that these are interspersed in E, and that SIA has already been trained up to a large $n$ on this kind of problem; we wish to show that the answers for the subsequence $\phi_x$ depend in a problematic way on the answers for the subsequence $t_x$.
The argument seems to be: suppose that the question ordering is such that $\phi_{s_n}$ and $t_n$ are considered long before $\phi_{s_{A(n)}}$. Now, when considering $\phi_{s_{A(n)}}$, the programs will have a lot more time; in particular, they have time to compute the actual answer to $\phi_{s_n}$ from scratch, and also have time to call themselves on $\phi_{s_n}$ and $t_n$ to see what their earlier selves answered for those questions.
We note that the probability $P$ we could independently give to $t_n$ is a specific quantity based on Benford probabilities, but not a Benford probability itself.
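Concretely, writing $B$ for the Benford probability and treating $\phi_{s_n}$ and $\phi_{s_{A(n)}}$ as independent $B$-weighted coins (which is all the Benford heuristic has to go on here), the probability that the two agree is
$$P = \Pr(\phi_{s_n} \wedge \phi_{s_{A(n)}}) + \Pr(\neg\phi_{s_n} \wedge \neg\phi_{s_{A(n)}}) = B^2 + (1-B)^2.$$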
Now I’m going to start filling in parts of the argument which I don’t see spelled out.
We assume that when considering $\phi_{s_{A(n)}}$, SIA has enough time to calculate $\phi_{s_n}$ as an explicit training example. It turns out that this sentence is true. All programs which guessed that it was false are eliminated from the set of possible programs to use on $\phi_{s_{A(n)}}$.
Now, when calling themselves to see what they said for simpler problems, the programs can potentially see that they guessed 1 for $\phi_{s_n}$. They can also see their guess for $t_n$. Some of the programs will have guessed true and some false. Assume that the proportion which guessed true is $P$. I’ll call this assumption (*) for later reference.
The programs have enough information to deduce the answer to $\phi_{s_{A(n)}}$ that is consistent with their answers so far. $\phi_{s_n}$ is true, so they can simply check what they replied on $t_n$ and reply the same thing for $\phi_{s_{A(n)}}$. This is better than guessing with Benford probability, because programs which guessed the wrong answer for $t_n$ will be eliminated later anyway. By assumption (*), we conclude that the probability SIA assigns approaches $P$ as $n$ increases.
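(The deduction here is just the biconditional: since $t_n = (\phi_{s_n} \leftrightarrow \phi_{s_{A(n)}})$, we have $\phi_{s_{A(n)}} \leftrightarrow (t_n \leftrightarrow \phi_{s_n})$, so once $\phi_{s_n}$ is known to be true the consistent answer for $\phi_{s_{A(n)}}$ is exactly the answer already given for $t_n$, and the negation of that answer when $\phi_{s_n}$ is false.)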
Can we justify assumption (*)? Suppose that when the program considers $t_n$, it does not have time to look up the answer it gave for $\phi_{s_n}$. Then its best bet is to answer using the Benford assumption on $\phi_{s_n}$ and $\phi_{s_{A(n)}}$, resulting in the probability estimate $P$.
But this seems potentially unrealistic. The program has learned to memorize $\phi_{s_n}$. It can find what it answered in the time it takes to do a lookup. In this case, the programs are better off having guessed $t_n$ in Benford proportion, conditioned on $\phi_{s_n}$ being true. (They also guess in the reverse of the Benford proportion for the case where $\phi_{s_n}$ is false, but remember that (*) was an assumption specifically about the proportion conditional on $\phi_{s_n}$ being true.)
I believe your concern comes from the fact that, at the time the program has to assign a probability to $\phi_{s_{A(n)}}$, it has not only deduced the truth of $\phi_{s_n}$, but has also, earlier, guessed at the truth value of $\phi_{s_n}$. When it makes that earlier guess it loses some probability mass, but it can lose that probability mass in a way that is correlated with the answer it gave to $t_n$. This way, it can still give the correct probability on $\phi_{s_{A(n)}}$.
Here is my fix: instead of L, consider the case where we are trying to guess only sentences of the form $\phi_{s_{A(n)}}$ and $t_n$ for some $n$. That is, we modify L to reject any sentence not of that form. Both of these subsequences are indistinguishable from coin flips with fixed probabilities. In this case, SIA will not get the correct probabilities on both subsequences, because it has an incentive to make its answers to $\phi_{s_{A(n)}}$ match its answers to $t_n$ (or not match, when $\phi_{s_n}$ is false), and any program that does not make them match will be trumped by one that will.
This does not mean that we have this property when we consider all of L, but the code in no way depends on E, and I see no reason to think that it would work for L but not for the modified L.
I agree, this version works.
To walk through it in a bit more detail:
Now we are only considering two sentence schemas, $\phi_{s_{A(n)}}$ and $t_n := \phi_{s_n} \leftrightarrow \phi_{s_{A(n)}}$. (Also, ignore the (rare) case where $n$ is an Ackermann number.)
I’ll call the Benford probability $B := \frac{1}{\log 10}$, and (as before) the $t_n$ probability $P := B^2 + (1-B)^2$.
At the time when SIA considers $t_n$, we assume it does not give its sampled programs enough time to solve either $\phi_{s_n}$ or $\phi_{s_{A(n)}}$. (This assumption is part of the problem setup; it seems likely that cases like this cannot be ruled out by a simple rule, though.) The best thing the programs can do is treat the $t_n$ like flips of a coin with probability $P$.
At the time when $\phi_{s_{A(n)}}$ is considered, the program has enough time to compute $\phi_{s_n}$ (again as part of the problem setup). It can also remember what guess it made on $t_n$. The best thing it can do now is to logically combine those to determine $\phi_{s_{A(n)}}$. This causes it to not treat $\phi_{s_{A(n)}}$ like a random coin. For cases where $\phi_{s_n}$ is true, the population of sampled programs will guess $\phi_{s_{A(n)}}$ true with frequency approaching $P$. For cases where $\phi_{s_n}$ is false, the frequency will be $1-P$.
This behavior is the optimal response to the problem as given to SIA, but is suboptimal for what we actually wanted. The Bayes score of SIA on the sub-sequence consisting of only the $\phi_{s_{A(n)}}$ is suboptimal. It will average out to probability $B$, but continue to be higher and lower for individual cases, without actually predicting those cases more effectively; SIA is acting like it thinks there is a correlation between $\phi_{s_n}$ and $\phi_{s_{A(n)}}$ when there is none. (This is especially odd considering that SIA isn’t even being asked to predict $\phi_{s_n}$ in general, in this case!)
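To sanity-check this numerically, here is a quick simulation (a sketch of the idealized setup above; it hard-codes the $B$ value from before and treats the earlier, blind $t_n$ guesses as an independent $P$-coin). It checks that, among the cases where $\phi_{s_n}$ is true, the copy strategy guesses $\phi_{s_{A(n)}}$ true with frequency about $P$, and that its average log score on the $\phi_{s_{A(n)}}$ subsequence is worse than just always answering $B$:

```python
import math
import random

random.seed(0)

B = 1 / math.log(10)       # Benford probability as defined above
P = B**2 + (1 - B)**2      # probability that phi_{s_n} and phi_{s_A(n)} agree
N = 200_000                # number of simulated values of n

copy_score = 0.0           # log score of the "copy the t_n guess" strategy
benford_score = 0.0        # log score of always answering B
phi_true_cases = 0
copy_true_given_phi_true = 0

for _ in range(N):
    phi_n = random.random() < B     # truth value of phi_{s_n}
    phi_An = random.random() < B    # truth value of phi_{s_A(n)}, independent of phi_{s_n}
    t_guess = random.random() < P   # the earlier, blind guess at t_n, made as a P-coin

    # Once phi_{s_n} is computed, the consistent answer for phi_{s_A(n)} is the t_n guess
    # (or its negation when phi_{s_n} is false); as a population frequency this amounts
    # to assigning probability P (or 1 - P) to "true".
    copy_guess = t_guess if phi_n else not t_guess
    q = P if phi_n else 1 - P

    copy_score += math.log(q if phi_An else 1 - q)
    benford_score += math.log(B if phi_An else 1 - B)

    if phi_n:
        phi_true_cases += 1
        copy_true_given_phi_true += copy_guess

print("freq. of guessing true when phi_{s_n} is true:",
      copy_true_given_phi_true / phi_true_cases, "(P =", round(P, 4), ")")
print("avg log score, copy strategy:", copy_score / N)
print("avg log score, always B:     ", benford_score / N)
```

The copy strategy scores worse on this subsequence even though it is the better joint strategy, which is exactly the decoupling failure described above.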
This is still not a proof, but it looks like it could be turned into one.
I’m hoping writing it out like this unpacks some of the mutual assumptions Scott and I share as a result of talking about this.
No, we cannot justify (*). In fact, (*) will not even hold. However, if (*) does not hold, I think that is just as bad as failing the Benford test. The $t_n$ sentences are themselves a sequence that is indistinguishable from a sequence coming from a weighted coin. Therefore, failing to assign probability $P$ to the sentences $t_n$ is a strong sign that the code will also give the wrong probability to $\phi_{s_n}$. The two are not qualitatively different.
A formal proof of why it fails has not been written up, but if it is, the conclusion will be that either the $\phi_{s_n}$ or the $t_n$ will have incorrect limiting probabilities.