I’m not sure the counterexample given here works as stated. What follows is my attempt to go through the details. (I still think SIA is doing something wrong and is unlikely to work, but it is important to get the details right to be sure.)
As I understood it, the problem was as follows. (Most of this is just re-wording of things you said in the post.)
We wanted to design SIA such that if there is an optimal heuristic for a quickly computable sub-sequence of E, then it would learn to apply that heuristic to those problems. In particular, if the Benford sentences are embedded in a larger sequence such as L, it should use Benford probabilities on that subset.
SIA fails to achieve this sub-sequence optimality because the “objective function” is not decoupled: Bayesian updating “incentivizes” a high joint score, not a high score on individual questions. In particular, each program “wants” to condition on having gotten all previous questions in the sequence correct.
As we’ve discussed extensively in person, this incentive gives an advantage to programs which answer with 1s and 0s deterministically rather than probabilistically. (The stochastic Turing machines will learn to act like deterministic TMs.) The programs badly want to be able to remember their previous answers, so that they can update on them. They can do this by using extra bits in their description length to “memorize” answers, rather than generating answers stochastically. This is worthwhile for the programs even as we make program-description bits more expensive (using RTM(3) rather than RTM(1)), because a memorized logical fact can be used to get correct answers on so many future logical facts. In effect, giving 0s and 1s rather than stochastic answers is such a good strategy that we cannot incentivize programs out of this behavior (without totally washing out the effect of learning).
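To put a rough number on that last point (the symbols here are just for illustration, and I’m not tracking the exact RTM(3) pricing): suppose memorizing an answer costs $c$ extra description bits at a per-bit price of $k$ (say $k=3$ rather than $k=1$), for a one-time prior penalty of $2^{-kc}$, while a stochastic program that instead assigns probability $p<1$ to the fact pays a factor of $p$ each of the $m$ times the fact (or something it entails) gets scored. Memorization wins whenever
$$p^m < 2^{-kc}, \quad\text{i.e.}\quad m > \frac{kc}{\log_2(1/p)},$$
which holds for any fixed $k$ and $c$ once $m$ is large enough, so making description bits more expensive only delays the switch to memorization.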
Rather than gaining traction by giving answers with Benford probabilities, programs gain traction by using appropriate description languages in memory such that the prior on programs will assign Benford probabilities to the different extensions of a program, purely as a matter of program length. This allows SIA to give good probabilities even though the programs in its mixture distributions are learning not to do so.
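Here is a minimal sketch of that mechanism (the program lengths and the base-length constant are made up purely for illustration, and I’m using a stand-in numeric value for the Benford probability): a mixture of purely deterministic “memorizer” programs, weighted by $2^{-\text{length}}$, predicts the Benford probability as long as the extensions that memorize “true” carry the right share of the prior weight.

```python
import math

B = 1 / math.log(10)  # stand-in value for the Benford probability (illustrative only)

# Toy mixture over deterministic "memorizer" programs. Each program simply has the
# answer for the next Benford sentence hard-coded, and its prior weight is
# 2**(-description_length). The lengths are contrived so that extending a common
# 10-bit prefix with "true" costs -log2(B) extra bits and "false" costs -log2(1-B).
programs = [
    {"answer": True,  "length": 10.0 - math.log2(B)},
    {"answer": False, "length": 10.0 - math.log2(1 - B)},
]

total_weight = sum(2 ** -p["length"] for p in programs)
p_true = sum(2 ** -p["length"] for p in programs if p["answer"]) / total_weight

# The mixture assigns probability B to "true" even though neither program randomizes.
print(p_true, B)
```

The point is just that the stochasticity lives in the prior over extensions, not in any individual program.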
Having understood this part of the problem, let’s discuss the example you give in the post.
You consider three sentences: $\phi_{s_n}$, $\phi_{s_{A(n)}}$, and $t_n := \phi_{s_n} \leftrightarrow \phi_{s_{A(n)}}$. We assume that these are interspersed in E, and that SIA has already been trained up to a large $n$ on this kind of problem; we wish to show that the answers for the subsequence $\phi_x$ depend in a problematic way on the answers for the subsequence $t_x$.
The argument seems to be: suppose that the question ordering is such that $\phi_{s_n}$ and $t_n$ are considered long before $\phi_{s_{A(n)}}$. Now, when considering $\phi_{s_{A(n)}}$, the programs will have a lot more time; in particular, they have time to compute the actual answer to $\phi_{s_n}$ from scratch, and also have time to call themselves on $\phi_{s_n}$ and $t_n$ to see what their earlier selves answered for those questions.
We note that the probability $P$ we could independently give to $t_n$ is a specific quantity based on Benford probabilities, but not a Benford probability itself.
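Concretely, writing $B$ for the Benford probability and treating $\phi_{s_n}$ and $\phi_{s_{A(n)}}$ as independent $B$-weighted coins (which is all the Benford heuristic has to go on here), the probability that the two agree is
$$P = \Pr(\phi_{s_n} \wedge \phi_{s_{A(n)}}) + \Pr(\neg\phi_{s_n} \wedge \neg\phi_{s_{A(n)}}) = B^2 + (1-B)^2.$$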
Now I’m going to start filling in parts of the argument which I don’t see spelled out.
We assume that when considering $\phi_{s_{A(n)}}$, SIA has enough time to calculate $\phi_{s_n}$ as an explicit training example. It turns out that this sentence is true. All programs which guessed that it was false are eliminated from the set of possible programs to use on $\phi_{s_{A(n)}}$.
Now, when calling themselves to see what they said for simpler problems, the programs can potentially see that they guessed 1 for $\phi_{s_n}$. They can also see their guess for $t_n$. Some of the programs will have guessed true and some false. Assume that the proportion which guessed true is $P$. I’ll call this assumption (*) for later reference.
The programs have enough information to deduce the answer to $\phi_{s_{A(n)}}$ that is consistent with their answers so far. $\phi_{s_n}$ is true, so they can simply check what they replied on $t_n$ and reply the same thing for $\phi_{s_{A(n)}}$. This is better than guessing with Benford probability, because programs which guessed the wrong answer for $t_n$ will be eliminated later anyway. By assumption (*), we conclude that the probability SIA assigns approaches $P$ as $n$ increases.
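(The deduction here is just the biconditional: since $t_n = (\phi_{s_n} \leftrightarrow \phi_{s_{A(n)}})$, we have $\phi_{s_{A(n)}} \leftrightarrow (t_n \leftrightarrow \phi_{s_n})$, so once $\phi_{s_n}$ is known to be true the consistent answer for $\phi_{s_{A(n)}}$ is exactly the answer already given for $t_n$, and the negation of that answer when $\phi_{s_n}$ is false.)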
Can we justify assumption (*)? Suppose that when the program considers $t_n$, it does not have time to look up the answer it gave for $\phi_{s_n}$. Then its best bet is to answer using the Benford assumption on $\phi_{s_n}$ and $\phi_{s_{A(n)}}$, resulting in the probability estimate $P$.
But this seems potentially unrealistic. The program has learned to memorize $\phi_{s_n}$. It can find what it answered in the time it takes to do a lookup. In this case, the programs are better off having guessed $t_n$ in Benford proportion, conditioned on $\phi_{s_n}$ being true. (They also guess in the reverse of the Benford proportion for the case where $\phi_{s_n}$ is false, but remember that (*) was an assumption specifically about the proportion conditional on $\phi_{s_n}$ being true.)
I believe your concern comes from the fact that, at the time the program has to assign a probability to $\phi_{s_{A(n)}}$, it has not only deduced the truth of $\phi_{s_n}$, but has also, earlier, guessed at the truth value of $\phi_{s_n}$. When it makes that earlier guess it loses some probability mass, but it can lose that probability mass in a way that is correlated with the answer it gave to $t_n$. This way, it can still give the correct probability on $\phi_{s_{A(n)}}$.
Here is my fix: instead of L, consider the case where we are trying to guess only sentences of the form $\phi_{s_{A(n)}}$ and $t_n$ for some $n$. That is, we modify L to reject any sentence not of that form. Both of these subsequences are indistinguishable from coin flips with fixed probabilities. In this case, SIA will not get the correct probabilities on both subsequences, because it has an incentive to make its answers to $\phi_{s_{A(n)}}$ match its answers to $t_n$ (or not match, when $\phi_{s_n}$ is false), and any program that does not make them match will be trumped by one that will.
This does not mean that we have this property when we consider all of L, but the code in no way depends on E, and I see no reason to think that it would work for L but not for the modified L.
I agree, this version works.
To walk through it in a bit more detail:
Now we are only considering two sentence schemas, $\phi_{s_{A(n)}}$ and $t_n := \phi_{s_n} \leftrightarrow \phi_{s_{A(n)}}$. (Also, ignore the (rare) case where $n$ is an Ackermann number.)
I’ll call the Benford probability $B := \frac{1}{\log 10}$, and (as before) the $t_n$ probability $P := B^2 + (1-B)^2$.
At the time when SIA considers $t_n$, we assume it does not give its sampled programs enough time to solve either $\phi_{s_n}$ or $\phi_{s_{A(n)}}$. (This assumption is part of the problem setup; it seems likely that cases like this cannot be ruled out by a simple rule, though.) The best thing the programs can do is treat the $t_n$ like flips of a coin with probability $P$.
At the time when $\phi_{s_{A(n)}}$ is considered, the program has enough time to compute $\phi_{s_n}$ (again as part of the problem setup). It can also remember what guess it made on $t_n$. The best thing it can do now is to logically combine those to determine $\phi_{s_{A(n)}}$. This causes it to not treat $\phi_{s_{A(n)}}$ like a random coin. For cases where $\phi_{s_n}$ is true, the population of sampled programs will guess $\phi_{s_{A(n)}}$ true with frequency approaching $P$. For cases where $\phi_{s_n}$ is false, the frequency will be $1-P$.
This behavior is the optimal response to the problem as given to SIA, but is suboptimal for what we actually wanted. The Bayes score of SIA on the sub-sequence consisting of only the $\phi_{s_{A(n)}}$ is suboptimal. It will average out to probability $B$, but continue to be higher and lower for individual cases, without actually predicting those cases more effectively; SIA is acting like it thinks there is a correlation between $\phi_{s_n}$ and $\phi_{s_{A(n)}}$ when there is none. (This is especially odd considering that SIA isn’t even being asked to predict $\phi_{s_n}$ in general, in this case!)
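To sanity-check this numerically, here is a quick simulation (a sketch of the idealized setup above; it hard-codes the $B$ value from before and treats the earlier, blind $t_n$ guesses as an independent $P$-coin). It checks that, among the cases where $\phi_{s_n}$ is true, the copy strategy guesses $\phi_{s_{A(n)}}$ true with frequency about $P$, and that its average log score on the $\phi_{s_{A(n)}}$ subsequence is worse than just always answering $B$:

```python
import math
import random

random.seed(0)

B = 1 / math.log(10)       # Benford probability as defined above
P = B**2 + (1 - B)**2      # probability that phi_{s_n} and phi_{s_A(n)} agree
N = 200_000                # number of simulated values of n

copy_score = 0.0           # log score of the "copy the t_n guess" strategy
benford_score = 0.0        # log score of always answering B
phi_true_cases = 0
copy_true_given_phi_true = 0

for _ in range(N):
    phi_n = random.random() < B     # truth value of phi_{s_n}
    phi_An = random.random() < B    # truth value of phi_{s_A(n)}, independent of phi_{s_n}
    t_guess = random.random() < P   # the earlier, blind guess at t_n, made as a P-coin

    # Once phi_{s_n} is computed, the consistent answer for phi_{s_A(n)} is the t_n guess
    # (or its negation when phi_{s_n} is false); as a population frequency this amounts
    # to assigning probability P (or 1 - P) to "true".
    copy_guess = t_guess if phi_n else not t_guess
    q = P if phi_n else 1 - P

    copy_score += math.log(q if phi_An else 1 - q)
    benford_score += math.log(B if phi_An else 1 - B)

    if phi_n:
        phi_true_cases += 1
        copy_true_given_phi_true += copy_guess

print("freq. of guessing true when phi_{s_n} is true:",
      copy_true_given_phi_true / phi_true_cases, "(P =", round(P, 4), ")")
print("avg log score, copy strategy:", copy_score / N)
print("avg log score, always B:     ", benford_score / N)
```

The copy strategy scores worse on this subsequence even though it is the better joint strategy, which is exactly the decoupling failure described above.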
This is still not a proof, but it looks like it could be turned into one.
I’m hoping writing it out like this unpacks some of the mutual assumptions Scott and I share as a result of talking about this.
No, we cannot justify (*). In fact, (*) will not even hold. However, if (*) does not hold, I think that is just as bad as failing the Benford test. The $t_n$ sentences are themselves a sequence that is indistinguishable from a sequence coming from a weighted coin. Therefore, failing to assign probability $P$ to the sentences $t_n$ is a strong sign that the code will also give the wrong probability to $\phi_{s_n}$. The two are not qualitatively different.
A formal proof of why it fails has not been written up, but if it is, the conclusion will be that either the $\phi_{s_n}$ or the $t_n$ will have incorrect limiting probabilities.