Vanessa Kosoy comments on Formal Solution to the Inner Alignment Problem

Vanessa Kosoy 2 Apr 2021 17:03 UTC
LW: 4 AF: 3
Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we’d have trouble.

Yes, you’re right. A malign simulation hypothesis can be a very powerful explanation to the AI for why it found itself at a point suitable for this attack, thereby compressing the “bridge rules” by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.

Here’s a sketch of a proposal for how to solve this. Let’s construct our prior to be the convolution of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that’s sampled as follows:
- First, sample a hypothesis $h$ from the Solomonoff prior
- Second, choose a number $n$ according to some simple distribution with high expected value (e.g. proportional to $n^{-1-α}$ with $α ≪ 1$)
- Third, sample a DFA $A$ with $n$ states and a uniformly random transition table
- Fourth, apply $A$ to the output of $h$
We think of the simplicity prior as choosing “physics” (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing “bridge rules” (which we expect to have low computational complexity but possibly high description complexity). Of course, this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of $O(1)$. However, the source of our trouble is also “merely” a factor of $O(1)$.
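
The four sampling steps above can be sketched in code. This is only a toy illustration under loud assumptions: the Solomonoff prior is uncomputable, so `sample_hypothesis` substitutes a hypothetical stand-in (a repeating bit pattern whose length $L$ has probability roughly $2^{-L}$); the power law for $n$ is truncated at a made-up `n_max`; and the DFA is read as a Mealy-style transducer with random output labels, which is one possible interpretation of “apply $A$ to the output of $h$”. All function names are invented for this sketch.

```python
import random

def sample_hypothesis(rng, max_len=16):
    """Toy stand-in for the Solomonoff prior (ASSUMPTION, not the real thing):
    a bit pattern whose length L has probability about 2^-L, repeated forever."""
    length = 1
    while length < max_len and rng.random() < 0.5:
        length += 1
    return [rng.randint(0, 1) for _ in range(length)]

def sample_num_states(rng, alpha=0.1, n_max=10_000):
    """Sample n in 1..n_max with P(n) proportional to n^(-1-alpha).
    The truncation at n_max is an assumption made to keep the sketch finite."""
    sizes = range(1, n_max + 1)
    weights = [k ** (-1.0 - alpha) for k in sizes]
    return rng.choices(sizes, weights=weights, k=1)[0]

def sample_dfa(rng, n, alphabet=(0, 1)):
    """Uniformly random transition table over n states.
    Each entry also carries a random output symbol (Mealy-style, an assumption)."""
    return [
        {a: (rng.randrange(n), rng.choice(alphabet)) for a in alphabet}
        for _ in range(n)
    ]

def apply_dfa(dfa, stream):
    """Run the transducer from state 0 over the input stream."""
    state, out = 0, []
    for symbol in stream:
        state, emitted = dfa[state][symbol]
        out.append(emitted)
    return out

def sample_observations(seed, steps=32, alpha=0.1):
    """One draw from the toy convolution prior: 'physics' then 'bridge rules'."""
    rng = random.Random(seed)
    pattern = sample_hypothesis(rng)                             # step 1: physics
    n = sample_num_states(rng, alpha)                            # step 2: DFA size
    dfa = sample_dfa(rng, n)                                     # step 3: bridge rules
    physics = [pattern[t % len(pattern)] for t in range(steps)]
    return n, apply_dfa(dfa, physics)                            # step 4: apply A to h
```

Calling `sample_observations(0)` returns the sampled number of states together with a 32-symbol observation sequence. The point of the sketch is only that description length is paid for in step 1, while step 3 spends a budget that grows with $n$ regardless of how compressible the bridge rules are.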

Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the $α \to 0$ limit).
- paulfchristiano 5 Apr 2021 19:47 UTC
  LW: 8 AF: 6
  I broadly think of this approach as “try to write down the ‘right’ universal prior.” I don’t think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of “programming language” to use and so on.
  I’m kind of scared of this approach because I feel that unless you really nail everything, there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that $\frac{ϵ}{δ}$ is manageable, but I think I still find it scary (and don’t totally remember all my sources of concern).
  I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior. That is, if someone inside one of our hypotheses has noticed that e.g. a certain class of decisions is more important and so they will simulate only those situations, then we should also notice this and by the same token care more about our decision if we are in one of those situations (rather than using a universal prior without importance weighting). My sense is that without competitiveness we are in trouble anyway on other fronts, and so it is probably also reasonable to think of it as a first-line defense against this kind of issue.
  We think of the simplicity prior as choosing “physics” (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing “bridge rules” (which we expect to have low computational complexity but possibly high description complexity).
  This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with “giant” universes that do all the possible computations you would want, and then use the “free” complexity in the bridge rules to pick which of the computations you actually wanted. I am not sure if the DFA proposal gets around this kind of problem, though it sounds like it would be pretty similar.
  - Vanessa Kosoy 6 Apr 2021 21:11 UTC
    LW: 2 AF: 1
    
    I’m kind of scared of this approach because I feel that unless you really nail everything, there is going to be a gap that an attacker can exploit.
    
    I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both $ϵ$ and $δ$ in a similar way.
    
    More generally, I guess I’m more optimistic than you about solving all such philosophical liabilities.
    
    I think of this in contrast with my approach based on epistemic competitiveness, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.
    
    I don’t understand the proposal. Is there a link I should read?
    
    This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with “giant” universes that do all the possible computations you would want, and then use the “free” complexity in the bridge rules to pick which of the computations you actually wanted.
    
    So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of selecting the output of just one program. But the bridge rule is not “free complexity” because it’s not coming from a simplicity prior at all. For a program of length $n$, you need a particular DFA of size $Ω(n)$. However, the actual DFA is of expected size $m$ with $m ≫ n$. The probability of having the DFA you need embedded in that is something like $\frac{m!}{(m - n)!} m^{-2n} \approx m^{-n} ≪ 2^{-n}$. So moving everything to the bridge makes for a much less likely hypothesis.
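
The estimate $\frac{m!}{(m - n)!} m^{-2n} \approx m^{-n} ≪ 2^{-n}$ can be sanity-checked numerically. The sketch below works in log base 2 to avoid underflow. One plausible reading of the formula (an assumption, not something stated in the comment) is that $\frac{m!}{(m-n)!}$ counts ordered choices of $n$ of the $m$ states, and $m^{-2n}$ is the chance that both outgoing transitions of each chosen state land on the required targets for a binary alphabet. The function names and the sample values of $m$ and $n$ are made up for illustration.

```python
import math

def log2_exact(m, n):
    """log2 of m!/(m-n)! * m^(-2n), computed as a sum of logs."""
    log_falling = sum(math.log(m - i) for i in range(n))  # ln of m!/(m-n)!
    return (log_falling - 2 * n * math.log(m)) / math.log(2)

def log2_approx(m, n):
    """log2 of the approximation m^(-n)."""
    return -n * math.log2(m)

# Compare the exact expression, the m^-n approximation, and the 2^-n
# weight a simplicity prior would assign to an n-bit program.
for m, n in [(10**4, 10), (10**6, 100)]:
    print(f"m={m}, n={n}: exact={log2_exact(m, n):.3f}, "
          f"approx={log2_approx(m, n):.3f}, simplicity={-n}")
```

For $m ≫ n$ the exact and approximate exponents agree closely, and both are far below $-n$, which is the sense in which pushing the program into the bridge rules costs more than it saves.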