Vanessa Kosoy comments on Formal Solution to the Inner Alignment Problem

Vanessa Kosoy Feb 26, 2021, 2:10 PM
LW: 10 AF: 8
AF
Here’s the sketch of a solution to the query complexity problem.

Simplifying technical assumptions:
- The action space is ${0, 1}$
- All hypotheses are deterministic
- Predictors output maximum likelihood predictions instead of sampling the posterior
I’m pretty sure removing those is mostly just a technical complication.

Safety assumptions:
- The real hypothesis has prior probability lower bounded by some known quantity $δ$ , so we discard all hypotheses of probability less than $δ$ from the onset.
- Malign hypotheses have total prior probability mass upper bounded by some known quantity $ϵ$ (determining parameters $δ$ and $ϵ$ is not easy, but I’m pretty sure any alignment protocol will have parameters of some such sort.)
- Any predictor that uses a subset of the prior without malign hypotheses is safe (I would like some formal justification for this, but seems plausible on the face of it.)
- Given any safe behavior, querying the user instead of producing a prediction cannot make it unsafe (this can and should be questioned but let it slide for now.)
Algorithm: On each round, let $p_{0}$ be the prior probability mass of unfalsified hypotheses predicting $0$ and $p_{1}$ be the same for $1$ .
- If $p_{0} > p_{1} + ϵ$ or $p_{1} = 0$ , output $0$
- If $p_{1} > p_{0} + ϵ$ or $p_{0} = 0$ , output $1$
- Otherwise, query the user
Analysis: As long as $p_{0} + p_{1} ≫ ϵ$ , on each round we query, $p_{0} + p_{1}$ is halved. Therefore there can only be roughly $O (log \frac{1}{ϵ})$ such rounds. On the following rounds, each round we query at least one hypothesis is removed, so there can only be roughly $O (\frac{ϵ}{δ})$ such rounds. The total query complexity is therefore approximately $O (log \frac{1}{ϵ} + \frac{ϵ}{δ})$ .
What links here?
- Vanessa Kosoy's comment on Formal Solution to the Inner Alignment Problem by michaelcohen (Feb 25, 2021, 7:52 PM; 2 points)
- paulfchristiano Mar 19, 2021, 10:12 PM
  LW: 8 AF: 6
  0
  AF Parent
  I agree that this settles the query complexity question for Bayesian predictors and deterministic humans.
  I expect it can be generalized to have complexity $O (\frac{ϵ^{2}}{δ^{2}})$ in the case with stochastic humans where treacherous behavior can take the form of small stochastic shifts.
  I think that the big open problems for this kind of approach to inner alignment are:
  - Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we’d have trouble. (I believe this is also Eliezer’s view.)
  - It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all. Then this algorithm increases the cost of inference by $\frac{ϵ}{δ}$ which could be a big problem.
  - As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here)
  - Vanessa Kosoy Apr 2, 2021, 5:03 PM
    LW: 4 AF: 3
    AF Parent
    Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we’d have trouble.
    
    Yes, you’re right. A malign simulation hypothesis can be a very powerful explanation to the AI for the why it found itself at a point suitable for this attack, thereby compressing the “bridge rules” by a lot. I believe you argued as much in your previous writing, but I managed to confuse myself about this.
    
    Here’s the sketch of a proposal how to solve this. Let’s construct our prior to be the convolution of a simplicity prior with a computational easiness prior. As an illustration, we can imagine a prior that’s sampled as follows:
    
    First, sample a hypothesis $h$ from the Solomonoff prior
    Second, choose a number $n$ according to some simple distribution with high expected value (e.g. $n^{- 1 - α}$ ) with $α ≪ 1$
    Third, sample a DFA $A$ with $n$ states and a uniformly random transition table
    Fourth, apply $A$ to the output of $h$
    
    We think of the simplicity prior as choosing “physics” (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing “bridge rules” (which we expect to have low computational complexity but possibly high description complexity). Ofc this convolution can be regarded as another sort of simplicity prior, so it differs from the original simplicity prior merely by a factor of $O (1)$ , however the source of our trouble is also “merely” a factor of $O (1)$ .
    
    Now the simulation hypothesis no longer has an advantage via the bridge rules, since the bridge rules have a large constant budget allocated to them anyway. I think it should be possible to make this into some kind of theorem (two agents with this prior in the same universe that have access to roughly the same information should have similar posteriors, in the $α \to 0$ limit).
    What links here?
    Vanessa Kosoy's comment on Formal Solution to the Inner Alignment Problem by michaelcohen (Mar 20, 2021, 6:43 PM; 2 points)
    - paulfchristiano Apr 5, 2021, 7:47 PM
      LW: 8 AF: 6
      AF Parent
      I broadly think of this approach as “try to write down the ‘right’ universal prior.” I don’t think the bridge rules / importance-weighting consideration is the only way in which our universal prior is predictably bad. There are also issues like anthropic update and philosophical considerations about what kind of “programming language” to use and so on.
      I’m kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit. I guess you just need to get close enough that $\frac{ε}{δ}$ is manageable but I think I still find it scary (and don’t totally remember all my sources of concern).
      I think of this in contrast with my approach based on epistemic competitiveness approach, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior. That is, if someone inside one of our hypotheses has noticed that e.g. a certain class of decisions is more important and so they will simulate only those situations, then we should also notice this and by the same token care more about our decision if we are in one of those situations (rather than using a universal prior without importance weighting). My sense is that without competitiveness we are in trouble anyway on other fronts, and so it is probably also reasonable to think of as a first-line defense against this kind of issue.
      We think of the simplicity prior as choosing “physics” (which we expect to have low description complexity but possibly high computational complexity) and the easiness prior as choosing “bridge rules” (which we expect to have low computational complexity but possibly high description complexity).
      This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with “giant” universes that do all the possible computations you would want, and then using the “free” complexity in the bridge rules to pick which of the computations you actually wanted. I am not sure if the DFA proposal gets around this kind of problem though it sounds like it would be pretty similar.
      - Vanessa Kosoy Apr 6, 2021, 9:11 PM
        LW: 2 AF: 1
        AF Parent
        
        I’m kind of scared of this approach because I feel unless you really nail everything there is going to be a gap that an attacker can exploit.
        
        I think that not every gap is exploitable. For most types of biases in the prior, it would only promote simulation hypotheses with baseline universes conformant to this bias, and attackers who evolved in such universes will also tend to share this bias, so they will target universes conformant to this bias and that would make them less competitive with the true hypothesis. In other words, most types of bias affect both $ϵ$ and $δ$ in a similar way.
        
        More generally, I guess I’m more optimistic than you about solving all such philosophical liabilities.
        
        I think of this in contrast with my approach based on epistemic competitiveness approach, where the idea is not necessarily to identify these considerations in advance, but to be epistemically competitive with an attacker (inside one of your hypotheses) who has noticed an improvement over your prior.
        
        I don’t understand the proposal. Is there a link I should read?
        
        This is very similar to what I first thought about when going down this line. My instantiation runs into trouble with “giant” universes that do all the possible computations you would want, and then using the “free” complexity in the bridge rules to pick which of the computations you actually wanted.
        
        So, you can let your physics be a dovetailing of all possible programs, and delegate to the bridge rule the task of filtering the outputs of only one program. But the bridge rule is not “free complexity” because it’s not coming from a simplicity prior at all. For a program of length $n$ , you need a particular DFA of size $Ω (n)$ . However, the actual DFA is of expected size $m$ with $m ≫ n$ . The probability of having the DFA you need embedded in that is something like $\frac{m!}{(m - n)!} m^{- 2 n} \approx m^{- n} ≪ 2^{- n}$ . So moving everything to the bridge makes a much less likely hypothesis.
  - Vanessa Kosoy Mar 20, 2021, 6:43 PM
    LW: 2 AF: 1
    AF Parent
    
    Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we’d have trouble.
    
    I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don’t. ~~In the latter case, $\frac{ϵ}{δ}$ shouldn’t be large.~~ In the former case, it means that we are overwhelming likely to actually be inside a malign simulation. But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)
    
    [EDIT: I was wrong, see this.]
    
    It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all.
    
    Probably efficient algorithms are not running literally all hypotheses, but, they can probably consider multiple plausible hypotheses. In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it’s attacking). Currently I can only speculate about neural networks, but I do hope we’ll have competitive algorithms amenable to theoretical analysis, whether they are neural networks or not.
    
    As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here)
    
    I think that the problem you describe in the linked post can be delegated to the AI. That is, instead of controlling trillions of robots via counterfactual oversight, we will start with just one AI project that will research how to organize the world. This project would top any solution we can come up with ourselves.
    - paulfchristiano Mar 20, 2021, 10:21 PM
      LW: 5 AF: 5
      AF Parent
      I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don’t. In the latter case, $\frac{ϵ}{δ}$ shouldn’t be large. In the former case, it means that we are overwhelming likely to actually be inside a malign simulation.
      It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn’t mean that I think reality probably works that way. So I don’t see how to salvage this kind of argument.
      But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)
      It seems to me like this requires a very strong match between the priors we write down and our real priors. I’m kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously “wrong” universal prior).
      In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it’s attacking).
      Do we have any idea how to write down such an algorithm though? Even granting that the malign hypothesis does so it’s not clear how we would (short of being fully epistemically competitive); but moreover it’s not clear to me the malign hypothesis faces a similar version of this problem since it’s just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them, and beyond that it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.
      - Vanessa Kosoy Mar 22, 2021, 12:04 AM
        LW: 3 AF: 2
        AF Parent
        
        It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn’t mean that I think reality probably works that way. So I don’t see how to salvage this kind of argument.
        
        I think it works differently. What you should get is an infra-Bayesian hypothesis which models only those parts of reality that can be modeled within the given computing resources. More generally, if you don’t endorse the predictions of the prediction algorithm than either you are wrong or you should use a different prediction algorithm.
        
        How the can the laws of physics be extra-compressible within the context of a simulation hypothesis? More compression means more explanatory power. I think that is must look something like, we can use the simulation hypothesis to predict the values of some of the physical constants. But, it would require a very unlikely coincidence for physical constants to have such values unless we are actually in a simulation.
        
        It seems to me like this requires a very strong match between the priors we write down and our real priors. I’m kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously “wrong” universal prior).
        
        I agree that we won’t have a perfect match but I think we can get a “good enough” match (similarly to how any two UTMs that are not too crazy give similar Solomonoff measures.) I think that infra-Bayesianism solves a lot of philosophical confusions, including anthropics and logical uncertainty, although some of the details still need to be worked out. (But, I’m not sure what specifically do you mean by “logical facts they observe during evolution”?) Ofc this doesn’t mean I am already able to fully specify the correct infra-prior: I think that would take us most of the way to AGI.
        
        Do we have any idea how to write down such an algorithm though?
        
        I have all sorts of ideas, but still nowhere near the solution ofc. We can do deep learning while randomizing initial conditions and/or adding some noise to gradient descent (e.g. simulated annealing), producing a population of networks that progresses in an evolutionary way. We can, for each prediction, train a model that produces the opposite prediction and compare it to the default model in terms of convergence time and/or weight magnitudes. We can search for the algorithm using meta-learning. We can do variational Bayes with a “multi-modal” model space: mixtures of some “base” type of model. We can do progressive refinement of infra-Bayesian hypotheses, s.t. the plausible hypotheses at any given moment are the leaves of some tree.
        
        moreover it’s not clear to me the malign hypothesis faces a similar version of this problem since it’s just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them
        
        Well, we also don’t have to find all of them: we just have to make sure we don’t miss the true one. So, we need some kind of transitivity: if we find a hypothesis which itself finds another hypothesis (in some sense) then we also find the other hypothesis. I don’t know how to prove such a principle, but it doesn’t seem implausible that we can.
        
        it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.
        
        Why do you think “reasoning deductively” implies there is no simple algorithm? In fact, I think infra-Bayesian logic might be just the thing to combine deductive and inductive reasoning.
- michaelcohen Feb 26, 2021, 3:15 PM
  LW: 4 AF: 3
  AF Parent
  This is very nice and short!
  And to state what you left implicit:
  If $p_{0} > p_{1} + ε$ , then in the setting with no malign hypotheses (which you assume to be safe), 0 is definitely the output, since the malign models can only shift the outcome by $ε$ , so we assume it is safe to output 0. And likewise with outputting 1.
  I’m pretty sure removing those is mostly just a technical complication
  One general worry I have about assuming that the deterministic case extends easily to the stochastic case is that a sequence of probabilities that tends to 0 can still have an infinite sum, which is not true when probabilities must $\in {0, 1}$ , and this sometimes causes trouble. I’m not sure this would raise any issues here—just registering a slightly differing intuition.