Vanessa Kosoy comments on Formal Solution to the Inner Alignment Problem

Vanessa Kosoy Mar 20, 2021, 6:43 PM
LW: 2 AF: 1

Is $\frac{ϵ}{δ}$ bounded? I assign significant probability to it being $2^{100}$ or more, as mentioned in the other thread between me and Michael Cohen, in which case we’d have trouble.

I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don’t. ~~In the latter case, $\frac{ϵ}{δ}$ shouldn’t be large.~~ In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation. But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)

[EDIT: I was wrong, see this.]

It feels like searching for a neural network is analogous to searching for a MAP estimate, and that more generally efficient algorithms are likely to just run one hypothesis most of the time rather than running them all.

Probably efficient algorithms are not running literally all hypotheses, but, they can probably consider multiple plausible hypotheses. In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it’s attacking). Currently I can only speculate about neural networks, but I do hope we’ll have competitive algorithms amenable to theoretical analysis, whether they are neural networks or not.

As you mention, is it safe to wait and defer or are we likely to have a correlated failure in which all the aligned systems block simultaneously? (e.g. as described here)

I think that the problem you describe in the linked post can be delegated to the AI. That is, instead of controlling trillions of robots via counterfactual oversight, we will start with just one AI project that will research how to organize the world. This project would top any solution we can come up with ourselves.
- paulfchristiano Mar 20, 2021, 10:21 PM
  LW: 5 AF: 5
  I think that there are roughly two possibilities: either the laws of our universe happen to be strongly compressible when packed into a malign simulation hypothesis, or they don’t. In the latter case, $\frac{ϵ}{δ}$ shouldn’t be large. In the former case, it means that we are overwhelmingly likely to actually be inside a malign simulation.
  It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn’t mean that I think reality probably works that way. So I don’t see how to salvage this kind of argument.
  But, then AI risk is the least of our troubles. (In particular, because the simulation will probably be turned off once the attack-relevant part is over.)
  It seems to me like this requires a very strong match between the priors we write down and our real priors. I’m kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously “wrong” universal prior).
  In particular, the malign hypothesis itself is an efficient algorithm and it is somehow aware of the two different hypotheses (itself and the universe it’s attacking).
  Do we have any idea how to write down such an algorithm though? Even granting that the malign hypothesis does so, it’s not clear how we would (short of being fully epistemically competitive); but moreover it’s not clear to me the malign hypothesis faces a similar version of this problem since it’s just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them, and beyond that it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.
  - Vanessa Kosoy Mar 22, 2021, 12:04 AM
    LW: 3 AF: 2
    
    It seems like the simplest algorithm that makes good predictions and runs on your computer is going to involve e.g. reasoning about what aspects of reality are important to making good predictions and then attending to those. But that doesn’t mean that I think reality probably works that way. So I don’t see how to salvage this kind of argument.
    
    I think it works differently. What you should get is an infra-Bayesian hypothesis which models only those parts of reality that can be modeled within the given computing resources. More generally, if you don’t endorse the predictions of the prediction algorithm then either you are wrong or you should use a different prediction algorithm.
    
    How can the laws of physics be extra-compressible within the context of a simulation hypothesis? More compression means more explanatory power. I think that it must look something like this: we can use the simulation hypothesis to predict the values of some of the physical constants. But, it would require a very unlikely coincidence for physical constants to have such values unless we are actually in a simulation.
    
    It seems to me like this requires a very strong match between the priors we write down and our real priors. I’m kind of skeptical about that a priori, but then in particular we can see lots of ways in which attackers will be exploiting failures in the prior we write down (e.g. failure to update on logical facts they observe during evolution, failure to make the proper anthropic update, and our lack of philosophical sophistication meaning that we write down some obviously “wrong” universal prior).
    
    I agree that we won’t have a perfect match but I think we can get a “good enough” match (similarly to how any two UTMs that are not too crazy give similar Solomonoff measures). I think that infra-Bayesianism solves a lot of philosophical confusions, including anthropics and logical uncertainty, although some of the details still need to be worked out. (But, I’m not sure what specifically you mean by “logical facts they observe during evolution”?) Ofc this doesn’t mean I am already able to fully specify the correct infra-prior: I think that would take us most of the way to AGI.
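
    For reference, the UTM analogy rests on the standard invariance property of Solomonoff measures. The notation here is an assumption of this note, not taken from the thread: $M_U$ and $M_V$ are the universal semimeasures induced by universal machines $U$ and $V$. There is a constant $c_{UV}$, depending only on the two machines, such that

    $$M_U(x) \ge 2^{-c_{UV}} \, M_V(x) \quad \text{for all finite strings } x,$$

    and symmetrically with $U$ and $V$ swapped. Read this way, “not too crazy” amounts to assuming that $c_{UV}$ and $c_{VU}$ are small; the property by itself puts no useful bound on them.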
    
    Do we have any idea how to write down such an algorithm though?
    
    I have all sorts of ideas, but still nowhere near the solution ofc. We can do deep learning while randomizing initial conditions and/or adding some noise to gradient descent (e.g. simulated annealing), producing a population of networks that progresses in an evolutionary way. We can, for each prediction, train a model that produces the opposite prediction and compare it to the default model in terms of convergence time and/or weight magnitudes. We can search for the algorithm using meta-learning. We can do variational Bayes with a “multi-modal” model space: mixtures of some “base” type of model. We can do progressive refinement of infra-Bayesian hypotheses, s.t. the plausible hypotheses at any given moment are the leaves of some tree.
    
    moreover it’s not clear to me the malign hypothesis faces a similar version of this problem since it’s just thinking about a small list of hypotheses rather than trying to maintain a broad enough distribution to find all of them
    
    Well, we also don’t have to find all of them: we just have to make sure we don’t miss the true one. So, we need some kind of transitivity: if we find a hypothesis which itself finds another hypothesis (in some sense) then we also find the other hypothesis. I don’t know how to prove such a principle, but it doesn’t seem implausible that we can.
    
    it may just be reasoning deductively about properties of the space of hypotheses rather than using a simple algorithm we can write down.
    
    Why do you think “reasoning deductively” implies there is no simple algorithm? In fact, I think infra-Bayesian logic might be just the thing to combine deductive and inductive reasoning.
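
A minimal sketch of the first idea in the list of candidate approaches above (randomized initial conditions plus noise in gradient descent, giving a population of models). Everything here is a toy assumption made for illustration: the two-parameter linear model, the data, and the names `train_member` and `disagreement` are hypothetical and do not come from the thread. The point is only that population members which agree on the training data can still disagree on a novel input, and that this disagreement is one possible signal for when to defer.

```python
import random

def predict(w, x):
    """Linear hypothesis: w[0]*x[0] + w[1]*x[1]."""
    return w[0] * x[0] + w[1] * x[1]

def train_member(data, seed, steps=500, lr=0.1, noise=0.001):
    """Train one population member by noisy gradient descent from a random start."""
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    for _ in range(steps):
        # Gradient of half the mean squared error over the data set.
        grad = [0.0, 0.0]
        for x, y in data:
            err = predict(w, x) - y
            grad[0] += err * x[0] / len(data)
            grad[1] += err * x[1] / len(data)
        # Gradient step plus Gaussian noise on each weight.
        for i in range(2):
            w[i] += -lr * grad[i] + noise * rng.gauss(0.0, 1.0)
    return w

def disagreement(population, x):
    """Spread (max minus min) of the population's predictions at input x."""
    preds = [predict(w, x) for w in population]
    return max(preds) - min(preds)

# Toy data: y = 2 * x[0], and x[1] is always 0, so the data never constrain w[1].
DATA = [((1.0, 0.0), 2.0), ((2.0, 0.0), 4.0), ((3.0, 0.0), 6.0)]
population = [train_member(DATA, seed) for seed in range(8)]

on_distribution = disagreement(population, (1.0, 0.0))   # x[1] = 0, as in training
off_distribution = disagreement(population, (1.0, 1.0))  # x[1] != 0, never seen

print("disagreement on a training-like input:", on_distribution)
print("disagreement on a novel input:", off_distribution)
```

On the training-like input every member has learned roughly the same function, so the spread is small. On the novel input the members inherit different values of the unconstrained weight from their random starts, so the spread is large. This says nothing about whether such a population would contain the true hypothesis in a realistic setting, which is the open question discussed in the thread.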