a) I believe a weaker version of the empirical claim, namely that the catastrophe is not nearly inevitable, but also not unlikely. That is, I can imagine different worlds in which the probability of the catastrophe differs, and I have uncertainty over which world we are actually in, such that on average the probability is sizable.
b) I think that the argument you gave is sort of correct. We need to augment it as follows: the minimal requirement from the AI is that it effectively blocks all competing dangerous AI projects, without also doing bad things (which is why you can’t just give it the zero utility function). Your counterargument seems weak to me because moving from utility maximizers to other types of AIs just replaces something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (which are still there). I think that whatever your AI is, given that it satisfies the minimal requirement, some kind of utility-maximization-like behavior is likely to arise.
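To make “utility-maximization-like behavior” a bit more concrete (my formalization, not part of the original argument): the worry is that any AI satisfying the minimal requirement will act approximately as if it were selecting a policy by

$$\pi^* \in \operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{h \sim \pi}\big[U(h)\big]$$

for *some* utility function $U$ over outcome histories $h$, even if no such $U$ appears explicitly anywhere in its design.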
Coming at it from a different angle, complicated systems often fail in unexpected ways. The way people solve this problem in practice is by a combination of mathematical analysis and empirical research. I don’t think we have many examples of complicated systems where all failures were avoided by informal reasoning without either empirical or mathematical backing. In the case of superintelligent AI, empirical research alone is insufficient because, without mathematical models, we don’t know how to extrapolate empirical results from current AIs to superintelligent AIs, and when superintelligent algorithms are already here it will probably be too late.
c) I think what we can (and should) realistically aim for is having a mathematical theory of AI, and a mathematical model of our particular AI, such that within this model we can prove the AI is safe. This model will have some assumptions and parameters that will need to be verified/measured in other ways, through some combination of (i) experiments with AIs/algorithms, (ii) learning from neuroscience, (iii) learning from biological evolution, and (iv) leveraging our knowledge of physics. Then there is also the question of how precise the correspondence is between the model and the actual code (and hardware). Ideally, we want formal verification, in which we can check that a certain theorem holds for the actual code we are running. Weaker levels of correspondence might still be sufficient, but that would be Plan B.
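As a toy illustration of the strongest level of correspondence (my example, not a claim about how an actual AI would be verified): in a proof assistant such as Lean, the theorem is machine-checked against the very code that runs.

```lean
-- Toy illustration only: `clip` is the code we actually run, and `clip_le`
-- is a machine-checked theorem about that exact code.
def clip (lo hi x : Nat) : Nat :=
  max lo (min x hi)

-- Safety-style guarantee: given lo ≤ hi, the output never exceeds hi.
theorem clip_le (lo hi x : Nat) (h : lo ≤ hi) : clip lo hi x ≤ hi := by
  unfold clip
  omega
```

Verifying a real AI system would of course be enormously harder than this; the point is only the shape of the guarantee: a theorem about the executable artifact itself, not about an informal description of it.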
Also, the proof can rely on mathematical conjectures in which we have high confidence, such as P≠NP. Of course, the evidence for such conjectures is (in some sense) empirical, but it is important that the conjecture is at least a rigorous, well-defined mathematical statement.
I agree with a). c) seems to me very optimistic, but that’s mostly an intuition; I don’t have a strong argument against it (and I wouldn’t discourage people who are enthusiastic about it from working on it).
The argument in b) makes sense; I think the part that I disagree with is:
moving from utility maximizers to other types of AIs just replaces something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (which are still there).
The counterargument is “current AI systems don’t look like long-term planners”, but of course it is possible to respond to that with “AGI will be very different from current AI systems”, and then I have nothing to say beyond “I think AGI will be like current AI systems”.
Well, any system that satisfies the Minimal Requirement is doing long-term planning on some level. For example, if your AI is approval-directed, it still needs to learn how to make good plans that will be approved. Once your system has a superhuman capability of producing plans somewhere inside, you should worry about that capability being applied in the wrong direction (in particular due to mesa-optimization / daemons). Also, even without long-term planning, extreme optimization is dangerous (for example, an approval-directed AI might create some kind of memetic supervirus).
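To spell out where the planning capability lives, here is a hypothetical sketch (`predict_approval` and `generate_candidate_plans` are made-up stand-ins, not anyone’s actual design): the agent only optimizes predicted approval, yet a plan-search component still sits inside it, and that component is exactly what could be misdirected.

```python
def predict_approval(plan: str) -> float:
    """Stand-in for a learned predictor of human approval (hypothetical)."""
    return float(len(plan))  # placeholder scoring, just to make this runnable

def generate_candidate_plans(goal: str, n: int = 100) -> list[str]:
    """Stand-in for a learned plan generator (hypothetical)."""
    return [f"candidate plan {i} for: {goal}" for i in range(n)]

def approval_directed_step(goal: str) -> str:
    # No explicit utility function over world-states appears anywhere...
    candidates = generate_candidate_plans(goal)
    # ...but this argmax over plans *is* a planning capability, and it is
    # the part that mesa-optimization / daemon worries apply to directly.
    return max(candidates, key=predict_approval)

print(approval_directed_step("summarize this paper"))
```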
But I agree that these arguments are not enough to be confident of the strong empirical claim.