While I share a large degree of pessimism for similar reasons, I am somewhat more optimistic overall.
Most of this comes from generic uncertainty and epistemic humility; I’m a big fan of the inside view, but it’s worth noting that this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.
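To make that intuition concrete with a toy calculation (made-up numbers, and assuming independence, which is itself questionable): if each of the 42 claims held with probability 0.95, the probability that all of them hold would be 0.95^42 ≈ 0.12.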
However, there are some specific places I can point to where I think you are overconfident, or at least are not providing good reasons for such a high level of confidence (and to my knowledge nobody has). I'll focus on the two disagreements that I think are closest to my true disagreements.
1) I think safe pivotal “weak” acts likely do exist. It seems likely that we can access vastly superhuman capabilities without inducing huge x-risk, using a variety of capability control methods. If we could build something that was only N<<infinity times smarter than us, then intuitively it seems unlikely that it could reverse-engineer the details of the outside world, or of other AI systems’ source code (cf. 35), that it would need in order to break out of the box or start cooperating with its AI overseers. If I am right, then the reason nobody has come up with such an act is that they aren’t smart enough (in some—possibly quite narrow—sense of smart); that’s why we need the superhuman AI! Of course, it could also be that someone has such an idea but isn’t sharing it publicly / with Eliezer.
2) I am not convinced that any superhuman AGI we are likely to have the technical means to build in the near future is going to be highly consequentialist (although I grant this seems likely). Humans aren’t actually that consequentialist, current AI systems even less so, and it seems entirely plausible that you don’t automatically get highly consequentialist systems no matter what you are doing or how you are training them. If you train something to follow commands in a bounded way using something like supervised learning (see the sketch below), maybe you actually end up with something reasonably close to that. My main reason for expecting consequentialist systems at superhuman-but-not-superintelligent-level AGI is that people will build them that way because of competitive pressures, not because systems that people are trying to make non-consequentialist end up being consequentialist anyway.
These two points are related: if we think consequentialism is unavoidable (re: 2), then we should be more skeptical that we can safely harness superhuman capabilities at all (re: 1), although we could still hope to use capability control and incentive schemes to get a superhuman-but-not-superintelligent consequentialist AGI to devise and help execute “weak” pivotal acts.
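To make the “bounded command-following via supervised learning” picture from point 2 concrete, here is a minimal sketch (all names and dimensions hypothetical, and a toy like this of course says nothing by itself about what happens at scale): the training signal is purely per-example imitation of demonstrated bounded behavior, with no term that rewards achieving downstream consequences.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommandFollowingPolicy(nn.Module):
    """Toy policy mapping (observation, command) to action logits."""
    def __init__(self, obs_dim: int, cmd_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cmd_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, obs: torch.Tensor, cmd: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, cmd], dim=-1))

def supervised_step(policy, optimizer, obs, cmd, demo_action):
    """One behavioral-cloning step on (observation, command, demonstrated action)
    triples; the loss only scores next-action prediction, not long-run outcomes."""
    logits = policy(obs, cmd)
    loss = F.cross_entropy(logits, demo_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```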
3) Maybe one more point worth mentioning is the “alien concepts” bit: I also suspect AIs will have somewhat alien concepts and thus generalize in weird ways, and adversarial examples and other robustness failures are evidence in favor of this (illustrated below). But we are also seeing that scaling makes models more robust, so it seems plausible that AGI will end up using concepts similar enough to ours that generalizing in the ways we intend/expect comes naturally to it.
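As a concrete example of the kind of robustness failure I have in mind, here is a standard FGSM-style sketch (assuming some differentiable image classifier `model`; the details are illustrative only): a perturbation too small for a human to notice can flip the model’s prediction, which suggests the model is keying on features humans don’t use.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.01):
    """Fast Gradient Sign Method: take one step of size eps in the direction
    that most increases the classification loss for the true label y."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid range
```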
---------------------------------------------------------------------

The rest of my post is sort of just picking particular places where I think the argumentation is weak, in order to illustrate why I currently think you are, on net, overconfident.
7. The reason why nobody in this community has successfully named a ‘pivotal weak act’ where you do something weak enough with an AGI to be passively safe, but powerful enough to prevent any other AGI from destroying the world a year later—and yet also we can’t just go do that right now and need to wait on AI—is that nothing like that exists.
This contains a dubious implicit assumption, namely that we cannot build safe superhuman intelligence, even if it is only slightly superhuman, or superhuman only in various narrow-but-strategically-relevant areas.
19. More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment—to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward.
This is basically what Cooperative Inverse Reinforcement Learning (CIRL) aims to do. We can train for this sort of thing and study such training methods empirically in synthetic settings.
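For instance, a toy synthetic version of the CIRL-style setup might look like the following (entirely hypothetical details, and much simpler than the full assistance-game formalism): the “robot” maintains a posterior over a latent reward parameter it cannot observe directly, and updates it from the behavior of a noisily-rational “human” who does know it.

```python
import numpy as np

# Candidate values of the latent reward parameter theta (known to the human,
# unknown to the robot), and the robot's uniform prior over them.
thetas = np.array([0.0, 1.0, 2.0])
posterior = np.ones(len(thetas)) / len(thetas)

def human_action_probs(theta, beta=2.0):
    """Boltzmann-rational human: actions with higher utility under theta are
    exponentially more likely. Toy utility: closeness of action index to theta."""
    utilities = -np.abs(np.arange(3) - theta)
    probs = np.exp(beta * utilities)
    return probs / probs.sum()

def update_posterior(posterior, observed_action):
    """Bayesian update of the robot's belief about theta from one human action."""
    likelihoods = np.array([human_action_probs(t)[observed_action] for t in thetas])
    posterior = posterior * likelihoods
    return posterior / posterior.sum()

posterior = update_posterior(posterior, observed_action=2)
# Probability mass shifts toward values of theta consistent with the human's choice.
```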
23. Corrigibility is anti-natural to consequentialist reasoning
Maybe I missed it, but I didn’t see any argument for why we end up with consequentialist reasoning.
30. [...] There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.
It seems likely that such things exist, by analogy with complexity theory: checking a proposed solution is generally easier than producing one.
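As a trivial illustration of that asymmetry (toy problem, hypothetical code): verifying a proposed subset-sum certificate is a quick polynomial-time check, even though finding such a certificate is NP-hard in general.

```python
def verify_subset_sum(numbers, target, certificate):
    """Check that `certificate` is a sub-multiset of `numbers` summing to `target`.
    Verification is cheap even when finding a certificate is hard."""
    remaining = list(numbers)
    for x in certificate:
        if x not in remaining:
            return False
        remaining.remove(x)
    return sum(certificate) == target

print(verify_subset_sum([3, 7, 12, 5], 15, [3, 12]))  # True
```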
36. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.
I figured it was worth noting that this part doesn’t explicitly say that relatively weak AGIs can’t perform pivotal acts.
this can (roughly) be read as a set of 42 statements that need to be true for us to in fact be doomed, and statistically speaking it seems unlikely that all of these statements are true.
I don’t think these statements all need to be true in order for p(doom) to be high, and I also don’t think they’re independent. Indeed, they seem more disjunctive than conjunctive to me; there are many cases where any one of the claims being true increases risk substantially, even if many others are false.
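As a toy illustration with made-up numbers: under a purely conjunctive reading, P(doom) is the product of the individual probabilities and shrinks quickly, but under a disjunctive reading P(doom) ≈ 1 − (1 − p_1)(1 − p_2)⋯(1 − p_n), so even ten independent claims each at 20% would give roughly 1 − 0.8^10 ≈ 0.89.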
I basically agree.
I am arguing against extreme levels of pessimism (~>99% doom).