Another argument for shorter CEV timelines is that AI itself may help complete the theory of CEV alignment.
I agree with this part. That's why I've been saying "maybe we can do this in a few subjective decades or centuries" rather than "maybe we can do this in a few subjective millennia." 🙂
But I'm mostly imagining AGI helping us get CEV theory faster. Which obviously requires a lot of prior alignment just to make use of the AGI safely, and to trust its outputs.
The idea is to keep ratcheting up alignment so we can safely make use of more capabilities, and then, in at least some cases, to use those new capabilities to further improve and accelerate our next ratcheting-up of alignment.
Along with the traditional powers of computation (calculation, optimization, deduction, etc.), language models, despite their highly uneven output, are giving us a glimpse of what it will be like to have AI contributing even to discussions like this. That day isn't far off at all.
⌠And that makes you feel optimistic about the rush-to-CEV option? âUnaligned AGIs or proto-AGIs generating plausble-sounding arguments about how to do CEVâ is not a scenario that makes me update toward humanity surviving.
there will be great temptations to use tool AGI to carry out interventions that have nothing to do with stopping unsafe AGI...
I share your pessimism about any group that would feel inclined to give in to those temptations, when the entire future light cone is at stake.
The scenario where we narrowly avoid paperclips by the skin of our teeth, and now have a chance to breathe and think things through before taking any major action, is indeed a fragile one in some respects, where there are many ways to rapidly destroy all of the future's value by overextending. (E.g., using more AGI capabilities than you can currently align, or locking in contemporary human values that shouldn't be locked in, or hastily picking the wrong theory or implementation of "how to do moral progress".)
I don't think we necessarily disagree about anything except "how hard is CEV?" It sounds to me like we'd mostly have the same intuitions conditional on "CEV is very hard"; but I take this very much for granted, so I'm freely focusing my attention on "OK, how could we make things go well given that fact?".
I don't think we necessarily disagree about anything except "how hard is CEV?" It sounds to me like we'd mostly have the same intuitions conditional on "CEV is very hard"
I disagree on the plausibility of a stop-the-world scheme. The way things are now is as safe or stable as they will ever be. I think it's a better plan to use the rising tide of AI capabilities to flesh out CEV. In particular, since the details of CEV depend on not-yet-understood details of human cognition, one should think about how to use near-future AI power to extract those details from the available data regarding brains and behavior. But AI can contribute everywhere else in the development of CEV theory and practice too.
But I won't try to change your thinking. In practice, MIRI's work on simpler forms of alignment (and its work on logical induction, decision-theory paradigms, etc.) is surely relevant to a "CEV now" approach too. What I am wondering is: where is the de facto center of gravity for a "CEV now" effort? I've said many times that June Ku's blueprint is the best I've seen, but I don't see anyone working to refine it. And there are other people whose work seems relevant and has a promising rigor, but I'm not sure how it best fits together.
edit: I do see a value to non-CEV alignment work, distinct from figuring out how to stop the world safely: namely, reducing the risk arising from the general use of advanced AI systems. So it is a contribution to AI safety.