MIRIx events are funded by MIRI, but we don’t decide the topics or anything. I haven’t taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don’t see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi and the idea of a “safety-story” in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia for better attempts to characterize what MIRI is looking for.

ETA: From the end of the latter dialogue:

> In point of fact, the real reason the author is listing out this methodology is that he’s currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won’t want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access a result inside subspace Q with properties P, and I believe a space with properties P does not include any weight combinations that figure out how to kill the programmer.”
>
> Though this itself is not really a reduced statement and still has too much goal-laden language in it.

Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)
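To gesture at the logical shape of the quoted statement (purely an illustrative sketch, not anything MIRI has endorsed; the symbols $\mathrm{Reach}$, $Q$, $P$, and $\mathrm{Unsafe}$ are hypothetical placeholders introduced here, not from the original), it decomposes into roughly three claims:

```latex
% Illustrative only: placeholder predicates, not a real specification.
% (1) Gradient descent on loss L, from initialization \theta_0,
%     only reaches weight vectors inside the subspace Q:
\forall \theta.\ \mathrm{Reach}(\mathrm{GD}_L, \theta_0, \theta) \Rightarrow \theta \in Q
% (2) The subspace Q has the structural properties P:
P(Q)
% (3) Any set of weight vectors with properties P excludes the
%     dangerous weight combinations:
\forall S.\ P(S) \Rightarrow S \cap \mathrm{Unsafe} = \varnothing
```

Chaining (1)–(3) gives the safety claim $\mathrm{Reach}(\mathrm{GD}_L, \theta_0, \theta) \Rightarrow \theta \notin \mathrm{Unsafe}$; the hard part, on this framing, is the technical insight needed to state and justify (1) and (3), whether or not a machine ever checks the proof.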
See also the discussion of deconfusion research in https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/ and https://www.lesswrong.com/posts/Gg9a4y8reWKtLe3Tn/the-rocket-alignment-problem, and the sketch of ‘why this looks like a hard problem in general’ in https://www.lesswrong.com/posts/zEvqFtT4AtTztfYC4/optimization-amplifies and https://arbital.com/p/aligning_adds_time/.