At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic.

From Safety Engineering for Artificial General Intelligence:

An AI built in the Artificial General Intelligence paradigm, in which the design is engineered de novo, has the advantage over humans with respect to transparency of disposition, since it is able to display its source code, which can then be reviewed for trustworthiness (Salamon, Rayhawk, and Kramár 2010; Sotala 2012). Indeed, with an improved intelligence, it might find a way to formally prove its benevolence. If weak early AIs are incentivized to adopt verifiably or even provably benevolent dispositions, these can be continually verified or proved and thus retained, even as the AIs gain in intelligence and eventually reach the point where they have the power to renege without retaliation (Hall 2007a).
Also, from section 2 of Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda:

When constructing intelligent systems which learn and interact with all the complexities of reality, it is not sufficient to verify that the algorithm behaves well in test settings. Additional work is necessary to verify that the system will continue working as intended in application. This is especially true of systems possessing general intelligence at or above the human level: superintelligent machines might find strategies and execute plans beyond both the experience and imagination of the programmers, making the clever oscillator of Bird and Layzell look trite. At the same time, unpredictable behavior from smarter-than-human systems could cause catastrophic damage, if they are not aligned with human interests (Yudkowsky 2008).

Because the stakes are so high, testing combined with a gut-level intuition that the system will continue to work outside the test environment is insufficient, even if the testing is extensive. It is important to also have a formal understanding of precisely why the system is expected to behave well in application.

What constitutes a formal understanding? It seems essential to us to have both (1) an understanding of precisely what problem the system is intended to solve; and (2) an understanding of precisely why this practical system is expected to solve that abstract problem. The latter must wait for the development of practical smarter-than-human systems, but the former is a theoretical research problem that we can already examine.
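To make the agenda's two requirements concrete by analogy, here is a minimal toy sketch of my own (nothing like it appears in the paper; the names meets_spec and practical_sort are hypothetical). The specification plays the role of (1), a precise statement of the problem to be solved, and the exhaustive check plays the role of (2), a reason to expect a particular implementation to solve it; for a real system that second step would have to be an actual argument or proof rather than enumeration.

```python
# Toy analogy only: separating a precise problem specification from a
# check that a concrete system actually meets it.
from collections import Counter
from itertools import product

def meets_spec(xs, ys):
    """(1) The problem, stated precisely: ys solves 'sort xs' iff ys is a
    rearrangement of xs and is non-decreasing."""
    return Counter(xs) == Counter(ys) and all(a <= b for a, b in zip(ys, ys[1:]))

def practical_sort(xs):
    """The 'practical system': a concrete algorithm we want a precise
    reason to trust beyond its behavior on test cases."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

# (2) Why we expect the system to solve that problem: here, brute-force
# checking over a small bounded domain stands in for a proof.
for n in range(5):
    for xs in product(range(3), repeat=n):
        assert meets_spec(list(xs), practical_sort(list(xs)))
```

The point of the toy is just the division of labor: the specification is meaningful independently of any particular implementation, which is why the agenda can treat (1) as a research problem to examine now, before (2) is possible.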
I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has lagged far behind, although DeepMind and a few others are still trying.

MIRIx events are funded by MIRI, but we don’t decide the topics or anything. I haven’t taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don’t see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi and the idea of a “safety-story” in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia for better attempts to characterize what MIRI is looking for.

ETA: From the end of the latter dialogue,
In point of fact, the real reason the author is listing out this methodology is that he’s currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won’t want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access a result inside subspace Q with properties P, and I believe a space with properties P does not include any weight combinations that figure out how to kill the programmer.”
Though this itself is not really a reduced statement and still has too much goal-laden language in it.
Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)
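As a toy illustration of the flavor of claim being gestured at, here is a minimal sketch of my own (not the dialogue's or MIRI's method; the function output_bounds, the box Q, and the threshold are all hypothetical). It uses interval arithmetic to check that a property P holds for every weight combination in a region Q of a two-weight linear "policy", rather than for any single trained model.

```python
# Toy sketch: checking a property P over an entire region Q of weight
# space using interval arithmetic, instead of testing individual models.

def interval_mul(a, b):
    """Product of two intervals [a_lo, a_hi] * [b_lo, b_hi]."""
    (al, ah), (bl, bh) = a, b
    corners = [al * bl, al * bh, ah * bl, ah * bh]
    return (min(corners), max(corners))

def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def output_bounds(weight_box, input_box):
    """Bound score(w, x) = w1*x1 + w2*x2 over all w in weight_box and
    all x in input_box."""
    (w1, w2), (x1, x2) = weight_box, input_box
    return interval_add(interval_mul(w1, x1), interval_mul(w2, x2))

# "Subspace Q": the weight region training is assumed to stay inside.
Q = ((-0.5, 0.5), (-0.5, 0.5))
# All inputs the deployed system might see.
X = ((0.0, 1.0), (0.0, 1.0))

lo, hi = output_bounds(Q, X)
# "Property P": the score never crosses the threshold that would trigger
# the forbidden action, for any weight setting in Q and any input in X.
THRESHOLD = 2.0
assert hi < THRESHOLD, "property P not established for all of Q"
```

Note that this only covers the second half of the quoted belief (that the region Q has a property which rules out the bad weight settings); the harder part, arguing that gradient descent on loss function L never leaves Q, is exactly the step the quote is trying to learn how to write down.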
See also the discussion of deconfusion research in https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/ and https://www.lesswrong.com/posts/Gg9a4y8reWKtLe3Tn/the-rocket-alignment-problem , and the sketch of ‘why this looks like a hard problem in general’ in https://www.lesswrong.com/posts/zEvqFtT4AtTztfYC4/optimization-amplifies and https://arbital.com/p/aligning_adds_time/ .