Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I’ve found to a statement of the empirical claim is from Eliezer’s 2012 Reply to Holden on ‘Tool AI’:
I’ve repeatedly said that the idea behind proving determinism of self-modification isn’t that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed.
Paul Christiano argued against this at length in Stable self-improvement as an AI safety problem, concluding as follows:

But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI.
Note that the above talks about “stable self-modification” rather than ‘running this AI system will be beneficial’, and the former is a much narrower and easier-to-formalize concept than the latter. I haven’t really found a serious proposal to formalize and prove the latter kind of statement.
IMO, formalizing ‘running this AI system will be beneficial’ is itself an informal and error-prone process, where the only way to gain confidence in the formalization’s correctness is for many competent researchers to try, and fail, to find flaws in it. Instead of doing that, one could gain confidence in the AI’s safety by directly trying to find flaws (again informally) in the AI design itself, and by proving, or demonstrating via empirical testing, narrower safety-relevant statements like “stable self-modification”; given enough resources, this might reach a similar level of confidence. (So the empirical claim doesn’t seem to hold as written.)
The former still has the advantage that the thing which might be flawed is much smaller (just the formalization of ‘running this AI system will be beneficial’, rather than the whole AI design), but it has the disadvantage that finding a proof might be very costly, both in research effort and in additional constraints on the AI design (needed to make a proof feasible) that could make the AI less competitive. Overall, it seems too early to reach a strong conclusion one way or the other as to which approach is more advisable.
At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic.
From Safety Engineering for Artificial General Intelligence:

An AI built in the Artificial General Intelligence paradigm, in which the design is engineered de novo, has the advantage over humans with respect to transparency of disposition, since it is able to display its source code, which can then be reviewed for trustworthiness (Salamon, Rayhawk, and Kramár 2010; Sotala 2012). Indeed, with an improved intelligence, it might find a way to formally prove its benevolence. If weak early AIs are incentivized to adopt verifiably or even provably benevolent dispositions, these can be continually verified or proved and thus retained, even as the AIs gain in intelligence and eventually reach the point where they have the power to renege without retaliation (Hall 2007a).
Also, from section 2 of Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda:

When constructing intelligent systems which learn and interact with all the complexities of reality, it is not sufficient to verify that the algorithm behaves well in test settings. Additional work is necessary to verify that the system will continue working as intended in application. This is especially true of systems possessing general intelligence at or above the human level: superintelligent machines might find strategies and execute plans beyond both the experience and imagination of the programmers, making the clever oscillator of Bird and Layzell look trite. At the same time, unpredictable behavior from smarter-than-human systems could cause catastrophic damage, if they are not aligned with human interests (Yudkowsky 2008).

Because the stakes are so high, testing combined with a gut-level intuition that the system will continue to work outside the test environment is insufficient, even if the testing is extensive. It is important to also have a formal understanding of precisely why the system is expected to behave well in application.

What constitutes a formal understanding? It seems essential to us to have both (1) an understanding of precisely what problem the system is intended to solve; and (2) an understanding of precisely why this practical system is expected to solve that abstract problem. The latter must wait for the development of practical smarter-than-human systems, but the former is a theoretical research problem that we can already examine.
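To make the (1)/(2) split concrete, here is a toy schematic (my notation, not the agenda’s): (1) is a precise specification of the problem, and (2) is a claim that the deployed system actually meets that specification.

$$
\text{(1) Specification: } \varphi(x, y) \quad \text{(“output } y \text{ is an acceptable solution to input } x\text{”)}
$$
$$
\text{(2) Correctness claim: } \forall x.\ \varphi\bigl(x, S(x)\bigr) \quad \text{for the practical system } S
$$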
I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has lagged far behind, although DeepMind and a few others are still trying.
MIRIx events are funded by MIRI, but we don’t decide the topics or anything. I haven’t taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don’t see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi and the idea of a “safety-story” in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia for better attempts to characterize what MIRI is looking for.

ETA: From the end of the latter dialogue,

In point of fact, the real reason the author is listing out this methodology is that he’s currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won’t want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access a result inside subspace Q with properties P, and I believe a space with properties P does not include any weight combinations that figure out how to kill the programmer.”
Though this itself is not really a reduced statement and still has too much goal-laden language in it.
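To make the shape of that statement concrete, here is a purely illustrative schematic of the intended inference; GD, L, Q, and P come from the quote, while Dangerous is my own placeholder for “figures out how to kill the programmers”:

$$
\begin{aligned}
&\theta^{*} = \mathrm{GD}_{L}(\theta_{0}) \in Q &&\text{(gradient descent on } L \text{ only reaches the subspace } Q\text{)}\\
&\forall \theta \in Q:\ P(\theta) &&\text{(every weight combination in } Q \text{ has properties } P\text{)}\\
&\forall \theta:\ P(\theta) \Rightarrow \neg \mathrm{Dangerous}(\theta) &&\text{(properties } P \text{ rule out the dangerous behavior)}\\
&\text{hence } \neg \mathrm{Dangerous}(\theta^{*})
\end{aligned}
$$

As the caveat above notes, a predicate like Dangerous is still goal-laden, so this is only the outer shape of the desired statement rather than a reduced version of it.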
Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)

See also the discussion of deconfusion research in https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/ and https://www.lesswrong.com/posts/Gg9a4y8reWKtLe3Tn/the-rocket-alignment-problem , and the sketch of ‘why this looks like a hard problem in general’ in https://www.lesswrong.com/posts/zEvqFtT4AtTztfYC4/optimization-amplifies and https://arbital.com/p/aligning_adds_time/ .
Not sure who you have in mind as people believing this
I don’t have particular people in mind, it’s more of a general “vibe” I get from talking to people. In the past, when I’ve stated the empirical claim, some people agreed with it, but upon further discussion it turned out they actually agreed with the normative claim. Hence my first question, which was to ask whether or not people believe the empirical claim.