Not sure who you have in mind as people believing this, but after searching both LW and Arbital, the closest thing I’ve found to a statement of the empirical claim is from Eliezer’s 2012 Reply to Holden on ‘Tool AI’:
I’ve repeatedly said that the idea behind proving determinism of self-modification isn’t that this guarantees safety, but that if you prove the self-modification stable the AI might work, whereas if you try to get by with no proofs at all, doom is guaranteed.
Paul Christiano argued against this at length in Stable self-improvement as an AI safety problem, concluding as follows:

But I am not yet convinced that stable self-improvement is an especially important problem for AI safety; I think it would be handled correctly by a human-level reasoner as a special case of decision-making under logical uncertainty. This suggests that (1) it will probably be resolved en route to human-level AI, (2) it can probably be “safely” delegated to a human-level AI.
Note that the above talks about “stable self-modification” rather than ‘running this AI system will be beneficial’, and the former is a much narrower and easier-to-formalize concept than the latter. I haven’t really found a serious proposal to formalize and prove the latter kind of statement.
IMO, formalizing ‘running this AI system will be beneficial’ is itself an informal and error-prone process, where the only way to gain confidence in the formalization’s correctness is for many competent researchers to try, and fail, to find flaws in it. Instead of doing that, one could gain confidence in the AI’s safety by directly trying to find flaws (again informally) in the AI design itself, and by proving, or demonstrating via empirical testing, narrower safety-relevant statements like “stable self-modification”; given enough resources, this might reach a similar level of confidence. (So the empirical claim doesn’t seem to hold as written.)
The former still has the advantage that the thing which might be flawed is much smaller (just the formalization of ‘running this AI system will be beneficial’, rather than the whole AI design), but it has the disadvantage that finding a proof might be very costly, both in research effort and in additional constraints on the AI design (needed to make a proof feasible) that could make the AI less competitive. Overall, it seems too early to reach a strong conclusion one way or the other as to which approach is more advisable.
At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic.
From Safety Engineering for Artificial General Intelligence:

An AI built in the Artificial General Intelligence paradigm, in which the design is engineered de novo, has the advantage over humans with respect to transparency of disposition, since it is able to display its source code, which can then be reviewed for trustworthiness (Salamon, Rayhawk, and Kramár 2010; Sotala 2012). Indeed, with an improved intelligence, it might find a way to formally prove its benevolence. If weak early AIs are incentivized to adopt verifiably or even provably benevolent dispositions, these can be continually verified or proved and thus retained, even as the AIs gain in intelligence and eventually reach the point where they have the power to renege without retaliation (Hall 2007a).
Also, from section 2 of Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda:

When constructing intelligent systems which learn and interact with all the complexities of reality, it is not sufficient to verify that the algorithm behaves well in test settings. Additional work is necessary to verify that the system will continue working as intended in application. This is especially true of systems possessing general intelligence at or above the human level: superintelligent machines might find strategies and execute plans beyond both the experience and imagination of the programmers, making the clever oscillator of Bird and Layzell look trite. At the same time, unpredictable behavior from smarter-than-human systems could cause catastrophic damage, if they are not aligned with human interests (Yudkowsky 2008).

Because the stakes are so high, testing combined with a gut-level intuition that the system will continue to work outside the test environment is insufficient, even if the testing is extensive. It is important to also have a formal understanding of precisely why the system is expected to behave well in application.

What constitutes a formal understanding? It seems essential to us to have both (1) an understanding of precisely what problem the system is intended to solve; and (2) an understanding of precisely why this practical system is expected to solve that abstract problem. The latter must wait for the development of practical smarter-than-human systems, but the former is a theoretical research problem that we can already examine.
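To make the (1)/(2) split concrete, here is a toy schematic (my notation, not the agenda’s): (1) is a precise specification of the problem, and (2) is a claim that the deployed system actually meets that specification.

$$
\text{(1) Specification: } \varphi(x, y) \quad \text{(“output } y \text{ is an acceptable solution to input } x\text{”)}
$$
$$
\text{(2) Correctness claim: } \forall x.\ \varphi\bigl(x, S(x)\bigr) \quad \text{for the practical system } S
$$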
I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has lagged far behind, although DeepMind and a few others are still trying.
MIRIx events are funded by MIRI, but we don’t decide the topics or anything. I haven’t taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don’t see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi and the idea of a “safety-story” in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia for better attempts to characterize what MIRI is looking for.

ETA: From the end of the latter dialogue,

In point of fact, the real reason the author is listing out this methodology is that he’s currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won’t want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access a result inside subspace Q with properties P, and I believe a space with properties P does not include any weight combinations that figure out how to kill the programmer.”
Though this itself is not really a reduced statement and still has too much goal-laden language in it.
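To make the shape of that statement concrete, here is a purely illustrative schematic of the intended inference; GD, L, Q, and P come from the quote, while Dangerous is my own placeholder for “figures out how to kill the programmers”:

$$
\begin{aligned}
&\theta^{*} = \mathrm{GD}_{L}(\theta_{0}) \in Q &&\text{(gradient descent on } L \text{ only reaches the subspace } Q\text{)}\\
&\forall \theta \in Q:\ P(\theta) &&\text{(every weight combination in } Q \text{ has properties } P\text{)}\\
&\forall \theta:\ P(\theta) \Rightarrow \neg \mathrm{Dangerous}(\theta) &&\text{(properties } P \text{ rule out the dangerous behavior)}\\
&\text{hence } \neg \mathrm{Dangerous}(\theta^{*})
\end{aligned}
$$

As the caveat above notes, a predicate like Dangerous is still goal-laden, so this is only the outer shape of the desired statement rather than a reduced version of it.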
Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)

See also the discussion of deconfusion research in https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/ and https://www.lesswrong.com/posts/Gg9a4y8reWKtLe3Tn/the-rocket-alignment-problem , and the sketch of ‘why this looks like a hard problem in general’ in https://www.lesswrong.com/posts/zEvqFtT4AtTztfYC4/optimization-amplifies and https://arbital.com/p/aligning_adds_time/ .
Not sure who you have in mind as people believing this
I don’t have particular people in mind, it’s more of a general “vibe” I get from talking to people. In the past, when I’ve stated the empirical claim, some people agreed with it, but upon further discussion it turned out they actually agreed with the normative claim. Hence my first question, which was to ask whether or not people believe the empirical claim.