At some point, there was definitely discussion about formal verification of AI systems. At the very least, this MIRIx event seems to have been about the topic.

From Safety Engineering for Artificial General Intelligence:

An AI built in the Artificial General Intelligence paradigm, in which the design is engineered de novo, has the advantage over humans with respect to transparency of disposition, since it is able to display its source code, which can then be reviewed for trustworthiness (Salamon, Rayhawk, and Kramár 2010; Sotala 2012). Indeed, with an improved intelligence, it might find a way to formally prove its benevolence. If weak early AIs are incentivized to adopt verifiably or even provably benevolent dispositions, these can be continually verified or proved and thus retained, even as the AIs gain in intelligence and eventually reach the point where they have the power to renege without retaliation (Hall 2007a).
Also, from section 2 of Agent Foundations for Aligning Machine Intelligence with Human Interests: A Technical Research Agenda:

When constructing intelligent systems which learn and interact with all the complexities of reality, it is not sufficient to verify that the algorithm behaves well in test settings. Additional work is necessary to verify that the system will continue working as intended in application. This is especially true of systems possessing general intelligence at or above the human level: superintelligent machines might find strategies and execute plans beyond both the experience and imagination of the programmers, making the clever oscillator of Bird and Layzell look trite. At the same time, unpredictable behavior from smarter-than-human systems could cause catastrophic damage, if they are not aligned with human interests (Yudkowsky 2008).

Because the stakes are so high, testing combined with a gut-level intuition that the system will continue to work outside the test environment is insufficient, even if the testing is extensive. It is important to also have a formal understanding of precisely why the system is expected to behave well in application.

What constitutes a formal understanding? It seems essential to us to have both (1) an understanding of precisely what problem the system is intended to solve; and (2) an understanding of precisely why this practical system is expected to solve that abstract problem. The latter must wait for the development of practical smarter-than-human systems, but the former is a theoretical research problem that we can already examine.
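To make the agenda's two requirements concrete by analogy, here is a minimal toy sketch of my own (nothing like it appears in the paper; the names meets_spec and practical_sort are hypothetical). The specification plays the role of (1), a precise statement of the problem to be solved, and the exhaustive check plays the role of (2), a reason to expect a particular implementation to solve it; for a real system that second step would have to be an actual argument or proof rather than enumeration.

```python
# Toy analogy only: separating a precise problem specification from a
# check that a concrete system actually meets it.
from collections import Counter
from itertools import product

def meets_spec(xs, ys):
    """(1) The problem, stated precisely: ys solves 'sort xs' iff ys is a
    rearrangement of xs and is non-decreasing."""
    return Counter(xs) == Counter(ys) and all(a <= b for a, b in zip(ys, ys[1:]))

def practical_sort(xs):
    """The 'practical system': a concrete algorithm we want a precise
    reason to trust beyond its behavior on test cases."""
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

# (2) Why we expect the system to solve that problem: here, brute-force
# checking over a small bounded domain stands in for a proof.
for n in range(5):
    for xs in product(range(3), repeat=n):
        assert meets_spec(list(xs), practical_sort(list(xs)))
```

The point of the toy is just the division of labor: the specification is meaningful independently of any particular implementation, which is why the agenda can treat (1) as a research problem to examine now, before (2) is possible.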
I suspect that this approach has fallen out of favor as ML algorithms have gotten more capable while our ability to prove anything useful about those algorithms has lagged far behind, although DeepMind and a few others are still trying.

MIRIx events are funded by MIRI, but we don’t decide the topics or anything. I haven’t taken a poll of MIRI researchers to see how enthusiastic different people are about formal verification, but AFAIK Nate and Eliezer don’t see it as super relevant. See https://www.lesswrong.com/posts/xCpuSfT5Lt6kkR3po/my-take-on-agent-foundations-formalizing-metaphilosophical#cGuMRFSi224RCNBZi and the idea of a “safety-story” in https://www.lesswrong.com/posts/8gqrbnW758qjHFTrH/security-mindset-and-ordinary-paranoia for better attempts to characterize what MIRI is looking for.

ETA: From the end of the latter dialogue,
In point of fact, the real reason the author is listing out this methodology is that he’s currently trying to do something similar on the problem of aligning Artificial General Intelligence, and he would like to move past “I believe my AGI won’t want to kill anyone” and into a headspace more like writing down statements such as “Although the space of potential weightings for this recurrent neural net does contain weight combinations that would figure out how to kill the programmers, I believe that gradient descent on loss function L will only access a result inside subspace Q with properties P, and I believe a space with properties P does not include any weight combinations that figure out how to kill the programmer.”
Though this itself is not really a reduced statement and still has too much goal-laden language in it.
Rather than putting the emphasis on being able to machine-verify all important properties of the system, this puts the emphasis on having strong technical insight into the system; I usually think of formal proofs more as a means to that end. (Again caveating that some people at MIRI might think of this differently.)
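As a toy illustration of the flavor of claim being gestured at, here is a minimal sketch of my own (not the dialogue's or MIRI's method; the function output_bounds, the box Q, and the threshold are all hypothetical). It uses interval arithmetic to check that a property P holds for every weight combination in a region Q of a two-weight linear "policy", rather than for any single trained model.

```python
# Toy sketch: checking a property P over an entire region Q of weight
# space using interval arithmetic, instead of testing individual models.

def interval_mul(a, b):
    """Product of two intervals [a_lo, a_hi] * [b_lo, b_hi]."""
    (al, ah), (bl, bh) = a, b
    corners = [al * bl, al * bh, ah * bl, ah * bh]
    return (min(corners), max(corners))

def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def output_bounds(weight_box, input_box):
    """Bound score(w, x) = w1*x1 + w2*x2 over all w in weight_box and
    all x in input_box."""
    (w1, w2), (x1, x2) = weight_box, input_box
    return interval_add(interval_mul(w1, x1), interval_mul(w2, x2))

# "Subspace Q": the weight region training is assumed to stay inside.
Q = ((-0.5, 0.5), (-0.5, 0.5))
# All inputs the deployed system might see.
X = ((0.0, 1.0), (0.0, 1.0))

lo, hi = output_bounds(Q, X)
# "Property P": the score never crosses the threshold that would trigger
# the forbidden action, for any weight setting in Q and any input in X.
THRESHOLD = 2.0
assert hi < THRESHOLD, "property P not established for all of Q"
```

Note that this only covers the second half of the quoted belief (that the region Q has a property which rules out the bad weight settings); the harder part, arguing that gradient descent on loss function L never leaves Q, is exactly the step the quote is trying to learn how to write down.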
See also the discussion of deconfusion research in https://intelligence.org/2018/11/22/2018-update-our-new-research-directions/ and https://www.lesswrong.com/posts/Gg9a4y8reWKtLe3Tn/the-rocket-alignment-problem , and the sketch of ‘why this looks like a hard problem in general’ in https://www.lesswrong.com/posts/zEvqFtT4AtTztfYC4/optimization-amplifies and https://arbital.com/p/aligning_adds_time/ .