Are any of the ideas in this series new?
TLDR: Many ideas in this series have been written about before by others. But it also seems to me that some of the ideas are new and important, or at least under-discussed if not new. I could be wrong, though.
There are many big and small ideas in this series. Many of them have been described or alluded to elsewhere by others. But there are also ideas in this series that (1) seem to me like they probably are important and (2) I cannot remember having seen described elsewhere.
I’ve had a hobby interest in the topic of superintelligence since 2009, and in the topic of AI Alignment since 2014. So I’ve read a lot of what has been written, but there is also a lot that I haven’t read. I could be presenting ideas in this series that seem new/interesting to me, but that actually are old.
Here are some writings I am aware of that I think are relevant to ideas in this series (some of them have influenced me, and some of them I know overlap quite a bit):
Eric Drexler has written extensively about principles and techniques for designing AGI-systems that don’t have agent-like behavior. In Comprehensive AI Services as General Intelligence he—well, I’m not going to do his 200-page document justice, but one of the things he writes about is having AGI-systems consist of more narrowly intelligent sub-systems (that are constrained and limited in what they do).
Paul Christiano and others have written about concepts such as Iterated Distillation and Amplification, as well as AI safety techniques revolving around debate.
Several people have talked about using AIs to help with work on AI safety. For example, Paul Christiano writes: “I think Eliezer is probably wrong about how useful AI systems will become, including for tasks like AI alignment, before it is catastrophically dangerous. I believe we are relatively quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas, and proposing modifications to proposals, etc. and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research. By the time AI systems can double the pace of AI research, it seems like they can greatly accelerate the pace of alignment research.”
To me, both Eliezer’s and Paul’s intuitions about FOOM seem like they could plausibly be correct. However, I think my object-level intuitions are more similar to Eliezer’s than to Paul’s, which partly explains my interest in how we might go about getting help with alignment from an AGI-system after it has become catastrophically dangerous. If AI-systems help us more or less solve alignment before they become catastrophically dangerous, then that would of course be preferable—and, I dunno, perhaps they will, but I prefer to contemplate scenarios where they don’t.
Several people have pointed out that verification is often easier than generation, and that this is a principle we can make heavy use of in AI Alignment. Paul Christiano writes: “Eliezer seems to argue that humans couldn’t verify pivotal acts proposed by AI systems (e.g. contributions to alignment research), and that this further makes it difficult to safely perform pivotal acts. (...) I think that this claim is probably wrong and clearly overconfident. I think it doesn’t match well with pragmatic experience in R&D in almost any domain, where verification is much, much easier than generation in virtually every domain.”
It seems to me as well that Eliezer is greatly underestimating the power of verifiability. At the same time, I know Eliezer is really smart and has thought deeply about AI Alignment. Eliezer seems to think quite differently from me on this topic, but the reasons why are not clear to me, and this irks me.
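To make the asymmetry concrete with a toy sketch of my own (not something taken from either Paul’s or Eliezer’s writing): checking a proposed answer can be a short, fully understood computation even when producing that answer is hard. Factoring is a classic illustration: verifying a claimed factorization takes one multiplication, while finding the factors can be much harder.

```python
# Toy sketch of the verification/generation asymmetry (my illustration, not
# taken from the posts quoted above). Finding the factors of a large number
# can be hard, but verifying a claimed factorization is mechanical.

def verify_factorization(n: int, claimed_factors: list[int]) -> bool:
    """Accept a claimed factorization only if it actually multiplies back to n."""
    product = 1
    for f in claimed_factors:
        if f <= 1:  # reject trivial "factors" like 1, 0, or negatives
            return False
        product *= f
    return product == n

# We don't need to trust whoever produced the candidate factors; we only need
# to trust this cheap check.
print(verify_factorization(2021, [43, 47]))  # True
print(verify_factorization(2021, [41, 49]))  # False
```

The hope, as I read the quote above, is that evaluating an AI system’s proposed alignment contributions is closer to the cheap checking step than to the hard generating step—though whether that analogy holds for alignment research is exactly the point of contention.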
Steve Omohundro (a fellow enthusiast of proofs, verifiability, and of combining symbolic and connectionist systems) has written about Safe-AI Scaffolding. He writes: “Ancient builders created the idea of first building a wood form on top of which the stone arch could be built. Once the arch was completed and stable, the wood form could be removed. (...) We can safely develop autonomous technologies in a similar way. We build a sequence of provably-safe autonomous systems which are used in the construction of more powerful and less limited successor systems. The early systems are used to model human values and governance structures. They are also used to construct proofs of safety and other desired characteristics for more complex and less limited successor systems.”
As John Maxwell pointed out to me, Stuart Armstrong has written about Low-Bandwidth Oracles. In the same comment, John Maxwell wrote: “I’m probably not the best person to red team this since some of my own alignment ideas are along similar lines”.
In his book Superintelligence, Nick Bostrom writes: “We could ask such an oracle questions of a type for which it is difficult to find the answer but easy to verify whether a given answer is correct. Many mathematical problems are of this kind. If we are wondering whether a mathematical proposition is true, we could ask the oracle to produce a proof or disproof of the proposition. Finding the proof may require insight and creativity beyond our ken, but checking a purported proof’s validity can be done by a simple mechanical procedure. (...) For instance, one might ask for the solution to various technical or philosophical problems that may arise in the course of trying to develop more advanced motivation selection methods. If we had a proposed AI design alleged to be safe, we could ask an oracle whether it could identify any significant flaw in the design, and whether it could explain any such flaw to us in twenty words or less. (...) Suffice it here to note that the protocol determining which questions are asked, in which sequence, and how the answers are reported and disseminated could be of great significance. One might also consider whether to try to build the oracle in such a way that it would refuse to answer any question in cases where it predicts that its answering would have consequences classified as catastrophic according to some rough-and-ready criteria.”
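Bostrom’s point that checking a purported proof’s validity “can be done by a simple mechanical procedure” is nicely concrete in modern proof assistants. A minimal sketch (my example, not Bostrom’s) in Lean 4:

```lean
-- A proof term could in principle come from an untrusted, highly capable
-- system; Lean's small trusted kernel then checks it mechanically.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Finding proofs of interesting theorems may require insight beyond our ken, but accepting or rejecting a submitted proof does not.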
There are at least a couple of posts on LessWrong with titles such as Bootstrapped Alignment.
I am reminded a bit of experiments where people are told to hum a melody, and how it’s often quite hard for others to guess what song you are humming. AI Alignment discussion feels to me a bit like that sometimes—it’s hard to convey exactly what I have in mind, and hard to guess exactly what others have in mind. Often there is a lot of inferential distance, and we are forced to convey great networks of thoughts and concepts over the low bandwidth medium of text/speech.
I am also reminded a bit of Holden Karnofsky’s article Thoughts on the Singularity Institute (SI), where he wrote:
One of the things that bothers me most about SI is that there is practically no public content, as far as I can tell, explicitly addressing the idea of a “tool” and giving arguments for why AGI is likely to work only as an “agent.”
In his response to this, Eliezer wrote the following:
Tool AI wasn’t the obvious solution to John McCarthy, I.J. Good, or Marvin Minsky. Today’s leading AI textbook, Artificial Intelligence: A Modern Approach—where you can learn all about A* search, by the way—discusses Friendly AI and AI risk for 3.5 pages but doesn’t mention tool AI as an obvious solution. For Ray Kurzweil, the obvious solution is merging humans and AIs. For Jurgen Schmidhuber, the obvious solution is AIs that value a certain complicated definition of complexity in their sensory inputs. Ben Goertzel, J. Storrs Hall, and Bill Hibbard, among others, have all written about how silly Singinst is to pursue Friendly AI when the solution is obviously X, for various different X. Among current leading people working on serious AGI programs labeled as such, neither Demis Hassabis (VC-funded to the tune of several million dollars) nor Moshe Looks (head of AGI research at Google) nor Henry Markram (Blue Brain at IBM) think that the obvious answer is Tool AI. Vernor Vinge, Isaac Asimov, and any number of other SF writers with technical backgrounds who spent serious time thinking about these issues didn’t converge on that solution.
In conclusion, I wouldn’t write this series if I didn’t think it could be useful. But as a fellow hairless monkey trying to do his best, it’s hard for me to confidently distinguish ideas that are new/helpful from ideas that are old/obvious/misguided. I appreciate any help in disabusing me of false beliefs.