I have seen too many discussions of Friendly AI, here and elsewhere (e.g. in comments at Michael Anissimov’s blog), detached from any concrete idea of how to do it....
At present, it is discussed in conjunction with a whole cornucopia of science fiction notions such as: immortality, conquering the galaxy, omnipresent wish-fulfilling super-AIs, good and bad Jupiter-brains, mind uploads in heaven and hell, and so on. Similarly, we have all these thought-experiments: guessing games with omniscient aliens, decision problems in a branching multiverse, “torture versus dust specks”. Whatever the ultimate relevance of such ideas, it is clearly possible to divorce the notion of Friendly AI from all of them....
SIAI, in discussing the quest for the right goal system, emphasizes the difficulties of this process and the unreliability of human judgment. Their idea of a solution is to use artificial intelligence to neuroscientifically deduce the actual algorithmic structure of human decision-making, and to then employ a presently nonexistent branch of decision theory to construct a goal system embodying ideals implicit in the unknown human cognitive algorithms.
In short, there is a dangerous and almost universal tendency to think about FAI (and AGI generally) primarily in far mode. Yes!
However, I’m less enamored with the rest of your post. The reason is that building AGI is simply a far higher-risk activity than traveling to the moon. Using “build a chemical-powered rocket” as your starting point for getting to the moon is reasonable in part because the worst that could plausibly happen is that the rocket blows up and kills a lot of volunteers who knew what they were getting into. In the case of FAI, Eliezer Yudkowsky has taken great pains to show that the slightest, subtlest mistake, one which could easily pass through any number of rounds of committee decision-making, coding, and code review, could lead to an existential catastrophe for humanity. He has also taken pains to show that approaches to the problem which entire committees have in the past considered a really good idea would lead to the same disaster. As far as I can tell, the LessWrong consensus agrees with him on the level of risk here, at least implicitly.
There is another approach. My own research pertains to automated theorem proving and its biggest application, software verification. We would still need to produce a formal account of the invariants we’d want the AGI to preserve, i.e., a formal account of what it means to respect human values. When I say “formal”, I mean it: a set of sentences in a suitable formal symbolic logic, carefully chosen to suit the task at hand. Then we would produce a mathematical proof that our code preserves those invariants, or, more likely, we would use techniques for producing the code and the proof at the same time. So we’d have, more or less, a mathematical proof that the AI is Friendly. I don’t know exactly how SIAI is thinking about the problem now, but I don’t think Eliezer would be satisfied by anything less certain than this sort of approach.
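To make concrete what “an invariant plus a machine-checked proof that the code preserves it” even looks like, here is a minimal sketch in Lean 4. It is deliberately a toy: the invariant is a resource budget, not anything resembling human values, and the names AgentState, withinBudget, and step are purely illustrative, not part of any real proposal.

```lean
-- A toy sketch only: a one-number "agent state" (resources consumed), an
-- invariant bounding it, and a machine-checked proof that each step of the
-- program preserves that invariant. The names AgentState, withinBudget, and
-- step are illustrative, nothing more.

-- The agent's state, reduced to a single number: resources consumed so far.
abbrev AgentState := Nat

-- The formal invariant: consumption never exceeds a fixed budget.
def withinBudget (budget : Nat) (s : AgentState) : Prop :=
  s ≤ budget

-- One step of the agent's behavior: consume a resource only while under budget.
def step (budget : Nat) (s : AgentState) : AgentState :=
  if s < budget then s + 1 else s

-- The verification obligation: if the invariant holds before a step,
-- it still holds afterward. `omega` discharges the arithmetic in each branch.
theorem step_preserves_invariant (budget : Nat) (s : AgentState)
    (h : withinBudget budget s) : withinBudget budget (step budget s) := by
  unfold withinBudget at h ⊢
  unfold step
  split <;> omega
```

The proof machinery is the easy part here; the entire difficulty of the FAI version lies in writing down the analogue of withinBudget for human values in the first place, which is exactly the formalization problem discussed below.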
Not that this outline, as it stands, is satisfactory. The formalization of human value is a massive problem, and arguably where most of the trouble lies anyway; I don’t think anyone has ever solved anything even close to it. But I’d argue the outline does clarify matters a bit, because it gives us a better idea of what a solution would look like. It also makes clear how dangerous the loose approach recommended here is: virtually all software has bugs, and a non-verified, recursively self-improving AI could magnify a bug in its value system until it approximates human values no better than paperclip maximization does. Moreover, a formal proof does no one a bit of good if the invariants themselves were not designed correctly.