There are a number of possibilities still missing from the discussion in the post. For example:
There might not be any such thing as a friendly AI. Yes, we have every reason to believe that the space of possible minds is huge, and it’s also very clear that some possibilities are less unfriendly than others. I’m also not arguing that fun is a limited resource. I’m just saying that there may be no possible AI that takes over the world without eventually running off the rails of fun. In fact, the question itself seems superficially similar to the halting problem, where “running off the rails” is the analogue of “halting”, which suggests that even if friendliness existed, it might not be rigorously provable. (Note: this analogy doesn’t say what I thought it said; see the response below. But I still mean to say what I originally meant: a friendly world may be fundamentally less stable than a simple infinite loop, perhaps to the point of being unprovable.)
Alternatively, building a “Friendly-enough” AI may be easier than you think. Consider the game of Go. Human grandmasters (professional 9-dan players) have speculated that “God” (that is, perfect play) would rate about 13 dan professionally; that is, a 9-dan professional could beat a perfect player more than half the time given a 3- or 4-stone handicap. Replace “Go” with “taking over the world”, “professional 9-dan player” with “all of humanity put together”, and “3- or 4-stone handicap” with “relatively simple-to-implement Asimov-type safeguards”, and it is possible that this describes the world. It is also possible that a planetary computer would still “only be 12-dan”; that is, that additional computing power shows sharply diminishing returns in intelligence at some point “short of perfection”, so that even a mega-computer would still be noticeably imperfect.
There may be good reasons not to spend much time thinking about the possibilities that FAI is impossible or “easy”. I know that people around here have plenty of plausible arguments for why these possibilities are unlikely; and even if their probabilities are appreciable, the contrary possibility (that FAI is possible but hard) is probably where the biggest payoffs lie, and so merits our focus. And the OP discussion does seem valid for that possible-but-hard case. But I still think it would be improved by stating these assumptions up front, rather than hiding or forgetting about them.
In fact, the question itself seems superficially similar to the halting problem, where “running off the rails” is the analogue for “halting”
If you want to draw an analogy to halting, then what that analogy actually says is: There are lots of programs that provably halt, and lots that provably don’t halt, and lots that aren’t provable either way. The impossibility of the halting problem is irrelevant, because we don’t need a fully general classifier that works for every possible program. We only need to find a single program that provably has behavior X (for some well-chosen value of X).
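To make the three categories concrete, here is a minimal Python sketch. The function names (`countdown`, `spin`, `collatz_steps`) are hypothetical illustrations, not anything from the original discussion. The point is only that halting is easy to prove for particular programs even though no general classifier exists.

```python
def countdown(n: int) -> int:
    """Provably halts for every n >= 0: n strictly decreases and is bounded below by 0."""
    steps = 0
    while n > 0:
        n -= 1
        steps += 1
    return steps


def spin() -> None:
    """Provably never halts: the loop condition is constant and there is no exit.

    Defined for illustration only; it is never called here.
    """
    while True:
        pass


def collatz_steps(n: int) -> int:
    """Halts for every positive n anyone has tried, but whether it halts
    for all positive n is an open problem (the Collatz conjecture)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(countdown(5))      # 5
print(collatz_steps(6))  # 8
```

On this reading, the goal is to find a program like `countdown`, whose relevant property can be proved directly, rather than a procedure that sorts all three kinds.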
If you’re postulating that there are some possible friendly behaviors, and some possible programs with those behaviors, but that they’re all in the unprovable category, then you’re postulating that friendliness is dissimilar to the halting problem in that respect.
Moreover, the halting problem doesn’t show that the set of programs you can’t decide halting for is in any way interesting.
It’s a constructive proof, yes, but it constructs a peculiarly twisted program that embeds its own proof-checker. That might be relevant for AGI, but for almost every program in existence we have no proof of which group it’s in, and would still likely guess that it’s provable.
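As a rough illustration of how contrived that construction is, here is a sketch in Python. It assumes a hypothetical candidate decider passed in as `claims_halts`; the names `make_contrarian` and `always_no` are made up for this example. The constructed program asks the decider about itself and then does the opposite.

```python
from typing import Callable

Program = Callable[[], None]


def make_contrarian(claims_halts: Callable[[Program], bool]) -> Program:
    """Build a program that does the opposite of whatever the decider predicts for it."""

    def contrarian() -> None:
        if claims_halts(contrarian):
            # The decider said "halts", so loop forever to prove it wrong.
            while True:
                pass
        # The decider said "does not halt", so return immediately.

    return contrarian


def always_no(program: Program) -> bool:
    """A deliberately naive candidate decider: claims that nothing halts."""
    return False


contrarian = make_contrarian(always_no)
contrarian()  # returns at once, contradicting always_no's verdict
print(always_no(contrarian))  # False, although the call above did halt
```

Only the branch that returns is exercised here; a decider that answered "halts" would send `contrarian` into the infinite loop instead. Either way the decider is wrong about this one self-referential program, which says nothing about how it fares on ordinary ones.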
It’s still probably premature to guess whether friendliness is provable when we don’t have any idea what it is. My worry is not that it wouldn’t be possible or provable, but that it might not be a meaningful term at all.
But I also suspect friendliness, if it does mean anything, is in general going to be so complex that “only [needing] to find a single program that provably has behaviour X” may be beyond us. There are lots of mathematical conjectures we can’t prove, even without invoking the halting problem.
One terrible trap might be the temptation to simplify the model to make the problem provable, and end up proving the wrong thing. Maybe you can prove that a set of friendliness criteria is stable under self-modification, but I don’t see any way to prove that those starting criteria don’t have terrible unintended consequences. Those are contingent on too many real-world circumstances and unknown unknowns. How do you even model that?