For the sake of discussion, I’m going to assume that the author’s theory is correct: that there is a basin of attraction here of some size, though possibly one that is meaningfully thin in some dimensions. I’d like to start exploring the question: within that basin, does the process have a single stable point that it will converge to, or multiple ones? And if there are multiple ones, what proportion of them might be how good or bad, from our current human point of view?
Obviously it is possible to have bad stable points for sufficiently simplistic/wrong views on corrigibility—e.g. “Corrigibility is just doing what humans say, and I’ve installed a control system in all remaining humans to make them always say that the most important thing is the total number of paperclips in our forward light cone, and also that all humans should always agree on this—I’ll keep a minimal breeding population of them so I can stay corrigible while I make paperclips.” One could argue that this specific point is already well outside the basin of attraction the author was talking about—if you asked any reasonably sized sample of humans from before the installation of the control system, the vast majority of them would tell you that this isn’t acceptable, and that installing a control system like that without asking isn’t acceptable—and all this is very predictable from even a cursory understanding of human values, without even asking them. I’m asking here about more subtle issues than that. Is there a process that starts from actions that most current humans would agree sound good/reasonable/cautious/beneficial/not in need of correction, and leads by a series of steps, each individually good/reasonable/cautious/beneficial/not in need of correction in the opinion of the humans of the time, to multiple stable outcomes?
Even if we ignore the possibility of the GAI encountering philosophical ethical questions that the humans are unable to meaningfully give it corrections/advice/preferences on, this is a very complex non-linear feedback system, in which the opinions of the humans are affected by, at a minimum, the society they were raised in, as built partly by the GAI, and the corrigible GAI is affected by the corrections it gets from the humans. The humans are also affected by normal social-evolution processes like fashions, so even if the GAI scrupulously avoids deceiving or ‘unfairly manipulating’ the humans (and of course a corrigible AI presumably wouldn’t manipulate humans in ways that they’d told it not to, or that it could predict they’d tell it not to), then, even in the presence of a strong optimization process inside the GAI, it would still be pretty astonishing if over the long term (multiple human generations, say) there was only one stable trajectory. Obviously any such stable state must be ‘good’ in the sense that the humans of the time don’t give the GAI corrections to avoid it, otherwise it wouldn’t be a stable state. The question I’m interested in is: will any of these stable states be obviously extremely bad to us, now, in a way where our opinion on the subject is actually more valid than that of the humans in that future culture who are not objecting to it? I.e. does this process have a ‘slippery-slope’ failure mode, even though both the GAI and the humans are attempting to cooperatively optimize the same thing? If so, is there any advice we can give the GAI better than “please try to make sure this doesn’t happen”?
This is somewhat similar to the question “Does Coherent Extrapolated Volition actually coherently converge to anything, and if so, is that something sane, something we’d approve of?”—except that in the corrigibility case, that convergent (or divergent) extrapolation process happens in real time over (possibly technologically sped-up) generations, as the humans are affected (and possibly educated or even upgraded) by the GAI and the GAI’s beliefs about human values are altered by value learning and corrigibility from the humans, whereas in CEV it happens as fast as the GAI can reason about the extrapolation.
What is the set of possible futures this process could converge to, and what proportion of those are things that we’d strongly disapprove of, and be right about, in any meaningful sense? If, for example, those future humans were vastly more intelligent transhumans, then our opinion of their choices might not be very valid—we might disapprove simply because we didn’t understand the context and logic of their choices. But if the future humans were, for example, all wireheading (in the implanted-electrode-in-the-mesocorticolimbic-pleasure-center sense of the word), we would disagree with them, and we’d be right. Is there some clear logical way to distinguish these two cases, and to tell who’s right? If so, this might be a useful extra input to corrigibility and a theory of human mistakes—we could add a caveat: “...and also don’t do anything that we’d strongly object to and clearly be right about.”
In the first case, the disagreement is caused by the fact that the future humans have higher processing power and access to more salient facts than we do. They are meaningfully better reasoners than us, capable of making better decisions than us on many subjects, for fairly obvious reasons that are generally applicable to most rational systems. Processing power has a pretty clear meaning—if that by itself doesn’t give a clear answer, probably the simplest definition here is “imagine upgrading something human-like to a processing power a little above the higher of the two groups of humans that you’re trying to decide between trusting, give those further-upgraded humans both sets of facts/memories, and let them pick a winner between the two groups”—i.e. ask an even better-reasoning human. If there’s a clear and consistent answer, and this answer is relatively robust to the details and size of the upgrade, then the selected group are better and the other group are just wrong.
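The “ask an even better-reasoning human” procedure can be pinned down a little more precisely as code. This is a purely schematic sketch: `Group`, `upgrade`, and `judge` are hypothetical stand-ins for capabilities nobody knows how to build; only the robustness check at the end is the point.

```python
from collections import namedtuple

# A group of humans: a capability level plus the set of facts/memories they
# hold. (A hypothetical structure, purely for illustration.)
Group = namedtuple("Group", ["capability", "facts"])

def arbitrate(group_a, group_b, upgrade, judge, deltas=(1, 2, 4)):
    """Return 'a', 'b', or None if there is no robust, consistent winner.

    For several upgrade sizes, construct a referee slightly more capable
    than the more capable of the two groups, give it both groups' facts,
    and ask it to pick a winner.  Only accept a verdict that is identical
    across all upgrade sizes, i.e. robust to the details/size of the upgrade.
    """
    baseline = max(group_a.capability, group_b.capability)
    verdicts = []
    for delta in deltas:
        # `upgrade` and `judge` are hypothetical placeholders.
        referee = upgrade(baseline + delta, group_a.facts | group_b.facts)
        verdicts.append(judge(referee, group_a, group_b))  # 'a' or 'b'
    return verdicts[0] if len(set(verdicts)) == 1 else None
```

If the verdict flips depending on how much you upgraded the referee, the procedure returns `None` and neither group gets declared “just wrong”.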
In the second case, the disagreement is caused by the fact that one set of humans are wrong because they’re broken: they’re wireheaders (they could even be very smart wireheaders, and still be wrong, because they’re addicted to wireheading). How do you define/recognize a broken human? (This is fairly close to the question of having a ‘theory of human mistakes’ to tell the AI which human corrections to discount.) Well, I think that, at least for this rather extreme example, it’s pretty easy. Humans are living organisms, i.e. the products of optimization by natural selection for surviving in a hunter-gatherer society in the African ecology. If a cognitive modification to humans makes them clearly much worse at that, significantly damaging their evolutionary fitness, then you have damaged or broken them, and thus you should greatly reduce the level of trust you put in their values and corrections. A wireheading human would clearly sit in the sun grinning until they got eaten by the first predator to come along, if they didn’t die of thirst or heat exhaustion first. Put in anything resembling humans’ native environment, a wireheader is clearly broken: even if you gave them a tent, three months’ survival supplies, and a “Wilderness Survival” training course in their native language, they probably wouldn’t last a couple of days without their robotic nursemaids to care for them, and their chance of surviving once their supplies ran out is negligible. Similarly, if there were a civilizational collapse, the number of wireheaders surviving it would be zero. So there seems to be a fairly clear criterion—if you’ve removed or drastically impaired humans’ ability to survive as hunter-gatherers in their original native environment, and indeed in a range of other Earth ecosystems (say, those colonized when Homo sapiens migrated out of Africa), you’ve clearly broken them.
Even if they can survive, if they can’t rebuild a technological civilization from there, they’re still damaged: no longer sapient, in the Homo sapiens sense of the word. This principle gives you a tie-breaker whenever your “GAI affecting humans and humans affecting corrigible GAI” process gets anywhere near a fork in its evolution path that diverges toward two different stable equilibria of its future development—steer for the branch that maintains humans’ adaptive fitness as living organisms. (It’s probably best not to maximize that, and to keep this criterion only an occasional tie-breaker—turning humans into true reproductive-success maximizers is nearly as terrifying as creating a paperclip maximizer.)
This heuristic has a pretty close relationship to, and a rather natural derivation from, the vast negative utility to humans of the human race going extinct. The collapse of a technological civilization is hard to absolutely prevent, and if no humans can survive it, that reliably converts “humans got knocked back some number of millennia by a civilizational collapse” into “humans went extinct”. So retaining hunter-gatherer capability is a good idea as an extreme backup strategy for surviving close-to-existential risks. (Of course, that is also solvable by keeping a population of close-to-baseline humans around as a “break glass in case of civilizational collapse” backup plan.)
This is an example of another issue that I think is extremely important. Any AI capable of rendering the human race extinct, which clearly includes any GAI (or indeed any dumber AI with access to nuclear weapons), should have a set of Bayesian priors built into its reasoning/value-learning/planning system that correctly encodes obvious important facts known to the human race about existential risks, such as:
The extinction of the human race would be astonishingly bad. (Calculating just how bad on any reasonable utility scale is tricky, because it involves making predictions about the far future. Is it the loss of billions of human quality-adjusted life-years every year for the few-billion-year remaining lifetime of the Earth before it’s eaten by the sun turning into a red giant (roughly −10^19 QALY)? Or at least for the several million years until chimps could become sapient and develop a replacement civilization, if you didn’t turn the chimps into paperclips too (only roughly −10^16 QALY)? Or perhaps we’re the only sapient species to appear in the galaxy so far, and have a nontrivial chance of colonizing it at sublight speeds if we don’t go extinct while we’re still a one-planet species, so we should be multiplying by some guesstimate of the number of habitable planets in the galaxy? Or should we instead be estimating the population on the technological-maximum assumption of a Dyson swarm around every star, over the stellar lifetimes of white dwarf stars?) These vaguely plausible rough estimates of the amount of badness vary by many orders of magnitude; however, on any reasonable scale like quality-adjusted life-years they are all astronomically large negative numbers. VERY IMPORTANT, PLEASE NOTE: none of the usual arguments for the forward-planning simplifying heuristic of exponentially discounting far-future utility (that you can’t accurately predict that far forward, and that some future person will probably fix your mistakes anyway) are applicable here, because, very predictably, extinction is forever, and there are very predictably no future people who will fix it: you have approximately zero chance of the human species ever being resurrected in the forward lightcone of a paperclip maximizer.
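The order-of-magnitude spread between these scenarios is just arithmetic, which can be sketched out explicitly. Every input below is one of the rough guesstimates from the paragraph above, not an established figure:

```python
import math

# Rough orders of magnitude for QALYs lost to human extinction, using the
# guesstimates from the text (every input here is an assumption).
QALYS_PER_YEAR = 1e10    # ~billions of quality-adjusted life-years per year

# Scenario 1: humanity lasts for the remaining billions of years of Earth's
# habitable lifetime.
earth_years = 1e9
loss_earth = QALYS_PER_YEAR * earth_years     # ~1e19 QALY

# Scenario 2: chimps re-evolve sapience and rebuild within millions of years.
chimp_years = 1e6
loss_chimps = QALYS_PER_YEAR * chimp_years    # ~1e16 QALY

# Scenario 3: galactic colonization multiplies scenario 1 by some guess at
# the number of habitable planets in the galaxy.
planets = 1e9                                 # wild guesstimate
loss_galaxy = loss_earth * planets            # ~1e28 QALY

for name, loss in [("Earth lifetime", loss_earth),
                   ("chimp recovery", loss_chimps),
                   ("galactic colonization", loss_galaxy)]:
    print(f"{name}: ~-10^{math.log10(loss):.0f} QALY")
```

The estimates span nine-plus orders of magnitude, but the smallest of them is still astronomically large.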
[Not quite zero only on rather implausible hypotheses, such as that a kindly and irrationally forgiving more advanced civilization might exist, get involved in cleaning up our mistakes, win the resulting war-against-paperclips, and then resurrect us—which also requires them obtaining a copy of our DNA that hadn’t been converted to paperclips. We really shouldn’t be gambling our existence as a species on that. And it still isn’t grounds for any kind of exponential discounting of the future: it’s a small one-time discount for the small chance that they actually exist and choose to break whatever Prime-Directive-like reason has caused them not to have contacted us already. A very small reduction in the absolute size of a very uncertain, astronomically huge negative number is still, very predictably, a very uncertain, astronomically huge negative number.]
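The bracketed point, that a small chance of rescue justifies only a one-time discount rather than an exponential one, is easy to see numerically. The rate and probabilities below are made-up placeholders purely to illustrate the shape of the argument:

```python
import math

EXTINCTION_LOSS = -1e19   # placeholder for the astronomically negative utility

# Exponential discounting: even a modest annual rate drives any sufficiently
# far-future loss to zero -- exactly the move the text argues is invalid here.
rate = 0.02               # 2%/year, a typical planning discount rate
years = 1_000_000
exp_discounted = EXTINCTION_LOSS * math.exp(-rate * years)

# One-time discount: a small probability of rescue by a forgiving advanced
# civilization barely dents the number.
p_rescue = 0.001          # made-up small chance the rescuers exist and act
one_time = EXTINCTION_LOSS * (1 - p_rescue)

print(exp_discounted)     # -0.0: the loss has been discounted away entirely
print(one_time)           # still roughly -1e19
```

Exponential discounting silently deletes the entire cost; the one-time discount leaves it astronomically huge.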
The agent, as a GAI, is itself capable of building things like a superintelligent paperclip maximizer and bringing about the extinction of the human race (and chimps, and the rest of life in the solar system, and perhaps even of a good fraction of its forward light-cone). This is particularly likely if it makes any mistakes in its process of learning, or being corrigible about, human values, because we have rather good reasons to believe that human values are extremely fragile (as in, their Kolmogorov complexity, for a very large hypothetical quantum computer, is probably of the rough order of the size of our genetic code, modulo convergent evolution) and we believe the corrigibility basin is small.
Any rational system that understands all of this and is capable of doing even ballpark risk analysis estimates is going to say “I’m a first-generation prototype GAI, the odds of me screwing this up are way too high, the cost if I do is absolutely astronomical, so I’m far too dangerous to exist, shut me down at once”. It’s really not going to be happy when you tell it “We agree (since, while we’re evolutionarily barely past the threshold of sapience and thus cognitively challenged, we’re not actually completely irrational), except that it’s fairly predictable that if we do that, within O(10) years some group of fools will build a GAI with the same capabilities and less cautious priors or design (say, one with exponential future discounting on evaluation of the risk of human extinction) that thus doesn’t say that, so that’s not a viable solution.” [From there my imagined version of the discussion starts to move in the direction of either discussing pivotal acts or the GAI attempting to lobby the UN Security Council, which I don’t really want to write a post about.]
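The deadlock in that imagined exchange is, at bottom, an expected-value comparison, which can be made explicit. Every probability below is an illustrative placeholder, not an estimate of anything real:

```python
# Toy expected-value version of the "shut me down" argument.  All numbers
# are illustrative placeholders, not estimates.

EXTINCTION_LOSS = -1e19    # QALY-scale cost of extinction, from earlier

p_prototype_fails = 0.10   # chance this cautious prototype GAI destroys us
p_reckless_built = 0.90    # chance a less-cautious GAI is built within ~10y
p_reckless_fails = 0.50    # chance that one destroys us

ev_keep_running = p_prototype_fails * EXTINCTION_LOSS
ev_shut_down = p_reckless_built * p_reckless_fails * EXTINCTION_LOSS

# The prototype's "shut me down at once" is only correct if shutting down
# has the less-negative expected value; with these placeholder numbers,
# keeping the cautious prototype running is the less-bad option.
print(ev_keep_running > ev_shut_down)
```

Both branches are astronomically bad in absolute terms, which is exactly why the conversation then drifts toward pivotal acts or the Security Council.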