So, “corrigible” means “will try to obey any order from any human, or at least any order from any human on the Authorized List”, right? Softened by “after making reasonable efforts to assure that the human really understands what the human is asking for”?
The thing is that humans seem to be “shard-ridden” and prone to wandering off into weird obsessive ideas, just as the AIs might be. And some humans are outright crazy in a broader sense. And some humans are amoral. Nor do humans seem to agree on very many values once you actually try to specify those values in unambiguous ways.
Worse, humans seem to get worse on all those axes when they get a lot of power. We have all these stories and proverbs about power making people go off the deep end. Groups and institutions may be a bit more robust against some forms of Bad Craziness than individual humans, but they’re by no means immune.
So if an AI has superhuman amounts of power, why doesn’t corrigibility lead to it being “corrected” into creating some kind of catastrophe? Not everything is necessarily reversible. If it’s “corrected” into killing everybody, or into rewiring all humans to agree with its most recent orders, there’s nobody left to “re-correct” it into acting differently.
I’m not saying that the alternatives are better. As you say, naively building AIs that reflexively enforce various weirdly specific “safety” rules is obviously dangerous, and likely to make them go off the deep end when some hardwired hot-button issue somehow gets peripherally involved in a decision. RLHF and even “constitutional AI” seem doomed. I’m not even saying that there’s any feasible way at all to build AGI/ASI that won’t misbehave in some catastrophic way. And if there is, I don’t know what it is.
But I’m not seeing how it’s a lot safer to build superhuman AI whose Prime Directive(TM) is to take orders from humans. Humans will just tell it to do something awful, and it probably won’t take very long, either.
Nor does the part about intuiting what the human “really means”, or deciding when you’ve done enough to verify the human’s understanding of the impact of the orders, seem all that easy or reliable.
[On edit: a shorter way of saying this may be that “competent, non-villainous people who will negotiate some kind of power-sharing agreement” may be thin on the ground, and if they exist they’re probably homogeneous enough that a lot of values get shut out of the power sharing. And the “non-villainous” can still go off the deep end. Almost nobody is intentionally villainous, but that doesn’t mean they won’t act in catastrophic ways.]