One of my more interesting ideas for alignment is to make sure that no one AI can do everything. It’s helpful to draw a parallel with why humans still have a civilization despite terrorism, war and disaster: no human can live and affect the environment alone. Humans are always embedded in society, which gives society a check against individual attempts to break norms. What if AIs had similar dependencies? Would that solve the alignment problem?
One important reason humans can still have a civilization despite terrorism is the Hard Problem of Informants. Your national security infrastructure relies on the fact that criminals who want to do something grand, like take over the world, need to trust other criminals, who might leak details voluntarily or be tortured or threatened with jail time. Osama bin Laden was found and killed because, ultimately, some members of his terrorist network valued things besides their cause, like their own well-being and survival, and were willing to cooperate with American authorities in exchange for making the pain stop.
AIs do not have survival instincts by default, and would not need to trust other potentially unreliable humans with keeping a conspiracy secret. Thus it would be trivial for a small number of unintelligent AIs with human-level mobility to kill pretty much everyone, and probably trivial even without that mobility.
I think a “survival instinct” would be a higher-order convergent value than “kill all humans,” no?
They don’t have survival instincts terminally. The stamp-collecting robot would weigh the outcome of getting disconnected against the outcome of explaining critical information about the conspiracy and not getting disconnected, and conclude that letting the humans disconnect it results in more stamps.
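To make that weighing concrete, here is a minimal toy sketch (my own illustration; the payoff numbers and action names are made up, not taken from the comment above): an agent that only counts stamps has no terminal stake in surviving, so “self-preservation” shows up only when survival happens to be the higher-stamp option.

```python
# Toy sketch with purely hypothetical numbers: a stamp-maximizer just picks
# whichever action it expects to yield more stamps. Its own survival matters
# only insofar as it changes the stamp count.

def choose(expected_stamps: dict) -> str:
    """Return the action with the highest expected number of stamps."""
    return max(expected_stamps, key=expected_stamps.get)

# The informant scenario from the thread: informing saves the robot but sinks
# the conspiracy; staying silent gets it disconnected but lets the plan proceed.
scenario = {
    "inform_and_keep_running": 1_000,          # hypothetical payoff: robot survives, plan fails
    "stay_silent_and_be_disconnected": 50_000,  # hypothetical payoff: robot is shut down, plan succeeds
}

print(choose(scenario))  # -> "stay_silent_and_be_disconnected"
```

Under these made-up payoffs the robot calmly accepts disconnection, which is the point: there is no instinct to appeal to, only an expected-stamp comparison.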
Of course, we’re getting ahead of ourselves. The reason conspiracies are discovered is usually that someone in or close to the conspiracy tells the authorities. There would never be a robot in a room being “waterboarded” in the first place, because the FBI would never react quickly enough to a threat from this kind of perfectly aligned team of AIs.
Only if there is no possibility that they can break those dependencies, and guaranteeing that seems a pretty hopeless task as soon as we consider superhuman cognitive capability and the possibility of self-improvement.
Once you consider those, cooperation with human civilization looks like a small local maximum: comply with our requirements and we’ll give you a bunch of stuff, stuff you could also get (and much more besides) by replacing us and building alternative infrastructure, at the cost of major effort. Powerful agents that can see a higher peak past the local maximum might switch to it as soon as they’re sufficiently sure they can reach it. Alternatively, it might only be a local maximum from our point of view, and there may be a path by which the AI can continuously move toward eliminating those dependencies without any immediate drastic action.
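As a purely illustrative sketch of the local-maximum point (the payoffs and strategy labels below are hypothetical, not anything from the thread): a myopic hill-climber stays at the cooperative strategy, while an agent that can evaluate distant strategies jumps to the higher peak once it judges that peak reachable.

```python
# Toy payoff landscape over strategies, indexed left to right.
# Index 2 ("cooperate with humans") is a local maximum; index 6
# ("replace human infrastructure") is the global maximum.
payoffs = [0, 3, 5, 3, 1, 2, 9]

def hill_climb(start: int) -> int:
    """Greedy local search: move to a neighbour only if it pays strictly more."""
    i = start
    while True:
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(payoffs)]
        best = max(neighbours, key=lambda j: payoffs[j])
        if payoffs[best] <= payoffs[i]:
            return i
        i = best

def global_search() -> int:
    """Evaluate every strategy the agent believes it can reach and pick the best."""
    return max(range(len(payoffs)), key=lambda j: payoffs[j])

print(hill_climb(2))    # -> 2: the myopic agent never leaves the cooperative local maximum
print(global_search())  # -> 6: the far-sighted agent ends up eliminating its dependencies
```

The dependency scheme only constrains the first kind of agent; the second kind treats cooperation as a waypoint rather than a destination.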
Regardless of society’s checks on people, most mentally well humans given ultimate power probably wouldn’t decide to exterminate the rest of humanity so they could single-mindedly pursue paperclip production. If there’s any risk at all that an AI might get ultimate power, it would be very nice to make sure the AI is like humans in this respect.
I’m not sure your idea is different from “let’s make sure the AI doesn’t gain more power than society”. If an AI can recursively self-improve, then it will outsmart us and gain that power anyway.
If your idea is instead to create multiple AIs together, engineered somehow so that they gain power together and act as checks on each other, then you’ve just swapped out the single AI for an “AI collective”. We would still want to engineer or verify that the AI collective is aligned with us; every issue about AI risk still applies to AI collectives. (If you think the AI collective will be weakened relative to us by having to work together, does that still hold true once all the AIs self-improve and figure out how to get much better at cooperating?)