Positive values seem more robust and lasting than prohibitions. Imagine we train an AI on realistic situations where it can kill people, and penalize it when it does so. Suppose that we successfully instill a strong and widely activated “If going to kill people, then don’t” value shard.
Even assuming this much, the situation seems fragile. See, many value shards are self-chaining. In The shard theory of human values, I wrote about how:
A baby learns “IF juice in front of me, THEN drink”,
The baby is later near juice, turns, and sees it, which activates the learned “reflex” heuristic; this teaches her to turn around and look at juice when juice is nearby,
The baby is later far from juice and bumbles around until she’s near it, whereupon she drinks the juice via the existing heuristics. This teaches “navigate to juice when you know it’s nearby.”
Eventually this develops into a learned planning algorithm incorporating multiple value shards (e.g. juice and friends) so as to produce a single locally coherent plan.
...
The juice shard chains into itself, reinforcing itself across time and thought-steps.
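As a minimal toy sketch of this self-chaining dynamic (my own illustration with made-up contexts, not anything from the quoted post): model a shard as the set of contexts in which it bids for its action, and have every reinforced payoff absorb the contexts that led up to it.

```python
# Toy sketch (hypothetical contexts): a positive shard self-chains because each
# reinforced payoff widens the set of contexts that activate the shard next time.

juice_shard = {"juice in front of me"}  # initial reflex: IF juice in front of me, THEN drink

# Each trajectory is the chain of contexts the baby passes through before the
# final "drink juice" payoff.
trajectories = [
    ["juice in front of me"],
    ["juice visible nearby", "juice in front of me"],
    ["juice somewhere in the room", "juice visible nearby", "juice in front of me"],
]

for trajectory in trajectories:
    # The shard fires only if some context along the way already activates it.
    if any(ctx in juice_shard for ctx in trajectory):
        # The payoff reinforces everything upstream of it, so every context that
        # preceded the drink becomes a new trigger for the shard.
        juice_shard.update(trajectory)

print(juice_shard)
# (set order may vary)
# {'juice in front of me', 'juice visible nearby', 'juice somewhere in the room'}
```

Each round of reinforcement expands the shard’s activation contexts, which is exactly what lets the next, more distant context get absorbed on the following round.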
But a “don’t kill” shard seems like it should remain… stubby? Primitive? It can’t self-chain into not doing something. If you’re about to kill someone, and then don’t because of the don’t-kill shard, and that avoids negative reward… then maybe the “don’t kill” shard gets reinforced and generalized a bit, because it avoided negative reward.
But—on my current guesses and intuitions—that shard doesn’t become more sophisticated, it doesn’t become reflective, and it doesn’t “agentically participate” in the internal shard politics (e.g. the agent’s “meta-ethics”, deciding what kind of agent it “wants to become”). Other parts of the agent want things (paperclips or whatever), and those things are harder to get if the agent isn’t allowed to kill anyone.
Crucially, the no-killing injunction can probably be steered around by the agent’s other values. While the obvious route of lesioning the no-killing shard might be reflectively-predicted by the world model to lead to more murder, and therefore bid against by the no-killing shard… There are probably ways to get around this obstacle. Other value shards (e.g. paperclips and cow-breeding) might surreptitiously bid up lesioning plans which are optimized so as to not activate the reflective world-model, and thus, not activate the no-killing shard.
So, don’t embed a shard which doesn’t want to kill. Make a shard which wants to protect / save / help people. That can chain into itself across time.
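To make the contrast concrete, here is an equally toy sketch (again my own, with hypothetical trigger names): a prohibition shard only fires to veto an action its trigger set recognizes, and the veto produces no new payoff context to absorb, so its trigger set never grows; a positive “help people” shard chains exactly like the juice shard above.

```python
# Toy contrast (hypothetical names): prohibition shards stay "stubby" because
# vetoing never adds new activation contexts, while positive shards keep chaining.

dont_kill_triggers = {"about to kill"}

def prohibition_bid(context: str) -> bool:
    """Veto iff the context matches a known trigger; nothing new is learned either way."""
    return context in dont_kill_triggers

help_shard = {"person in danger in front of me"}

def reinforce_help(trajectory: list[str]) -> None:
    """A successful 'helped someone' payoff absorbs every context that led to it."""
    if any(ctx in help_shard for ctx in trajectory):
        help_shard.update(trajectory)

reinforce_help(["heard a cry nearby", "person in danger in front of me"])
reinforce_help(["walking past an alley", "heard a cry nearby"])

print(dont_kill_triggers)  # unchanged: {'about to kill'}
print(help_shard)          # now also contains 'heard a cry nearby', 'walking past an alley'

# And, as in the lesioning worry above: a plan whose surface description never
# matches a trigger never even activates the veto.
print(prohibition_bid("quietly reroute training so the no-kill circuit atrophies"))  # False
```

The point of the sketch is the asymmetry in what gets updated: the positive shard accumulates new contexts from which it can steer behavior, while the prohibition’s influence stays bounded by whatever triggers it started with (or whatever the world-model happens to surface).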
See also:
Deontology seems most durable to me when it can be justified on consequentialist grounds. Perhaps this is one mechanistic reason why.
This is one point in favor of the “convergent consequentialism” hypothesis, in some form.
I think that people are not usually defined by negative values (e.g. “don’t kill”), but by positives, and perhaps this is important.
I strongly agree that self-seeking mechanisms are more able to maintain themselves than self-avoiding mechanisms. Please post this as a top-level post.
Seems possibly relevant & optimistic when seeing deception as a value. It too has the form ‘if about to tell the human a statement with properties x, y, z, then don’t’.
It can still be robustly derived as an instrumental subgoal during general-planning/problem-solving, though?
This is true, but it indicates a radically different stage in training at which we should expect to find deception, compared to deception being an intrinsic value. It also potentially expands the kinds of reinforcement schedules we may want to use, compared to worlds where deception crops up at the earliest opportunity (though pseudo-deception may still occur, where behaviors correlated with successful deception get reinforced).
Oh, huh, I had cached the impression that deception would be derived, not intrinsic-value status. Interesting.
This asymmetry makes a lot of sense from an efficiency standpoint. No sense wasting your limited storage/computation on states (or state-action pairs) that you are simultaneously preventing yourself from encountering.