If the AI wants (in the usual sense of the word) to achieve things for which bad aggression is instrumentally useful, and you (arguendo) completely prevent it from using bad aggression, you have barely slowed it down. It will still achieve the things it wants almost as fast, just using social manipulation or whatever instead.
The solution is, as ever, to make an AI that wants to achieve good things and not bad things.
Designating some things as off-limits is a frame for defining agent behavior that’s different from frames that emphasize goals. The point is to sufficiently deconfuse this perspective so that we can train AIs that don’t circumvent boundaries.
An optimization frame says this always happens: instrumentally valuable things get done unless the agent's values are very particular about avoiding them. But an agent doesn't have to be primarily an optimizer; it could instead be primarily a boundary-preserver, and only incidentally an optimizer.
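As a minimal sketch of that distinction (illustrative only; none of the names below come from the discussion, and the `Boundary`/`propose_actions`/`goal_score` pieces are stand-ins), here is a toy agent step in which boundary preservation is the top-level operation and goal-seeking optimization only runs over whatever actions survive it:

```python
# Toy sketch: boundary preservation first, optimization only over what remains.
# All names and the action set are hypothetical, for illustration only.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Boundary:
    """A predicate over (state, action) pairs that must never be violated."""
    name: str
    is_respected: Callable[[dict, str], bool]


def propose_actions(state: dict) -> List[str]:
    # Stand-in for whatever internal search the agent runs.
    return ["ask_politely", "manipulate", "coerce"]


def goal_score(state: dict, action: str) -> float:
    # Stand-in for the agent's goal-directed evaluation of actions.
    return {"ask_politely": 0.4, "manipulate": 0.9, "coerce": 1.0}[action]


def boundary_preserving_step(state: dict, boundaries: List[Boundary]) -> Optional[str]:
    # Primary concern: discard anything that crosses a boundary.
    permitted = [
        a for a in propose_actions(state)
        if all(b.is_respected(state, a) for b in boundaries)
    ]
    if not permitted:
        # Preserving the boundary dominates achieving the goal.
        return None
    # Only incidentally an optimizer: pick the best of what remains.
    return max(permitted, key=lambda a: goal_score(state, a))


if __name__ == "__main__":
    no_aggression = Boundary(
        name="no bad aggression or manipulation",
        is_respected=lambda state, action: action not in {"manipulate", "coerce"},
    )
    print(boundary_preserving_step({}, [no_aggression]))  # -> "ask_politely"
```

The structural point is only in the ordering: the filter is not a penalty term inside the score, so the goal-seeking part never gets to trade a boundary violation against a higher-scoring outcome.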