What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss).
Hmm, actually this still doesn’t fully address TurnTrout’s (Alex Turner’s) concern, because this mesa-optimizer could try to minimize the actual distance between its output and the training label by changing the training label (what that means depends on how the training label is defined within its utility function). To do that it would have to break out of the box that it’s in, which may not be possible, but this is still a system that is “looking for ways to hurt you.” It seems that what we really want is a mesa-optimizer that tries to minimize the actual loss while pretending that it has no causal influence on the training label (even if it actually does because there’s a way to break out of its box).
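To make the distinction concrete (with some ad hoc notation I'm introducing just for this comment: $a$ is the model's output/action, $Y(a)$ is the training label that actually results given that action, and $a_\varnothing$ is a counterfactual "null" action that has no effect on the labeling process), the difference is roughly between

$$\min_a \; L\bigl(a,\, Y(a)\bigr) \qquad \text{and} \qquad \min_a \; L\bigl(a,\, Y(a_\varnothing)\bigr).$$

The first objective can reward tampering with the label (changing $Y$ so that the distance shrinks); the second evaluates the label as if the model had no causal influence on it, which is what I mean by "pretending."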
This seems like a harder inner alignment problem than I thought, because we have to make the training process converge upon a rather unnatural kind of agent. Is this still a feasible inner alignment problem to solve, and if not is there another way to get around this problem?