the Oracle could still hack the grading system to give itself a zero loss
Gradient descent won’t optimize for this behavior, though; it really seems like you want to study this under inner alignment. (It’s hard for me to see how you can meaningfully consider the two problems separately.)
Yes, if the oracle gives itself zero loss by hacking the grading system, then it will stop being updated, but the same is true if the mesa-optimizer tampers with the outer training process in any other way, or just copies itself to a different substrate, or whatever.
Here’s my understanding and elaboration of your first paragraph, to make sure I understand it correctly and to explain it to others who might not:
What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss). We don’t want a mesa-optimizer that tries to minimize the output of the physical grading system (i.e., the computed loss). The latter kind of model will hack the grading system if given a chance, while the former won’t. However, these two models would behave identically in any training episode where reward hacking doesn’t occur, so we can’t distinguish between them without using some kind of inner alignment technique (which might, for example, look at how the models work on the inside rather than just how they behave).
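To make the distinction concrete, here is a toy sketch (my own illustration, with a scalar output and label and a hypothetical grader_hacked flag standing in for "the model tampered with the grading system"): the two objectives rank every output identically in honest episodes and come apart only when tampering is possible.

```python
def actual_loss(output, label):
    # The "real" distance the designer cares about.
    return abs(output - label)

def computed_loss(output, label, grader_hacked):
    # What the physical grading system reports; a hacked grader
    # reports zero regardless of the output.
    return 0.0 if grader_hacked else abs(output - label)

outputs, label = [1.0, 2.0, 5.0], 2.0

# In an ordinary episode (no hacking possible) the two objectives
# agree on every output, so the two models behave identically.
honest_actual = [actual_loss(o, label) for o in outputs]
honest_computed = [computed_loss(o, label, grader_hacked=False) for o in outputs]
assert honest_actual == honest_computed

# Only when tampering is an option do they diverge: hacking zeroes
# the computed loss but leaves the actual loss untouched.
print(computed_loss(5.0, label, grader_hacked=True))  # 0.0 (hacking pays off)
print(actual_loss(5.0, label))                        # 3.0 (hacking gains nothing)
```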
If we can solve this inner alignment problem then it fully addresses TurnTrout’s (Alex Turner’s) concern, because the system would no longer be “looking for ways to hurt you.”
What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss).
Hmm, actually this still doesn’t fully address TurnTrout’s (Alex Turner’s) concern, because this mesa-optimizer could try to minimize the actual distance between its output and the training label by changing the training label (what that means depends on how the training label is defined within its utility function). To do that it would have to break out of the box that it’s in, which may not be possible, but this is still a system that is “looking for ways to hurt you.” It seems that what we really want is a mesa-optimizer that tries to minimize the actual loss while pretending that it has no causal influence on the training label (even if it actually does because there’s a way to break out of its box).
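The "pretends it has no causal influence" objective can be sketched as follows (a rough toy formalization of my own, in the spirit of counterfactual oracles; the label_without_influence argument is a hypothetical stand-in for the label as it would be in a world where the model's output affects nothing): the model is graded against the counterfactual label, so rewriting the real label channel cannot lower its loss.

```python
def counterfactual_loss(output, label_without_influence):
    # Score the output against the label as it would have been had the
    # model had no causal influence on it; tampering with the real
    # label channel therefore gains the model nothing.
    return abs(output - label_without_influence)

label_without_influence = 2.0

# Honest episode: predicting the counterfactual label gives zero loss.
print(counterfactual_loss(2.0, label_without_influence))  # 0.0

# Tampering episode: the model rewrites the *actual* label to 5.0 to
# match its output, but the counterfactual label is unchanged, so the
# objective still penalizes the bad prediction.
tampered_actual_label = 5.0  # irrelevant to this objective
print(counterfactual_loss(5.0, label_without_influence))  # 3.0
```

Whether gradient descent would actually converge on an agent with this objective is exactly the open question raised below.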
This seems like a harder inner alignment problem than I thought, because we have to make the training process converge on a rather unnatural kind of agent. Is this still a feasible inner alignment problem to solve, and if not, is there another way to get around this problem?