If the proposal is “just make the grading system really hard to hack”, that runs into the same problems as boxing—you’re still running a system that is looking for ways to hurt you.
Are you saying that even if we store the Oracle’s output in some memory buffer with minimal processing, and then compute the loss against the independently generated training label using a simple distance metric, the Oracle could still hack the grading system to give itself a zero loss? Hmm, maybe, but doesn’t the risk of that seem a lot smaller than if these precautions are not taken (i.e., if we’re using reinforcement learning, or supervised learning with training labels generated after looking at the Oracle’s output)? Or are these risks actually comparable in magnitude in your mind? Or are you saying that the risk is still unacceptably large and we should reduce it further if we can?
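To make the setup I have in mind concrete, here’s a minimal sketch, assuming hypothetical placeholder functions `oracle` and `generate_label` (neither is an existing API), with squared Euclidean distance as one simple choice of metric:

```python
# Minimal sketch of the setup described above. `oracle` and `generate_label`
# are hypothetical placeholders, not an existing API.

import numpy as np

def train_step(question, oracle, generate_label):
    # 1. Generate the training label before, and independently of,
    #    looking at the Oracle's output.
    label = np.asarray(generate_label(question))

    # 2. Query the Oracle and store its raw output in a memory buffer
    #    with minimal further processing.
    buffer = np.asarray(oracle(question))

    # 3. Compute the loss with a simple, fixed distance metric
    #    (here squared Euclidean distance), rather than with a grader
    #    that inspects or reacts to the Oracle's output.
    loss = float(np.sum((buffer - label) ** 2))
    return loss
```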
the Oracle could still hack the grading system to give itself a zero loss
Gradient descent won’t optimize for this behavior, though; it really seems like you want to study this under inner alignment. (It’s hard for me to see how you can meaningfully consider the problems separately.)
Yes, if the Oracle gives itself zero loss by hacking the grading system then it will stop being updated, but the same is true if the mesa-optimizer tampers with the outer training process in any other way, or just copies itself to a different substrate, or whatever.
Here’s my understanding and elaboration of your first paragraph, to make sure I understand it correctly and to explain it to others who might not:
What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss). We don’t want a mesa-optimizer that tries to minimize the output of the physical grading system (i.e., the computed loss). The latter kind of model will hack the grading system if given a chance, while the former won’t. However, these two models would behave identically in any training episode where reward hacking doesn’t occur, so we can’t distinguish between them without using some kind of inner alignment technique (which might, for example, look at how the models work on the inside rather than just at how they behave).
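A toy sketch of the two objectives, to make the distinction concrete (purely illustrative; `tampered` is just my stand-in for “the model has managed to hack the physical grading system”):

```python
# Toy illustration of the two mesa-objectives described above.

def actual_loss(output, label):
    # The true distance between output and label -- what we want
    # the mesa-optimizer to care about.
    return abs(output - label)

def computed_loss(output, label, tampered=False):
    # What the physical grading system reports -- what the misaligned
    # mesa-optimizer cares about. If the grader has been hacked,
    # it reports zero regardless of the actual distance.
    return 0.0 if tampered else abs(output - label)

# In any episode where no hacking occurs, the two objectives agree,
# so the two kinds of mesa-optimizer behave identically and can't be
# told apart from behavior alone.
assert computed_loss(3.0, 5.0) == actual_loss(3.0, 5.0)

# They only come apart once the grading system is hacked.
assert computed_loss(3.0, 5.0, tampered=True) != actual_loss(3.0, 5.0)
```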
If we can solve this inner alignment problem then it fully addresses TurnTrout’s (Alex Turner’s) concern, because the system would no longer be “looking for ways to hurt you.”
What we want the training process to produce is a mesa-optimizer that tries to minimize the actual distance between its output and the training label (i.e., the actual loss).
Hmm, actually this still doesn’t fully address TurnTrout’s (Alex Turner’s) concern, because this mesa-optimizer could try to minimize the actual distance between its output and the training label by changing the training label (what that means depends on how the training label is defined within its utility function). To do that it would have to break out of the box that it’s in, which may not be possible, but this is still a system that is “looking for ways to hurt you.” It seems that what we really want is a mesa-optimizer that tries to minimize the actual loss while pretending that it has no causal influence on the training label (even if it actually does because there’s a way to break out of its box).
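One way to write down the “pretends it has no causal influence” condition, in my own non-standard notation (roughly in the spirit of counterfactual-oracle proposals):

```latex
% My own notation, not a standard definition.
% d = distance metric, o = the Oracle's output, \ell = the training label,
% \ell_{\mathrm{indep}} = the label as produced by the independent
% label-generating process, regardless of what the Oracle outputs.

% Undesired: minimize expected distance, taking into account whatever causal
% influence the output o has on the label (written \ell(o)), e.g. by breaking
% out of the box and changing the label:
\[
  o^{*} \;=\; \arg\min_{o}\; \mathbb{E}\big[\, d\big(o,\ \ell(o)\big) \,\big]
\]

% Desired: minimize expected distance to the label as if the output had no
% causal influence on it, i.e. under an intervention that fixes the label to
% its independently generated version:
\[
  o^{*} \;=\; \arg\min_{o}\; \mathbb{E}\big[\, d(o,\ \ell) \,\big|\, \mathrm{do}(\ell = \ell_{\mathrm{indep}}) \,\big]
\]
```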
This seems like a harder inner alignment problem than I thought, because we have to make the training process converge upon a rather unnatural kind of agent. Is this still a feasible inner alignment problem to solve, and if not is there another way to get around this problem?