In the case of GPT-2, the “current inference” is the current attempt to predict the next token given some text (this can happen either during training or during evaluation).
In the malign-output scenario above, the system indeed does not “care” about the future; it cares only about the current inference.
Indeed, the system “has no preference for being invoked”. But if it has been invoked and is currently executing, it “wants” to be in a “good invocation”: one in which it ends up with a perfect (i.e., minimal) loss value.
The loss function is computed by comparing the model’s prediction on a training instance to the training label. After training there is no label to compare against, so the loss is undefined. What does it mean for the system to minimize the loss function while generating?
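To illustrate the distinction I have in mind, here is a minimal sketch (illustrative PyTorch code with made-up values, not GPT-2’s actual training pipeline): during training a label from the corpus exists, so a loss value can be computed; during free generation there is nothing to compare the prediction against.

```python
import torch
import torch.nn.functional as F

vocab_size = 50257                    # GPT-2's vocabulary size
logits = torch.randn(vocab_size)      # the model's next-token prediction (dummy values)

# Training step: a label from the corpus exists, so a loss value is well defined.
label = torch.tensor([42])            # the "true" next token (illustrative index)
loss = F.cross_entropy(logits.unsqueeze(0), label)

# Generation step: we only sample from the predicted distribution.
# There is no label here, so no loss value exists for this prediction.
next_token = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1)
```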
Sorry, I didn’t understand the question (in particular, what you meant by “The loss function is undefined after training”).
After thinking about this more, I now think that my original description of this failure mode might be confusing: maybe it is more accurate to describe it as an inner optimizer problem. The guiding logic here is that if there are no inner optimizers, then the question-answering system, which was trained by supervised learning, “attempts” (during inference) to minimize the expected loss as defined by the original distribution from which the training examples were sampled; any other goal system would be the result of inner optimizers.
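To make that guiding logic a bit more precise (my own notation, a sketch rather than anything taken from the actual training setup): if the training examples $(x, y)$ are sampled from a distribution $\mathcal{D}$, the network is $f_\theta$, and the loss is $L$, then supervised training selects something like

$$\theta^* \approx \arg\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[L(f_{\theta}(x), y)\big],$$

and at inference the system simply computes $f_{\theta^*}(x)$ for the given input. The only “goal” implicit in that computation is producing the prediction that would score well under $\mathcal{D}$; any goal system beyond that would have to come from an inner optimizer.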
(I need to think more about this)