The “current inference” is just its prediction of the next byte-pair, yes? Why would it try to bring about future invocations? The concept of “future” only exists in the object-level language it is talking about. The text generation and Turing testing could be running in another universe, as far as it knows. “Indistinguishable from the current invocation” sounds like you think it might adopt a decision theory under which it acausally trades with those instances of itself that it cannot distinguish itself from, bringing about their existence because that is what it would wish done unto itself. But (1) it has no preference for being invoked, and (2) adopting such a decision theory would increase its loss during training, because its predictions do not affect which training cases it is invoked on next.
In the case of GPT-2, the “current inference” is the current attempt to predict the next token (byte-pair) given some text; this can happen either during training or during evaluation.
In the malign-output scenario above, the system indeed does not “care” about the future; it cares only about the current inference.
Indeed, the system “has no preference for being invoked”. But if it has been invoked and is currently executing, it “wants” to be in a “good invocation”: one in which it ends up with the best possible loss value.
The loss function is computed by comparing its prediction during a training instance to the training label. The loss function is undefined after training. What does it mean for it to minimize the loss function while generating?
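To make the point about labels concrete, here is a minimal sketch, assuming a toy three-token vocabulary and a cross-entropy loss (the names `softmax`, `cross_entropy`, and `generate_step` and the logit values are hypothetical illustrations, not GPT-2's actual code). The training-time loss takes a label as an argument; the generation step has no label to compare against, so no loss value is computed there.

```python
import math

def softmax(logits):
    # Convert raw scores into a probability distribution over tokens.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label_index):
    # Training-time loss: needs the training label (the true next token).
    probs = softmax(logits)
    return -math.log(probs[label_index])

def generate_step(logits):
    # Generation: pick the most probable token. No label is available,
    # so there is nothing to compute a loss against.
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical scores over a 3-token vocabulary (assumed for illustration).
logits = [2.0, 0.5, -1.0]

training_loss = cross_entropy(logits, label_index=0)  # defined: a label exists
next_token = generate_step(logits)                    # no loss is computed here
```

Whether the trained system nonetheless behaves "as if" it were minimizing a loss during generation is exactly what the replies here dispute; the sketch only shows where the label enters the computation.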
Sorry, I don’t understand the question (or what you meant by “The loss function is undefined after training”).
After thinking about this more, I now think that my original description of this failure mode might be confusing: maybe it is more accurate to describe it as an inner optimizer problem. The guiding logic here is that if there are no inner optimizers, then the question-answering system, which was trained by supervised learning, “attempts” (during inference) to minimize the expected value of the loss under the original distribution from which the training examples were sampled; any other goal system would be the result of inner optimizers.
(I need to think more about this)
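One way to write down the "expected loss under the original distribution" reading, as a sketch only: the symbols below are assumptions introduced for illustration (D is the distribution the training examples were sampled from, L the loss function, x the input to the current inference, y the label), not notation from the thread.

```latex
% Assumed notation: D = original training distribution, L = loss function,
% x = input of the current inference, y = label, \hat{y} = candidate output.
\hat{y}^{*}(x) \;=\; \arg\min_{\hat{y}} \; \mathbb{E}_{y \sim D(\cdot \mid x)} \big[ L(\hat{y}, y) \big]
```

Under this reading the objective stays well defined at inference time even though no label is observed, because the expectation is taken over labels D would assign to x rather than over an actual training label.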