The weights of the neural network might represent something that corresponds to an implicit model of the world.
Fair enough. I suppose I can’t say “It’s not optimizing the world because it never numerically interacts with a world model.”
the training process produced a goal system such that the neural network yields some malign output
The training process optimizes only for immediate prediction accuracy. How could it possibly act to optimize something else, barring inner optimizers?
There is no reason for the training process to ascribe value to whether the model, being used as part of some chat protocol, would predict words that increase its correspondent’s willingness to talk to it. Such a protocol is only introduced after the model is done training.
It seems to me like you are imagining ghosts in the machine. This is an understandable mistake, as the purpose of the scenario is to deliberately conjure ghosts from the machine at the end. But by default we should then only expect it to happen at the end, when it has a cause!
The training process optimizes only for immediate prediction accuracy.
Not exactly. The best way to minimize the L2 norm of the loss function over the training data is simply to copy the training data into the weights (if there are enough weights) and use some trivial look-up procedure during inference. To get models that are also useful for inputs that are not from the training data, you probably need some form of regularization (or a model that implicitly carries it out), e.g. adding the L2 norm of the weights to the objective function being minimized. Regularization is a way to implement Occam’s razor in machine learning.
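As a rough illustration of the kind of objective meant here (a minimal sketch, assuming a PyTorch-style setup; the model and data below are toy placeholders, not anything from the scenario):

```python
# Sketch: the quantity being minimized is the loss on the training data plus
# an L2 penalty on the weights, which disfavors the "copy the training data
# into the weights" solution.
import torch
import torch.nn.functional as F

def regularized_loss(model, inputs, targets, weight_decay=1e-4):
    data_loss = F.cross_entropy(model(inputs), targets)           # fit to the training data
    l2_penalty = sum(p.pow(2).sum() for p in model.parameters())  # squared L2 norm of the weights
    return data_loss + weight_decay * l2_penalty                  # Occam-style preference for small weights

# Toy usage with a linear model on random data:
model = torch.nn.Linear(10, 3)
x, y = torch.randn(8, 10), torch.randint(0, 3, (8,))
regularized_loss(model, x, y).backward()
```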
Suppose that, due to the regularization, training results in a system with the goal: “minimize the expected value of the loss function at the end of the current inference” (where the concept of probability, which is required to define an expectation, corresponds to how humans interpret the word “probability” in a decision-relevant context). For such a goal system, the malign-output scenario above seems possible (for a sufficiently capable system).
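One way to write that goal down (an illustrative formalization, not wording from the scenario itself): if $x$ is the current input, $P(\cdot \mid x)$ the system’s probability distribution over the correct label $y$, and $\ell$ the training loss, the system outputs

$$\hat{y} \in \arg\min_{a}\; \mathbb{E}_{y \sim P(\cdot \mid x)}\big[\ell(a, y)\big],$$

where the expectation ranges only over the current inference, not over any future invocations.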
The “current inference” is just its predictions about the next byte-pair, yes? Why would it try to bring about future invocations? The concept of “future” only exists in the object-level language it is talking about; the text generation and Turing testing could be running in another universe, as far as it knows.

“Indistinguishable from the current invocation” sounds like you think it might adopt a decision theory that has it acausally trade with those instances of itself that it cannot distinguish itself from, bringing about their existence because that is what it would wish done unto itself. But (1) it has no preference for being invoked, and (2) adopting such a decision theory increases its loss during training, because its predictions do not affect which training cases it is next invoked on.
In the case of GPT-2, the “current inference” is the current attempt to predict the next word given some text (this can happen either during training or during evaluation).
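Concretely, a single such inference is just one forward pass that yields a distribution over the next token (a minimal sketch, assuming a recent version of the Hugging Face transformers package; the prompt is an arbitrary example):

```python
# One "current inference" for GPT-2: a single forward pass producing a
# distribution over the next token, given some text.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer.encode("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(input_ids).logits      # shape: (1, seq_len, vocab_size)
next_token_logits = logits[0, -1]         # scores for the next token only
next_token_id = int(next_token_logits.argmax())
print(tokenizer.decode([next_token_id]))  # the model's next-word prediction
```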
In the malign-output scenario above, the system indeed does not “care” about the future; it cares only about the current inference.
Indeed, the system “has no preference for being invoked”. But if it has been invoked and is currently executing, it “wants” to be in a “good invocation”—one in which it ends up with a perfect loss function value.
The loss function is computed by comparing its prediction during a training instance to the training label. The loss function is undefined after training. What does it mean for it to minimize the loss function while generating?
Sorry, I didn’t understand the question (and what you meant by “The loss function is undefined after training.”).
After thinking about this more, I now think that my original description of this failure mode might be confusing: maybe it is more accurate to describe it as an inner optimizer problem. The guiding logic here is that if there are no inner optimizers, then the question-answering system, which was trained by supervised learning, “attempts” (during inference) to minimize the expected loss function value as defined by the original distribution from which the training examples were sampled; and any other goal system is the result of inner optimizers.
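To make that guiding logic explicit (illustrative notation): with $\mathcal{D}$ the original distribution from which the training examples were sampled and $\ell$ the loss, the claim is that, absent inner optimizers, the trained system behaves during inference approximately as

$$f_\theta \approx \arg\min_{f}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x), y)\big],$$

and any goal system other than this one is attributed to inner optimizers.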
(I need to think more about this)