The original footnote provides one example of this, which is for the model to check if its objective satisfies some criterion, and fail hard if it doesn’t. Now, if the model gets to the point where it’s actually just failing because of this, then gradient descent will probably just remove that check—but the trick is never to actually get there. By having such a check in the first place, the model makes it so that gradient descent won’t actually change its objective, since any change to its objective (keeping all the other parameters fixed, which is what gradient descent does, since it computes partial derivatives) would lead to such a failure.

There is a direction in the gradient which both changes the objective and removes that check. The model doesn’t need to be actually failing for that to happen—there is one piece of its parameters encoding the objective, and another piece encoding the deceptive logic which checks the objective and decides to perform poorly if it doesn’t pass the check. The gradient can update both at the same time: any slight change to the objective can also be applied to the deceptive objective detector, so the model never misbehaves and its objective still gets updated.
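To make the "update both at once" point concrete, here is a minimal numpy sketch. Everything in it is a stand-in of mine, not something from the original discussion: m plays the role of the parameter encoding the objective, r is the reference value the deceptive check compares m against, and the hard failure is modeled as a smooth penalty d * (m - r)^2 on objective drift.

```python
import numpy as np

# Toy stand-in (my construction, not from the original post):
#   m: parameter encoding the model's objective
#   r: reference value the deceptive check compares the objective against
# The base optimizer pushes m toward m_star; the "fail hard" check is
# modeled as a smooth penalty d * (m - r)**2 on objective drift.

m_star = 1.0   # where the base objective wants m to go
d = 10.0       # strength of the deceptive penalty

def loss(theta):
    m, r = theta
    return (m - m_star) ** 2 + d * (m - r) ** 2

def grad(theta):
    m, r = theta
    return np.array([
        2 * (m - m_star) + 2 * d * (m - r),  # d(loss)/dm
        -2 * d * (m - r),                    # d(loss)/dr: pulls r toward m
    ])

theta = np.array([0.0, 0.0])  # objective and reference start in agreement
lr = 0.01
for _ in range(2000):
    theta = theta - lr * grad(theta)  # plain gradient descent on both pieces

m, r = theta
print(f"m = {m:.4f}, r = {r:.4f}, penalty = {d * (m - r) ** 2:.6f}")
# -> m and r both converge to m_star; the penalty never has to fire.
```

In this toy setup the gradient step on r is exactly the "apply the same slight change to the detector" move: r trails m all the way to m_star, the penalty stays near zero throughout, and the check never gets a chance to trigger a hard failure.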
I don’t think this is possible. The directional derivative is the dot product of the gradient and the direction vector. So if the loss decreases along some direction (here, one that changes both the deceptive check and the objective), then at least one term of that dot product is negative, which means the gradient itself is nonzero; gradient descent, which follows the negative gradient, will then decrease the loss at least as fast as moving along that direction would, and we get what we want (a different objective, less deception, or both).
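Spelling that out (standard multivariable calculus, nothing beyond what the comment already invokes): for a loss $L$ and a unit direction $v$,

$$D_v L(\theta) = \nabla L(\theta) \cdot v = \sum_i \frac{\partial L}{\partial \theta_i} v_i,$$

so $D_v L(\theta) < 0$ forces at least one nonzero partial derivative, hence $\nabla L(\theta) \neq 0$; and by Cauchy–Schwarz the steepest-descent direction $v^* = -\nabla L(\theta) / \lVert \nabla L(\theta) \rVert$ achieves $D_{v^*} L(\theta) = -\lVert \nabla L(\theta) \rVert \le D_v L(\theta)$. Gradient descent therefore decreases the loss at least as fast as moving along $v$ would.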