Imagine you’re given a network that has two identical submodules, and some kind of combining function such that if it detects the outputs from both submodules are the same it passes the value through, but if they’re different it fails and outputs zero (or something totally random). This is a natural idea to come up with if your goal is to ensure the optimizer doesn’t mess with these modules, for example if you’re trying to protect the mesa-objective encoding for gradient hacking.
I think this is a wrong interpretation of the idea that I described in this comment (which your linked comment here is a reply to). There need not be a “dedicated” piece of logic that does nothing other than check whether the outputs from two subnetworks satisfy some condition and make the model “fail” otherwise. Having such a dedicated piece of logic wouldn’t work because SGD would just remove it. Instead, suppose that the model depends on two different computations, X and Y, for the purpose of minimizing its loss. Now suppose there are two malicious pieces of logic, one within the subnetwork that computes X and one within the subnetwork that computes Y. If a certain condition on the input of the entire network is satisfied, the malicious logic pieces make both X and Y fail. Even so, the gradient components of the weights associated with the malicious pieces of logic are close to zero (putting aside regularization), because changing any single weight has almost no effect on the loss.
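A minimal sketch of this coupling, assuming a PyTorch-style autodiff setup (the toy loss, weights, and sharpness below are purely illustrative assumptions, not from the original comment): the loss is low only when both “subnetwork” weights are switched on, so with both switched off, each per-weight gradient is roughly zero even though flipping both weights together would lower the loss.

```python
# Illustrative toy only: two scalar weights stand in for the X- and Y-subnetworks.
# The loss is low only if BOTH "work", so with both failing, the gradient on each
# individual weight is ~0.
import torch

def loss(w1, w2, sharpness=20.0):
    works_x = torch.sigmoid(sharpness * w1)  # "computation X works"
    works_y = torch.sigmoid(sharpness * w2)  # "computation Y works"
    return 1.0 - works_x * works_y           # low loss only when both work

w1 = torch.tensor(-1.0, requires_grad=True)  # X currently fails
w2 = torch.tensor(-1.0, requires_grad=True)  # Y currently fails

L = loss(w1, w2)
L.backward()
print(L.item())        # ~1.0: the model is failing
print(w1.grad.item())  # ~0: nudging w1 alone barely helps, since Y still fails
print(w2.grad.item())  # ~0: nudging w2 alone barely helps, since X still fails

# Flipping BOTH weights would fix the failure, but no single-weight gradient says so.
print(loss(torch.tensor(1.0), torch.tensor(1.0)).item())  # ~0.0
```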
> Having such a dedicated piece of logic wouldn’t work because SGD would just remove it.
My formulation is broad enough that it doesn’t have to be a dedicated piece of logic; there just has to be some way in which the rest of the network depends on X and Y being the same.
> because changing any single weight has almost no effect on the loss.
This is what I take issue with: if there is a way to change both components simultaneously to have an effect on the loss, SGD will happily do that.
> My formulation is broad enough that it doesn’t have to be a dedicated piece of logic; there just has to be some way in which the rest of the network depends on X and Y being the same.
But X and Y are not the same! For example, if the model is intended to classify images of animals, the computation X may correspond to [how many legs does the animal have?] and Y may correspond to [how large is the animal?].
> This is what I take issue with: if there is a way to change both components simultaneously to have an effect on the loss, SGD will happily do that.
This seems wrong to me. SGD updates the weights in the direction of the negative gradient, and if changing a given weight alone does not affect the loss, then the gradient component associated with that weight will be 0 and thus SGD will not change that weight.
> SGD updates the weights in the direction of the negative gradient, and if changing a given weight alone does not affect the loss, then the gradient component associated with that weight will be 0 and thus SGD will not change that weight.
If the partial derivative of the loss with respect to two different parameters is zero, i.e. $\frac{\partial f}{\partial \theta_1} = \frac{\partial f}{\partial \theta_2} = 0$, then it must be that changing both simultaneously does not change the loss either (to be precise, $\lim_{h \to 0} \frac{f(x + h(\hat{\theta}_1 + \hat{\theta}_2)) - f(x)}{h} = 0$, where $\hat{\theta}_1, \hat{\theta}_2$ are the unit vectors along the two parameter axes).
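Spelling this step out (not in the original comment, but it is the standard multivariable-calculus fact being invoked): for a differentiable $f$, the derivative along a direction is the corresponding linear combination of the partial derivatives, so

$$D_{\hat{\theta}_1 + \hat{\theta}_2} f(x) \;=\; \nabla f(x) \cdot (\hat{\theta}_1 + \hat{\theta}_2) \;=\; \frac{\partial f}{\partial \theta_1}(x) + \frac{\partial f}{\partial \theta_2}(x) \;=\; 0 + 0 \;=\; 0,$$

i.e. to first order, moving both parameters together leaves the loss unchanged whenever both partial derivatives vanish at that point.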
I don’t see how this is relevant here. If it is the case that changing only $w_1$ does not affect the loss, and changing only $w_2$ does not affect the loss, then SGD would not change them (their gradient components will be zero), even if changing them both can affect the loss.
It’s relevant because it demonstrates that for differentiable functions, if it is the case that changing only $w_1$ does not affect the loss, and changing only $w_2$ does not affect the loss, then, to first order, changing them both cannot affect the loss either.
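As a purely illustrative numerical check of the “SGD will not change that weight” point above, here is one explicit SGD step on the same toy loss sketched earlier (the loss and learning rate are assumptions for illustration, not from the discussion): because both gradient components are essentially zero at this point, the update leaves both weights where they were.

```python
# Continuing the illustrative toy from above: take one SGD step and observe that
# neither weight moves appreciably, since each gradient component is ~0.
import torch

def loss(w1, w2, sharpness=20.0):
    return 1.0 - torch.sigmoid(sharpness * w1) * torch.sigmoid(sharpness * w2)

w1 = torch.tensor(-1.0, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)
opt = torch.optim.SGD([w1, w2], lr=0.1)

opt.zero_grad()
loss(w1, w2).backward()
opt.step()

print(w1.item(), w2.item())  # both still ~-1.0: the step changed nothing measurable
```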