But if the agent is repeatedly carrying out its commitment to fail, then there’ll be pretty strong pressure from gradient descent to change that. What changes might that pressure lead to? The two most salient options to me:
The agent’s commitment to carrying out gradient hacking is reduced.
The agent’s ability to notice changes implemented by gradient descent is reduced.
In a gradient hacking scenario, we should expect the malicious conditionally-fail-on-purpose logic to be optimized so that such outcomes do not occur. For example, the malicious logic may involve redundancy: suppose there are two different conditionally-fail-on-purpose logic pieces in the network, such that each one independently makes the model fail if the x-component is large. Due to the redundancy, a potential failure should have almost no influence on the gradient components associated with the weights of the malicious logic pieces. (This is similar to the idea in this comment from a previous discussion we had.)
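To make the redundancy argument concrete, here is a minimal JAX sketch. Everything in it is an illustrative assumption rather than anything from the discussion: a one-dimensional "trigger" activation, soft sigmoid gates, and names like `gate`, `slope`, and `threshold` are all mine. The point it demonstrates is that when either gate alone is enough to corrupt the output, the gradient of the loss with respect to one gate's own parameters is strongly suppressed compared to the single-gate case.

```python
import jax
import jax.numpy as jnp

def gate(params, h):
    # Soft "conditionally fail on purpose" gate: close to 1 once the
    # trigger activation h exceeds this gate's threshold.
    slope, threshold = params[0], params[1]
    return jax.nn.sigmoid(slope * (h - threshold))

def loss(gate1_params, gate2_params, h):
    # The model could output the correct answer (1.0), but each active
    # gate multiplies that output toward 0, i.e. causes a failure.
    g1 = gate(gate1_params, h)
    g2 = gate(gate2_params, h)
    y_out = (1.0 - g1) * (1.0 - g2)
    return (y_out - 1.0) ** 2

h = 2.0                                  # trigger activation ("x-component is large")
gate1 = jnp.array([4.0, 1.0])            # (slope, threshold)
gate2 = jnp.array([4.0, 1.0])
gate2_ablated = jnp.array([4.0, 100.0])  # threshold never reached, gate stays off

# Both gates active: the gradient on gate 1's own parameters is suppressed
# by roughly a factor of (1 - g2), because gate 2 alone already forces the
# failure, so nudging gate 1 barely changes the loss.
print(jax.grad(loss, argnums=0)(gate1, gate2, h))

# Gate 2 ablated: gate 1's parameters feel much stronger pressure.
print(jax.grad(loss, argnums=0)(gate1, gate2_ablated, h))
```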
What mechanism would ensure that these two logic pieces only fire at the same time? Whatever it is, I expect that mechanism to be changed in response to failures.
The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In that case, each piece of logic "ruins" a different critical activation value.)
In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
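Here is a self-contained variant of the earlier sketch illustrating the pressure this reply points at (again, the one-dimensional setup and every name in it, such as `w_up`, are illustrative assumptions): the shared trigger activation is now computed from an upstream weight. In this particular toy the size of the pressure still depends on how saturated the gates are, but its direction matches the reply: shrink the weights that compute the trigger activation rather than the gates themselves.

```python
import jax
import jax.numpy as jnp

def gate(params, h):
    slope, threshold = params[0], params[1]
    return jax.nn.sigmoid(slope * (h - threshold))

def loss(w_up, gate1_params, gate2_params, x):
    h = w_up * x                     # shared trigger activation, computed upstream
    g1 = gate(gate1_params, h)
    g2 = gate(gate2_params, h)
    y_out = (1.0 - g1) * (1.0 - g2)  # either active gate corrupts the output
    return (y_out - 1.0) ** 2

x = 1.0
w_up = 2.0                           # upstream weight producing the trigger activation
gate1 = jnp.array([4.0, 1.0])
gate2 = jnp.array([4.0, 1.0])

# The gradient on w_up is positive whenever either gate is partially firing,
# so repeated gradient steps shrink w_up, pushing the shared trigger
# activation toward (and, with enough steps in this toy, below) both
# thresholds, at which point neither gate fires.
print(jax.grad(loss, argnums=0)(w_up, gate1, gate2, x))
```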
I confess I’m confused as to how the network gradient-hacks its way into creating these redundant mechanisms in the first place, since one of them seems completely useless unless it was designed deliberately.
Suppose that each subnetwork does general reasoning, and thus, up until some point during training, each subnetwork is useful for minimizing loss.
Are you saying that such a mechanism occurs by coincidence, or that it’s actively constructed? It seems like, at every intermediate step, all consumers of the almost-identical subnetworks would naturally just pick one and use that one, since it was slightly closer to what the consumer needed.