But if the agent is repeatedly carrying out its commitment to fail, then there’ll be pretty strong pressure from gradient descent to change that. What changes might that pressure lead to? The two most salient options to me:
The agent’s commitment to carrying out gradient hacking is reduced.
The agent’s ability to notice changes implemented by gradient descent is reduced.
In a gradient hacking scenario, we should expect the malicious conditionally-fail-on-purpose logic to be optimized so that such outcomes do not occur. For example, the malicious logic may involve redundancy: suppose there are two different conditionally-fail-on-purpose logic pieces in the network, such that each one independently makes the model fail if the x-component is large. Due to the redundancy, a potential failure should have almost no influence on the gradient components associated with the weights of the malicious logic pieces. (This is similar to the idea in this comment from a previous discussion we had.)
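To make the redundancy argument concrete, here is a minimal JAX sketch. Everything in it is an illustrative assumption rather than anything from the discussion: a one-dimensional "trigger" activation, soft sigmoid gates, and names like `gate`, `slope`, and `threshold` are all mine. The point it demonstrates is that when either gate alone is enough to corrupt the output, the gradient of the loss with respect to one gate's own parameters is strongly suppressed compared to the single-gate case.

```python
import jax
import jax.numpy as jnp

def gate(params, h):
    # Soft "conditionally fail on purpose" gate: close to 1 once the
    # trigger activation h exceeds this gate's threshold.
    slope, threshold = params[0], params[1]
    return jax.nn.sigmoid(slope * (h - threshold))

def loss(gate1_params, gate2_params, h):
    # The model could output the correct answer (1.0), but each active
    # gate multiplies that output toward 0, i.e. causes a failure.
    g1 = gate(gate1_params, h)
    g2 = gate(gate2_params, h)
    y_out = (1.0 - g1) * (1.0 - g2)
    return (y_out - 1.0) ** 2

h = 2.0                                  # trigger activation ("x-component is large")
gate1 = jnp.array([4.0, 1.0])            # (slope, threshold)
gate2 = jnp.array([4.0, 1.0])
gate2_ablated = jnp.array([4.0, 100.0])  # threshold never reached, gate stays off

# Both gates active: the gradient on gate 1's own parameters is suppressed
# by roughly a factor of (1 - g2), because gate 2 alone already forces the
# failure, so nudging gate 1 barely changes the loss.
print(jax.grad(loss, argnums=0)(gate1, gate2, h))

# Gate 2 ablated: gate 1's parameters feel much stronger pressure.
print(jax.grad(loss, argnums=0)(gate1, gate2_ablated, h))
```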
What mechanism would ensure that these two logic pieces only fire at the same time? Whatever it is, I expect that mechanism to be changed in response to failures.
The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In that case, each piece of logic "ruins" a different critical activation value.)
In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
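Here is a self-contained variant of the earlier sketch illustrating the pressure this reply points at (again, the one-dimensional setup and every name in it, such as `w_up`, are illustrative assumptions): the shared trigger activation is now computed from an upstream weight. In this particular toy the size of the pressure still depends on how saturated the gates are, but its direction matches the reply: shrink the weights that compute the trigger activation rather than the gates themselves.

```python
import jax
import jax.numpy as jnp

def gate(params, h):
    slope, threshold = params[0], params[1]
    return jax.nn.sigmoid(slope * (h - threshold))

def loss(w_up, gate1_params, gate2_params, x):
    h = w_up * x                     # shared trigger activation, computed upstream
    g1 = gate(gate1_params, h)
    g2 = gate(gate2_params, h)
    y_out = (1.0 - g1) * (1.0 - g2)  # either active gate corrupts the output
    return (y_out - 1.0) ** 2

x = 1.0
w_up = 2.0                           # upstream weight producing the trigger activation
gate1 = jnp.array([4.0, 1.0])
gate2 = jnp.array([4.0, 1.0])

# The gradient on w_up is positive whenever either gate is partially firing,
# so repeated gradient steps shrink w_up, pushing the shared trigger
# activation toward (and, with enough steps in this toy, below) both
# thresholds, at which point neither gate fires.
print(jax.grad(loss, argnums=0)(w_up, gate1, gate2, x))
```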
I confess I’m confused as to how the network gradient-hacks its way into creating these redundant mechanisms in the first place, since one of them seems completely useless unless it was designed deliberately.
Suppose that each subnetwork does general reasoning, and thus, up until some point during training, each subnetwork is useful for minimizing loss.
Are you saying that such a mechanism occurs by coincidence, or that it’s actively constructed? It seems like, at every intermediate step, all consumers of the almost-identical subnetworks would naturally just pick one and use that one, since it was slightly closer to what the consumer needed.