> The two pieces of logic can use the same activation values as their input. For example, suppose they both (independently) cause failure if a certain activation value is above some threshold. (In that case, each piece of logic “ruins” a different critical activation value.)
In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
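To make that dynamic concrete, here is a minimal sketch, assuming a toy model in which a single activation `a = w·x` feeds two redundant threshold-triggered penalties. Everything here (`w`, `threshold`, the ReLU-shaped stand-in for “causing failure”) is an illustrative assumption, not part of the setup under discussion:

```python
import torch

torch.manual_seed(0)

# Hypothetical toy model: one activation feeds two redundant
# "cause failure if above a threshold" mechanisms. "Failure" is
# modeled as a differentiable penalty that turns on above the threshold.
x = torch.randn(8)                  # fixed input
w = x.clone().requires_grad_(True)  # weights computing the activation,
                                    # initialized so it starts above threshold
threshold = 0.0

def loss_fn(w):
    a = w @ x  # the shared critical activation
    # Two independent copies of the same failure mechanism, each
    # contributing its own penalty once `a` exceeds the threshold.
    fail_1 = torch.relu(a - threshold)
    fail_2 = torch.relu(a - threshold)
    return fail_1 + fail_2

opt = torch.optim.SGD([w], lr=0.1)
for _ in range(50):
    opt.zero_grad()
    loss_fn(w).backward()
    opt.step()

# Gradient descent never has to "disable" either mechanism separately:
# it simply shrinks the weights producing the shared activation until
# that activation sits at or below the threshold.
print((w @ x).item())
```

Because both mechanisms read the same trigger activation, the gradients they produce point the same way, and shrinking that one activation defuses both of them at once.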
> I confess I’m confused as to how the network gradient-hacks its way to creating these redundant mechanisms in the first place, since one of them seems completely useless unless designed deliberately.
Suppose that each subnetwork does general reasoning, and thus, up until some point during training, both subnetworks are useful for minimizing loss.
> Are you saying that such a mechanism occurs by coincidence, or that it’s actively constructed? It seems like, at all the intermediate steps, every consumer of the almost-identical subnetworks would naturally just pick one and use it, since that one was slightly closer to what the consumer needed.
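That intuition is easy to probe in a convex toy model. The sketch below assumes two near-duplicate subnetwork outputs `f1` and `f2` and a single linear “consumer” readout; all of these names and the setup are illustrative assumptions rather than anything from the discussion above:

```python
import torch

torch.manual_seed(0)

# Hypothetical toy: f2 is a near-duplicate of f1 (f1 plus small noise),
# so f1 is slightly closer to what the consumer needs.
n = 1000
x = torch.randn(n)
f1 = torch.sin(x)
f2 = f1 + 0.1 * torch.randn(n)  # almost-identical subnetwork output
target = f1                     # the consumer "needs" exactly f1

# The consumer is a linear readout over the two duplicates.
c = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([c], lr=0.5)
for _ in range(2000):
    opt.zero_grad()
    pred = c[0] * f1 + c[1] * f2
    torch.mean((pred - target) ** 2).backward()
    opt.step()

# Up to sampling noise the optimum is c = (1, 0): any weight on the
# noisier duplicate only adds error, so the readout concentrates on
# the closer subnetwork instead of splitting between the two.
print(c.detach())
```

In this single-consumer toy the readout does pick the closer duplicate; whether that still holds once both subnetworks remain useful to different downstream consumers is exactly the question being asked.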