Richard_Ngo comments on Gradient hacking

Richard_Ngo 30 Jan 2020 11:31 UTC
LW: 4 AF: 2
I wasn’t claiming that there’ll be an explicit OR gate, just something functionally equivalent to it. To take a simple case, imagine that the two subnetworks each output a real number, and that these are multiplied together to get a final output, which we can interpret as the agent’s reward (there’d need to be some further module which chooses behaviours in order to get that much reward, but let’s ignore it for now). Each subnetwork’s output measures how much that subnetwork thinks the agent’s original goal has been preserved. Suppose that normally both subnetworks output 1, and then they switch to outputting 0 when they think they’ve passed the threshold of corruption. Since the product is 0 whenever either output is 0, the multiplication acts as an OR over the two “corruption detected” signals, and the agent gets 0 reward.
I agree that, at this point, there’s no gradient signal to change the subnetworks. My points are that:
1. There’s still a gradient signal to change the OR gate (in this case, the implementation of multiplication).
2. Consider how they got to the point of outputting 0. They must have been decreasing from 1 as the overall network changed. So as the network changed, and they started producing outputs less than 1, there’d be pressure to modify them.
3. The point above isn’t true if the subnetworks go from 1 to 0 within one gradient step. In that case, the network will likely either bounce back and forth across the threshold (eroding the OR gate every time it does so) or else remain very close to the threshold (since there’s no penalty for doing so). But since the transition from 1 to 0 needs to be continuous at *some* resolution, staying very *very* close to the threshold will produce subnetwork output somewhere between 0 and 1, which creates pressure for the subnetworks to be less accurate.
4. It’s non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain? (Independence is very tricky because they’re part of the same plan, and so a change in your underlying motivation to pursue that plan affects both). Ones which are so sensitive to your motivations that they can go from 1 to 0 within the space of a single gradient update?
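To make the multiplication case in points 2 and 3 concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the names `product_gate`, `product_gate_grads` and `soft_detector`, the sigmoid shape of the detector, and the `threshold` and `steepness` values. It is a toy, not a model of how a real network would implement any of this.

```python
import math


def product_gate(a, b):
    """Final output: the two subnetwork outputs multiplied together."""
    return a * b


def product_gate_grads(a, b):
    """Analytic partial derivatives of a * b with respect to a and b."""
    return b, a


def soft_detector(corruption, threshold=0.5, steepness=50.0):
    """Hypothetical subnetwork: close to 1 below the threshold, close to 0 above.

    The sigmoid is an assumed stand-in for "continuous at some resolution".
    """
    return 1.0 / (1.0 + math.exp(steepness * (corruption - threshold)))


# Gradient of the product with respect to each subnetwork output.
for a, b in [(1.0, 1.0), (0.5, 0.5), (0.0, 0.0)]:
    print("outputs", (a, b), "product", product_gate(a, b),
          "grads", product_gate_grads(a, b))

# Detector output as the (assumed) corruption measure crosses the threshold.
for c in [0.0, 0.49, 0.5, 0.51, 1.0]:
    print("corruption", c, "detector output", soft_detector(c))
```

When both outputs are exactly 0, the derivative of the product with respect to each output is 0, which matches the "no gradient signal" situation. At intermediate values such as 0.5 the derivatives are nonzero, and a detector with any finite steepness does pass through such values near the threshold.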
To be honest, this is all incredibly speculative, so please interpret all of the above with the disclaimer that it’s probably false or nonsensical for reasons I haven’t thought of yet.
An intuition I’m drawing on here: https://lamport.azurewebsites.net/pubs/buridan.pdf
- gwern 20 Apr 2022 3:54 UTC
  LW: 5 AF: 3
  https://arxiv.org/abs/2204.06974 presents an OR-like construction which shields from gradients too, apparently, which might be of interest.
- Ofer 1 Feb 2020 7:09 UTC
  LW: 5 AF: 3
  
  > I wasn’t claiming that there’ll be an explicit OR gate, just something functionally equivalent to it.
  
  Sure, we’re on the same page here. I think by “There’s still a gradient signal to change the OR gate” you mean exactly what I meant when I said “that would just be passing the buck to the output of that OR”.
  
  I’m not sure I understand 2 and 3. The activations are in practice discrete (e.g. represented by 32 bits), and so the subnetworks can be designed such that they never output values within the range $[0.01, 0.99]$ (if that’s important/useful for the mechanism to work).
  
  > It’s non-obvious that agents will have anywhere near enough control over their internal functioning to set up such systems. Have you ever tried implementing two novel independent identical submodules in your brain?
  
  Humans can’t control their brains at the level of abstraction of neurons by thinking alone, but at a higher level of abstraction they do have some control that can be useful. For example, consider a human in Newcomb’s problem who decides to one-box. Arguably, they reason in a certain way in order to make their brain have a certain property (namely, being a brain that decides to one-box in Newcomb’s problem).
  
  > (Independence is very tricky because they’re part of the same plan, and so a change in your underlying motivation to pursue that plan affects both).
  
  Perhaps I shouldn’t have used the word “independent”; I just meant that the output of one subnetwork does not affect the output of the other (during any given inference).