leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients
This seems slightly confusing/unclear: I’m not imagining penalizing the model for trying to hack the gradients; I’m imagining changing the loss in a way that blocks the attempted gradient hacking. E.g. the model knows that parameters θ are in the direction of more aligned models, and it could hijack the training process by ensuring that θ gets a high loss. So it tries to behave badly when its own parameters are θ, trying to prevent gradient descent from converging to the much better parameters θ∗. But then the overseer knows that it wants training to move towards parameters θ, in order to ultimately reach the much better parameters θ∗, so it assigns θ a low loss (rather than being fooled by the behavioral heuristic, which would give θ a high loss).
In this case no model is getting penalized for anything, and in particular we aren’t trying to stop gradient descent from reaching parameters that try to gradient hack. We are just stepping in when the malicious model tries to “outsmart” the behavioral heuristic in order to manipulate training.
Ah, I see, that makes sense. I had in fact misunderstood what you were saying here.
That being said, why not penalize the model when it engages in gradient hacking, so that it becomes less likely to do it in the future?
It’s not as clear to me that epistemic competitiveness + a normal training loop suffices to penalize gradient hacking. Would need to think about it more. Whereas I feel reasonably comfortable saying that if the model is able to alter its behavior because it believes that it will make SGD work worse, then a competitive overseer is able to use the same information to make SGD work better. (Though I can’t be that comfortable about anything given how shaky the abstractions are etc., and mostly expect to revisit with a clearer sense of what epistemic competitiveness means.)
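To make the θ / θ∗ scenario in this exchange concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the one-dimensional parameter, the quadratic “true” loss, the Gaussian bump standing in for “the model deliberately behaves badly near θ”, and all the constants. The point is only the shape of the intervention: the behaviorally measured loss is inflated near θ, so plain gradient descent stalls before reaching θ∗, while an overseer that recognizes the manipulation assigns θ its true (low) loss and the same descent goes through.

```python
# Toy illustration only; none of this comes from the exchange above. The 1-D
# parameter, the quadratic "true" loss, the Gaussian "sabotage" bump, and all
# constants are assumptions made up for this sketch.

import math

THETA_STAR = 2.0   # the "much better parameters" theta* training should reach
THETA_GOOD = 1.0   # intermediate parameters theta in the aligned direction

def true_loss(theta):
    """The loss the overseer actually cares about; minimized at THETA_STAR."""
    return (theta - THETA_STAR) ** 2

def behavioral_loss(theta):
    """What a naive behavioral heuristic measures: the gradient-hacking model
    deliberately behaves badly near THETA_GOOD, so the measured loss is
    inflated there, creating a barrier on the path from theta = 0 to THETA_STAR."""
    sabotage = 5.0 * math.exp(-((theta - THETA_GOOD) / 0.3) ** 2)
    return true_loss(theta) + sabotage

def overseer_loss(theta):
    """An overseer that recognizes the bad behavior near THETA_GOOD as
    deliberate manipulation assigns theta its low (true) loss instead of
    being fooled by the behavioral heuristic."""
    return true_loss(theta)

def gradient_descent(loss_fn, theta=0.0, lr=0.02, steps=3000, eps=1e-5):
    """Plain gradient descent with a central-difference gradient estimate."""
    for _ in range(steps):
        grad = (loss_fn(theta + eps) - loss_fn(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

if __name__ == "__main__":
    hacked = gradient_descent(behavioral_loss)  # stalls before THETA_GOOD
    fixed = gradient_descent(overseer_loss)     # moves through theta to theta*
    print(f"naive behavioral loss:   theta ends near {hacked:.2f}")
    print(f"overseer-corrected loss: theta ends near {fixed:.2f} (theta* = {THETA_STAR})")
```

This obviously compresses “the model behaves badly when its parameters are θ” into a hand-written bump in a one-dimensional loss; it is only meant to show the shape of the overseer’s correction, not how the manipulation would actually be detected.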