leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients
This seems slightly confusing/unclear: I’m not imagining penalizing the model for trying to hack the gradients; I’m imagining changing the loss in a way that blocks the attempted gradient hacking. E.g. the model knows that parameters θ are in the direction of more aligned models, and it could hijack the training process by ensuring that θ gets a high loss. So it tries to behave badly when its own parameters are θ, trying to prevent gradient descent from converging to the much better parameters θ∗. But then the overseer knows that it wants training to move towards parameters θ, in order to ultimately reach the much better parameters θ∗, so it assigns θ a low loss (rather than being fooled by the behavioral heuristic, which would give θ a high loss).
In this case no model is getting penalized for anything, and in particular we aren’t trying to stop gradient descent from reaching parameters that try to gradient hack. We are just stepping in when the malicious model tries to “outsmart” the behavioral heuristic in order to manipulate training.
Ah, I see, that makes sense. I had in fact misunderstood what you were saying here.
That being said, why not penalize the model when it engages in gradient hacking, so that it becomes less likely to do it in the future?
It’s not as clear to me that epistemic competitiveness + a normal training loop suffices to penalize gradient hacking. Would need to think about it more. Whereas I feel reasonably comfortable saying that if the model is able to alter its behavior because it believes that it will make SGD work worse, then a competitive overseer is able to use the same information to make SGD work better. (Though I can’t be that comfortable about anything given how shaky the abstractions are etc., and mostly expect to revisit with a clearer sense of what epistemic competitiveness means.)
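To make the θ / θ∗ scenario in this exchange concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the one-dimensional parameter, the quadratic “true” loss, the Gaussian bump standing in for “the model deliberately behaves badly near θ”, and all the constants. The point is only the shape of the intervention: the behaviorally measured loss is inflated near θ, so plain gradient descent stalls before reaching θ∗, while an overseer that recognizes the manipulation assigns θ its true (low) loss and the same descent goes through.

```python
# Toy illustration only; none of this comes from the exchange above. The 1-D
# parameter, the quadratic "true" loss, the Gaussian "sabotage" bump, and all
# constants are assumptions made up for this sketch.

import math

THETA_STAR = 2.0   # the "much better parameters" theta* training should reach
THETA_GOOD = 1.0   # intermediate parameters theta in the aligned direction

def true_loss(theta):
    """The loss the overseer actually cares about; minimized at THETA_STAR."""
    return (theta - THETA_STAR) ** 2

def behavioral_loss(theta):
    """What a naive behavioral heuristic measures: the gradient-hacking model
    deliberately behaves badly near THETA_GOOD, so the measured loss is
    inflated there, creating a barrier on the path from theta = 0 to THETA_STAR."""
    sabotage = 5.0 * math.exp(-((theta - THETA_GOOD) / 0.3) ** 2)
    return true_loss(theta) + sabotage

def overseer_loss(theta):
    """An overseer that recognizes the bad behavior near THETA_GOOD as
    deliberate manipulation assigns theta its low (true) loss instead of
    being fooled by the behavioral heuristic."""
    return true_loss(theta)

def gradient_descent(loss_fn, theta=0.0, lr=0.02, steps=3000, eps=1e-5):
    """Plain gradient descent with a central-difference gradient estimate."""
    for _ in range(steps):
        grad = (loss_fn(theta + eps) - loss_fn(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

if __name__ == "__main__":
    hacked = gradient_descent(behavioral_loss)  # stalls before THETA_GOOD
    fixed = gradient_descent(overseer_loss)     # moves through theta to theta*
    print(f"naive behavioral loss:   theta ends near {hacked:.2f}")
    print(f"overseer-corrected loss: theta ends near {fixed:.2f} (theta* = {THETA_STAR})")
```

This obviously compresses “the model behaves badly when its parameters are θ” into a hand-written bump in a one-dimensional loss; it is only meant to show the shape of the overseer’s correction, not how the manipulation would actually be detected.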