Planned summary for the Alignment Newsletter:

The author’s goal is to find “mundane” or simple algorithms that solve even “exotic” problems in AI alignment. Why should we expect this to be possible? If an AI system is using powerful, exotic capabilities to evade detection, shouldn’t we need powerful, exotic algorithms to fight that? The key idea is that a mundane algorithm can instead leverage the exotic capabilities of the AI system itself to produce an exotic oversight process. For example, a mundane algorithm might be used to create a question-answerer that knows everything the model knows. We could then address <@gradient hacking@>(@Gradient hacking@) by asking the question “what should the loss be?” In this case, our model has an exotic capability: very strong introspective access to its own reasoning and to the training process that modifies it. (This is what it needs in order to successfully hack gradients.) As a result, our question-answerer should be able to leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients, even if our normal hardcoded loss would not do so.
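To make this concrete, here is a toy sketch (not from the original post; `Model`, `QuestionAnswerer`, and the `planning_to_hack_gradients` flag are hypothetical placeholders, and finite differences stand in for real backpropagation). A mundane training loop asks a question-answerer that shares the model's knowledge what the loss should be, and the question-answerer uses that knowledge to add loss whenever the model is trying to manipulate training.

```python
from dataclasses import dataclass

@dataclass
class Model:
    """Toy stand-in for the AI system; a real model would have a full
    parameter vector rather than a single float."""
    params: float
    planning_to_hack_gradients: bool = False

def hardcoded_loss(model: Model, batch: list) -> float:
    """The normal, mundane loss: it only sees the model's behavior on the batch."""
    return sum((model.params - x) ** 2 for x in batch) / len(batch)

class QuestionAnswerer:
    """Hypothetical overseer that knows everything the model knows,
    including any plan to manipulate its own training."""
    def __init__(self, model: Model):
        self.model = model

    def what_should_the_loss_be(self, batch: list) -> float:
        loss = hardcoded_loss(self.model, batch)
        # The model's exotic capability (introspective access to its own plans)
        # is reused here by a perfectly mundane check.
        if self.model.planning_to_hack_gradients:
            loss += 10.0  # arbitrary extra loss for the manipulation attempt
        return loss

def train_step(model: Model, batch: list, lr: float = 0.1, eps: float = 1e-4) -> float:
    """Mundane outer loop: it just descends on whatever loss the
    question-answerer reports (finite differences stand in for backprop)."""
    qa = QuestionAnswerer(model)
    loss = qa.what_should_the_loss_be(batch)
    model.params += eps
    grad = (qa.what_should_the_loss_be(batch) - loss) / eps
    model.params -= eps
    model.params -= lr * grad
    return loss

if __name__ == "__main__":
    model = Model(params=0.0, planning_to_hack_gradients=True)
    for _ in range(3):
        print(train_step(model, batch=[1.0, 2.0, 3.0]))
```

Nothing in the outer loop is exotic; the exotic part is entirely what the question-answerer knows about the model.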
> leverage this capability to assign high loss (low reward) to cases where our AI system tries to hack gradients
This seems slightly confusing/unclear: I’m not imagining penalizing the model for trying to hack the gradients; I’m imagining changing the loss in a way that blocks the attempted gradient hacking. E.g., the model knows that parameters θ are in the direction of more aligned models, and it could hijack the training process by ensuring that θ gets a high loss. So it tries to behave badly when its own parameters are θ, in order to prevent gradient descent from converging to the much better parameters θ∗. But the overseer knows that training should move towards parameters θ in order to ultimately reach the much better parameters θ∗, so it assigns θ a low loss (rather than being fooled into giving θ a high loss by the behavioral heuristic).
In this case no model is getting penalized for anything, and in particular we aren’t trying to stop gradient descent from reaching parameters that try to gradient hack. We are just stepping in when the malicious model tries to “outsmart” the behavioral heuristic in order to manipulate training.
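A toy sketch of this scenario (not from the original thread; `THETA`, `THETA_STAR`, and the particular loss shapes below are made up for illustration): the model sandbags whenever its parameters are near θ, putting a barrier in the behavioral loss that gradient descent cannot cross. The overseer, drawing on the same knowledge the model used to plan this, declines to count the deliberate misbehavior, so descent on its loss passes through θ and reaches θ∗.

```python
import math

# Made-up constants for illustration: a waypoint THETA that the malicious model
# wants training to avoid, and much better (more aligned) parameters THETA_STAR
# that lie beyond it.
THETA = 1.0
THETA_STAR = 2.0

def behavioral_loss(p: float) -> float:
    """What a hardcoded loss sees: the model deliberately performs badly
    whenever its parameters are near THETA, creating a barrier that keeps
    gradient descent from ever passing through on the way to THETA_STAR."""
    honest_loss = (p - THETA_STAR) ** 2
    sandbagging = 5.0 * math.exp(-((p - THETA) / 0.2) ** 2)
    return honest_loss + sandbagging

def overseer_loss(p: float) -> float:
    """The overseer shares the model's knowledge, so it recognizes the bad
    behavior near THETA as deliberate manipulation and assigns THETA a low
    loss instead of being fooled by the behavioral heuristic."""
    return (p - THETA_STAR) ** 2

def descend(loss_fn, p: float, steps: int = 500, lr: float = 0.01, eps: float = 1e-4) -> float:
    """Plain gradient descent; it only ever sees loss values, never intentions."""
    for _ in range(steps):
        grad = (loss_fn(p + eps) - loss_fn(p - eps)) / (2 * eps)
        p -= lr * grad
    return p

if __name__ == "__main__":
    print("behavioral loss only:", round(descend(behavioral_loss, 0.0), 3))  # stuck below THETA
    print("overseer's loss:     ", round(descend(overseer_loss, 0.0), 3))    # close to THETA_STAR
```

As in the comment above, nothing here penalizes the model for attempting the manipulation; the overseer's loss simply isn't fooled by it, so training continues towards θ∗.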
Ah, I see, that makes sense. I had in fact misunderstood what you were saying here.
That being said, why not penalize the model when it engages in gradient hacking, so that it becomes less likely to do it in the future?
It’s not as clear to me that epistemic competitiveness plus a normal training loop suffices to penalize gradient hacking; I would need to think about it more. Whereas I feel reasonably comfortable saying that if the model is able to alter its behavior because it believes that doing so will make SGD work worse, then a competitive overseer is able to use the same information to make SGD work better. (Though I can’t be that comfortable about anything given how shaky the abstractions are, and I mostly expect to revisit this with a clearer sense of what epistemic competitiveness means.)