It’s not as clear to me that epistemic competitiveness + a normal training loop suffices to penalize gradient hacking; I would need to think about it more. By contrast, I feel reasonably comfortable saying that if the model is able to alter its behavior because it believes that doing so will make SGD work worse, then a competitive overseer is able to use the same information to make SGD work better. (Though I can’t be that comfortable about anything given how shaky the abstractions are, and I mostly expect to revisit this with a clearer sense of what epistemic competitiveness means.)