Do these framings of gradient hacking, which I previously articulated here, form a useful categorization? (A toy sketch of the first two follows the list.)
Masking: Introducing a countervailing, “artificial” performance penalty that “masks” the performance benefits of ML modifications that would do well on the SGD objective but not on the mesa-objective;
Spoofing: Withholding performance gains until certain ML modifications that are desirable to the mesa-objective have been implemented; and
Steering: In a reinforcement learning context, selectively sampling environmental states that will either leave the mesa-objective unchanged or “steer” the ML model in a way that favors the mesa-objective.
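For concreteness, here is a minimal toy sketch of the gradient-level signatures of the first two framings. It is purely illustrative: the parameters, constants, and gating function are invented for this example, and steering is omitted since it concerns the sampling of environment states rather than the loss itself.

```python
import torch

# --- Masking (toy): an artificial penalty cancels the gradient that SGD
# would otherwise get from improving a parameter the mesa-optimizer
# wants to protect. ---
theta = torch.tensor(1.0, requires_grad=True)  # SGD "wants" theta -> 0

base_loss = theta ** 2                         # SGD objective
# Countervailing penalty: zero in value (the detached term matches the
# -theta^2 term numerically) but contributes -2*theta in gradient,
# exactly cancelling the base gradient.
mask_penalty = -theta ** 2 + theta.detach() ** 2

(base_loss + mask_penalty).backward()
print(theta.grad)  # tensor(0.): the performance benefit of moving theta is "masked"

# --- Spoofing (toy): performance gains are withheld behind a gate that
# only opens once a mesa-desired parameter phi crosses a threshold, so
# SGD itself is pressured to implement the mesa-desired modification. ---
phi = torch.tensor(0.2, requires_grad=True)    # mesa-desired modification
w = torch.tensor(1.0, requires_grad=True)      # ordinary task parameter

gate = torch.sigmoid(10.0 * (phi - 0.5))       # ~0 until phi exceeds 0.5
spoof_loss = w ** 2 + (1.0 - gate) * 5.0       # gains withheld while the gate is closed

spoof_loss.backward()
print(phi.grad)  # negative: gradient descent is pushed to increase phi
```

Of course, an actual gradient hacker would have to realize terms like these through its own forward computation rather than a hand-written loss; the sketch only shows what the resulting gradients would look like under each framing.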