Might recommender systems provide a similar phenomenon? The basic argument is that, instead of fulfilling the user’s current preferences, they can shape the user’s preferences to make the task easier, thus shaping their gradient to better achieve their goal. I haven’t read enough about the inner alignment discussion of gradient hacking to fully understand the points of disanalogy, but at least one difference is that there is no mesa-optimizer in the recommender systems. Curious to hear your thoughts.

Inspired by these two recent papers:
By David Scott Krueger: https://openreview.net/pdf?id=mMiKHj7Pobj
And https://arxiv.org/abs/2204.11966
Yeah, I read the ADS paper(s) after writing this post. I think it’s a useful framing, more selection-theorem-ey and with less emphasis on deliberateness/purposefulness.
Additionally, I think there is another conceptual distinction worth attending to (a toy sketch follows after this list):
- Auto-induced distributional shift is about affecting the environment to change inputs.
  - The system itself might remain unchanging and undergo no further learning, and still qualify.
- Gradient hacking is about changing the environment/inputs/observations to change updates (gradients).
  - The system is presumed subject to updates, over which it is taking (some amount of deliberate) influence.
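To make that contrast concrete, here is a minimal toy sketch (my own illustration, not from the post; all names and dynamics are invented for the example). The first loop is ADS-flavoured: a frozen recommender whose outputs shift the user's preference distribution even though its parameters never change. The second puts the same feedback loop inside online training, so the self-induced inputs also shape the gradient updates, which is the setting gradient hacking is about.

```python
import numpy as np

def frozen_recommender(pref):
    # Deployed, non-learning system: its output nudges preferences upward.
    return np.clip(pref + 0.1, 0.0, 1.0)

def ads_only(steps=10):
    """ADS: outputs change subsequent *inputs*; the system itself never updates."""
    pref = 0.2
    for _ in range(steps):
        rec = frozen_recommender(pref)
        pref = 0.9 * pref + 0.1 * rec   # user preference drifts toward recommendations
    return pref

def online_trained(steps=10, lr=0.5):
    """Gradient-hacking setting: the system is subject to updates, and the
    inputs it induces feed into the gradients that produce those updates."""
    w, pref = 0.0, 0.2                  # w is the single parameter of a linear recommender
    for _ in range(steps):
        rec = w * pref
        grad = 2 * (rec - pref) * pref  # d/dw of (w*pref - pref)^2
        w -= lr * grad                  # the update depends on the (self-influenced) input
        pref = 0.9 * pref + 0.1 * rec   # and the output shifts the next input
    return w, pref

print(ads_only())        # input distribution moves with no learning at all
print(online_trained())  # here the induced inputs also shape the parameter updates
```

The point of the contrast is just that the first loop already qualifies as ADS with no training in the picture at all, whereas gradient hacking is only definable once updates like those in the second loop exist.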
In this post I wrote:

> I’m looking for deliberate behaviours which are intended to affect the fitness landscape of the outer process.
which I think rules out (hopefully!) contemporary recommender systems on the above two distinctions (as you gestured to regarding mesa-optimization).
In practice, for a system subject to online outer training, ADS changes the inputs, which changes the training distribution and so does in fact cause some change in the updates to the system (perhaps even a large change!). But ADS per se doesn’t imply these effects are deliberate, though again you might be able to say something selection-theorem-ey about this process if it’s iterated. Indeed, a competent and deliberate gradient hacker might make quite effective use of ADS.
None of this is to say that ADS is not a concern; I just think it’s conceptually somewhat distinct!
In short, I think ADS is available as a mechanism to the extent that a system’s responses can affect its subsequent inputs (technically this is always the case, but in practice the degree of effect varies enormously). The system need not be subject to further training updates, though if it is, then depending on how those updates are generated, ADS behaviour may or may not be reinforced.
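As a rough illustration of that last sentence (again my own toy example, not from the post): whether the self-induced shift gets reinforced depends on where the update signal comes from. Computed on the live, self-influenced data, it rewards the shift; computed on a fixed reference distribution the system cannot touch, it does not.

```python
import numpy as np

def toy_update(data, w, lr=0.1):
    """One toy ascent step for a linear recommender rec = w * x,
    trained to increase a simple engagement proxy, rec * x."""
    x = np.asarray(data)
    grad = np.mean(x * x)        # d/dw of mean(w * x * x)
    return w + lr * grad

rng = np.random.default_rng(0)
reference = rng.uniform(0.0, 1.0, size=1000)   # curated data the system cannot influence

w = 0.5
live = reference.copy()
for _ in range(10):
    live = np.clip(live + 0.05 * w, 0.0, 1.0)  # the system's outputs shift the live inputs

w_from_live = toy_update(live, w)            # larger step: the induced shift raises the
                                             # gradient, which raises w, inducing more shift
w_from_reference = toy_update(reference, w)  # no such feedback; the shift isn't rewarded
print(w_from_live, w_from_reference)
```

That is all the ‘may or may not be reinforced’ clause is pointing at: the same behaviour, scored against data the system can versus cannot influence.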
Gradient hacking was originally coined to mean deliberate, situationally aware influence over training updates. (ADS is one mechanism by which this could be achieved.)
The term ‘gradient hacking’ also seems to be commonly used to refer to any kind of influence a system has over its training updates, whether situationally aware/deliberate or not. I think it’s helpful to distinguish these, so I often say ‘deliberate gradient hacking’ to make sure.