In particular, the detection mechanisms for mesa-optimizers are intact, but we do need to worry about 1 new potential inner misalignment pathway.
I’m going to read this as “...1 new potential gradient hacking pathway” because I think that’s what the section is mainly about. (It appears to me that throughout the section you’re conflating mesa-optimization with gradient hacking, but that’s not the main thing I want to talk about.)
The following quote indicates at least two potential avenues of gradient hacking: “In an RL context” and “supervised learning with adaptive data sampling”. Both flow through the gradient hacker affecting the data distribution, but they seem worth distinguishing, because there are many distinct ways a malign gradient hacker could affect that distribution.
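To make the “adaptive data sampling” avenue concrete, here is a minimal, hypothetical sketch (PyTorch; my own construction, not anything from the post) of a supervised loop in which the model’s own per-example losses set the sampling weights for the next batch, so the model’s behaviour feeds back into its own training distribution:

```python
# Hypothetical sketch: supervised learning with adaptive data sampling.
# The model's own per-example losses decide which examples it sees next,
# so its behaviour shapes its future training distribution.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)                      # stand-in for a real network
loss_fn = nn.CrossEntropyLoss(reduction="none")
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

X = torch.randn(1000, 16)                     # toy data pool
y = torch.randint(0, 2, (1000,))

for step in range(100):
    with torch.no_grad():
        per_example = loss_fn(model(X), y)    # model scores the whole pool
    weights = per_example / per_example.sum() # harder examples get sampled more often
    idx = torch.multinomial(weights, 64, replacement=True)

    opt.zero_grad()
    loss = loss_fn(model(X[idx]), y[idx]).mean()
    loss.backward()
    opt.step()
```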
Broadly, I’m confused about why others (confidently) think gradient hacking is difficult. Like, we have this pretty obvious pathway of a gradient hacker affecting training data. And it seems very likely that AIs are going to be training on their own outputs or otherwise curating their data distribution — see, e.g.:
Phi, a recent small-scale success at using lots of synthetic data in pre-training,
Constitutional AI / self-critique,
using LLMs for data labeling and content moderation,
The large class of self-play approaches that I often lump together under “Expert-iteration”, which involve iteratively training on the best of your previous actions (a rough sketch follows this list),
the fact that RLHF usually uses a preference model derived from the same base/SFT model being trained.
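For concreteness, here is a rough, hypothetical sketch of the expert-iteration pattern (assuming a Hugging Face-style causal LM, tokenizer, and some external `scorer` function; none of these specifics come from the comment): sample several of the model’s own outputs, keep the best one per prompt, and fine-tune on it, so the model is literally curating its own training data.

```python
# Hypothetical sketch of one expert-iteration round: generate, keep the best
# of the model's own outputs, and fine-tune on them.
def expert_iteration_round(model, tokenizer, prompts, scorer, optimizer, n_samples=8):
    best_texts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        outs = model.generate(**inputs, do_sample=True,
                              num_return_sequences=n_samples, max_new_tokens=64)
        candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outs]
        best_texts.append(max(candidates, key=scorer))  # keep the highest-scoring sample

    model.train()
    for text in best_texts:                             # train on the model's own outputs
        batch = tokenizer(text, return_tensors="pt")
        optimizer.zero_grad()
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
```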
Sure, it may be difficult to predictably affect training via partial control over the data distribution. Personally I have almost zero clue how to affect model training via data curation, so my epistemic state is extremely uncertain. I roughly feel like the rest of humanity is in a similar position — we have an incredibly poor understanding of large language model training dynamics — so we shouldn’t be confident that gradient hacking is difficult. On the other hand, it’s reasonable to be like “if you’re not smarter than (some specific set of) current humans, it is very hard for you to gradient hack, as evidenced by us not knowing how to do it.”
I don’t think strong confidence in either direction is merited by our state of knowledge on gradient hacking.
Basically, it’s a combo of not being incentivized to do it and the fact that SGD is actually really powerful in ways that undermine the traditional story for gradient hacking.
One of the most important things to keep in mind is that gradient descent optimizes every parameter independently and simultaneously, and that it optimizes the entire differentiable causal graph leading to the loss. Unless the gradient hacker contains non-differentiable components, there is very little avenue for the inner misaligned agent to escape being optimized away by SGD.
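A toy illustration of that point (my construction, not the commenter’s): as long as a subcomponent sits on the differentiable path to the loss, backprop assigns it a gradient and SGD updates it along with everything else; it cannot opt out of being pushed toward lower loss.

```python
import torch
import torch.nn as nn

main_net = nn.Linear(8, 8)
mesa_net = nn.Linear(8, 1)   # pretend this part "wants" to pursue its own goal

x = torch.randn(32, 8)
target = torch.randn(32, 1)

out = mesa_net(main_net(x))      # mesa_net is on the differentiable path to the loss
loss = ((out - target) ** 2).mean()
loss.backward()                  # gradients flow to *every* parameter at once

# Both modules get nonzero gradients pointing toward lower loss; the "mesa"
# part is updated right alongside the rest of the network.
print(mesa_net.weight.grad.abs().mean())
print(main_net.weight.grad.abs().mean())
```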
In general, this is a big problem for a lot of danger stories that rely on a goal divergence between the base optimizer and the mesa-optimizer: how do you prevent the mesa-optimizer from being optimized away by SGD? For a lot of stories the likely answer is that you can’t, and the proposals people make usually fall victim to the fact that SGD is far better at credit assignment than genetic algorithms or evolutionary methods, leaving little slack for a divergent mesa-optimizer to hide in.