Basically, it’s a combination of two things: there’s no incentive to do it, and SGD is powerful in ways that undermine the traditional story for gradient hacking.
One of the most important things to keep in mind is that gradient descent optimizes every parameter independently and simultaneously. So unless a gradient hacker contains non-differentiable components, the inner misaligned agent has no way to escape being optimized away by SGD. And since SGD optimizes the entire causal graph leading to the loss, there is very little room left for a gradient hacker to hide in.
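To make the "independently and simultaneously" point concrete, here is a toy sketch, not a model of an actual gradient hacker. It assumes a two-weight linear model where a hypothetical `w_task` weight does the useful work and a hypothetical `w_hack` weight only adds error. It uses full-batch gradient descent as a stand-in for SGD, and all names and data are made up for illustration.

```python
# Toy data: the target is y = 2 * x. The feature h is orthogonal to x
# (the sum of x * h is 0) and carries no information about y.
XS = [1.0, 2.0, 3.0, 4.0]
HS = [1.0, -1.0, -1.0, 1.0]
YS = [2.0 * x for x in XS]


def gradients(w_task, w_hack):
    """Partial derivatives of the mean squared error w.r.t. both weights."""
    n = len(XS)
    errors = [w_task * x + w_hack * h - y for x, h, y in zip(XS, HS, YS)]
    g_task = 2.0 * sum(e * x for e, x in zip(errors, XS)) / n
    g_hack = 2.0 * sum(e * h for e, h in zip(errors, HS)) / n
    return g_task, g_hack


def train(w_task, w_hack, steps=200, lr=0.05):
    """Plain gradient descent: both weights are updated in the same step."""
    for _ in range(steps):
        g_task, g_hack = gradients(w_task, w_hack)
        w_task -= lr * g_task
        w_hack -= lr * g_hack
    return w_task, w_hack


# Start with the loss-increasing component fully "switched on".
final_task, final_hack = train(w_task=0.0, w_hack=1.0)
print(f"w_task = {final_task:.6f}, w_hack = {final_hack:.2e}")
```

In this toy setup `w_task` ends up close to 2 and `w_hack` close to 0: the component that hurts the loss gets its own gradient on every step, regardless of what the rest of the model is doing. A real gradient hacker would presumably try to act strategically, which this sketch does not capture. It only shows the baseline pressure such an agent would have to resist.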
In general, this is a big problem for a lot of danger stories that rely on goal divergence between the base optimizer and the mesa-optimizer: how do you prevent the mesa-optimizer from being optimized away by SGD? For a lot of these stories, the likely answer is that you can’t, and the stories people propose usually fall victim to the fact that SGD is much better at credit assignment than genetic algorithms or evolutionary methods.
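The credit-assignment contrast can be sketched roughly as follows, under strong simplifying assumptions: a two-weight toy loss, full-batch gradient descent standing in for SGD, and a simple mutate-and-select loop standing in for evolutionary methods. The functions `gradient_descent` and `mutate_and_select` and the weights `w_task` / `w_hack` are hypothetical names for illustration only.

```python
import random

# Toy data: y = 2 * x; h is orthogonal to x and only adds error.
XS = [1.0, 2.0, 3.0, 4.0]
HS = [1.0, -1.0, -1.0, 1.0]
YS = [2.0 * x for x in XS]


def loss(w):
    """Mean squared error for weights w = [w_task, w_hack]."""
    w_task, w_hack = w
    return sum((w_task * x + w_hack * h - y) ** 2
               for x, h, y in zip(XS, HS, YS)) / len(XS)


def grad(w):
    """One partial derivative per weight: this is the credit assignment."""
    w_task, w_hack = w
    n = len(XS)
    errors = [w_task * x + w_hack * h - y for x, h, y in zip(XS, HS, YS)]
    return [2.0 * sum(e * x for e, x in zip(errors, XS)) / n,
            2.0 * sum(e * h for e, h in zip(errors, HS)) / n]


def gradient_descent(w, steps, lr=0.05):
    w = list(w)
    for _ in range(steps):
        w = [p - lr * g for p, g in zip(w, grad(w))]
    return w


def mutate_and_select(w, steps, sigma=0.1, seed=0):
    """Keep a random mutation only if the single scalar loss improves.

    The selection step never learns which weight caused the change.
    """
    rng = random.Random(seed)
    best, best_loss = list(w), loss(w)
    for _ in range(steps):
        candidate = [p + rng.gauss(0.0, sigma) for p in best]
        candidate_loss = loss(candidate)
        if candidate_loss < best_loss:
            best, best_loss = candidate, candidate_loss
    return best


start = [0.0, 1.0]  # the loss-increasing weight starts fully "on"
gd_final = gradient_descent(start, steps=200)
es_final = mutate_and_select(start, steps=200)
print("gradient descent loss: ", loss(gd_final))
print("mutate-and-select loss:", loss(es_final))
```

The gradient hands each weight its own signed error signal on every step, while the mutation loop only sees one number per candidate and has to stumble onto good changes. With only two weights the gap is modest. The argument in the text is about how this difference plays out at the scale of real networks, which a toy like this cannot demonstrate.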