Interesting paper on the topic, you might’ve seen it already. They show that as you optimize against a learned proxy of the true underlying reward, the loss measured against the true reward follows a U-shaped curve: high at first, then low, then climbing back up as you overoptimize against the proxy.
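To gesture at the shape of the effect, here’s a toy sketch I threw together (entirely my own illustration, not the paper’s setup, which uses actual reward models: here the true reward is a quadratic, the proxy is a linear fit to noisy samples near the starting point, and "optimization" is plain gradient ascent on the proxy):

```python
# Toy Goodhart sketch: the proxy reward matches the true reward near the
# training region but diverges far from it, so optimizing the proxy hard
# sends the true loss down and then back up.
import numpy as np

rng = np.random.default_rng(0)

def true_reward(theta):
    # "Gold" objective, peaked at theta = 2.
    return -(theta - 2.0) ** 2

# Fit the proxy from noisy evaluations near theta = 0 (the training region);
# a degree-1 fit can only capture the local upward slope.
train_thetas = rng.normal(0.0, 0.5, size=50)
train_rewards = true_reward(train_thetas) + rng.normal(0.0, 0.1, size=50)
proxy_coeffs = np.polyfit(train_thetas, train_rewards, deg=1)

def proxy_reward(theta):
    return np.polyval(proxy_coeffs, theta)

def proxy_grad(theta, eps=1e-4):
    # Finite-difference gradient of the proxy.
    return (proxy_reward(theta + eps) - proxy_reward(theta - eps)) / (2 * eps)

# Gradient-ascend the proxy and track the loss against the true reward.
theta, lr = 0.0, 0.05
for step in range(101):
    if step % 10 == 0:
        print(f"step {step:3d}  proxy reward {proxy_reward(theta):8.2f}  "
              f"true loss {-true_reward(theta):8.2f}")
    theta += lr * proxy_grad(theta)
```

Running it, the true loss starts around 4, bottoms out near zero after roughly ten steps, and then grows without bound even though the proxy reward keeps improving the whole time.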
This overoptimization isn’t caused by mesa-optimization, which so far has not been observed in naturally trained neural networks. It’s more closely related to robustness and generalization under varying amounts of optimization pressure.
But if we grant the mesa-optimizer concern, it seems reasonable that more optimization will produce more coherent misaligned inner goals and more deceptive behavior to hide them. Unfortunately I don’t think slack is really a solution, because optimizing an objective really hard is what makes your AI capable of doing interesting things. Handicapping your systems isn’t a solution to AI safety; the alignment tax on it is too high.
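To be fair to the slack framing, it usually gets operationalized as a penalty on how far you move from the initial policy, i.e. the generic RLHF-style objective (not a claim about what this particular paper tests):

maximize E_{x~π}[ r_proxy(x) ] − β · KL(π ‖ π_init)

Turning β up buys you slack, but it does so precisely by giving up measured reward, which is the capability hit I mean by the tax being too high.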
https://arxiv.org/abs/2210.10760