1. Optimization unavoidably leads to Goodharting (as I like to say, Goodhart is robust)
What happens if we revise or optimize our metrics?
2. Attempts to build aligned AI that rely on optimizing for alignment will eventually fail to become or remain aligned due to Goodhart effects under sufficient optimization pressure.
Sufficient optimization pressure from the AI itself? Or are there also risks associated with a) our mitigation efforts, e.g. reducing optimization decreasing friendliness ‘because of Goodhart’s Law’, or b) the process of trying to make an AI friendly/not optimize/etc., where the harder we push, the more risk that optimization process itself introduces?
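To make the worry in claim 1 concrete, here is a minimal toy sketch (mine, not from the original post) of regressional Goodhart: the proxy metric is the true value plus independent noise, and we select the proxy-best candidate from ever larger pools, i.e. under ever more optimization pressure. The function name select_best_by_proxy and all parameters are purely illustrative.

```python
# Toy sketch of regressional Goodhart (illustrative, not the post's model).
# Proxy = true value + independent noise; we pick the candidate with the best
# proxy score from pools of increasing size and compare proxy score vs. true value.

import numpy as np

rng = np.random.default_rng(0)

def select_best_by_proxy(pool_size: int, trials: int = 2000) -> tuple[float, float]:
    """Return (mean proxy score, mean true value) of the proxy-optimal candidate."""
    true_value = rng.normal(size=(trials, pool_size))
    proxy = true_value + rng.normal(size=(trials, pool_size))  # noisy measurement
    best = np.argmax(proxy, axis=1)                            # optimize the proxy
    rows = np.arange(trials)
    return proxy[rows, best].mean(), true_value[rows, best].mean()

for pool_size in (1, 10, 100, 1000, 10000):
    p, t = select_best_by_proxy(pool_size)
    print(f"pool={pool_size:>6}  proxy={p:5.2f}  true={t:5.2f}  gap={p - t:5.2f}")
```

With Gaussian noise the selected candidate's true value still rises, but the proxy increasingly overstates it as the pool grows; with heavier-tailed error the true value can stall or even fall under stronger selection, which is closer to the "unavoidably" in claim 1.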