Planned summary for the Alignment Newsletter:

> This post distinguishes between three kinds of “alignment”:
>
> 1. Not building an AI system at all,
> 2. Building Friendly AI that will remain perfectly aligned for all time and at all capability levels,
> 3. _Bootstrapped alignment_, in which we build AI systems that may not be perfectly aligned but are at least aligned enough that we can use them to build perfectly aligned systems.
>
> The post argues that optimization-based approaches can’t lead to perfect alignment, because there will always eventually be Goodhart effects.
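(For readers who want a concrete picture of that Goodhart point, here is a minimal toy sketch, not from the post, assuming a “true value plus independent noise” proxy: the harder you select on the proxy, the more the proxy overstates the true value at the selected point.)

```python
import numpy as np

rng = np.random.default_rng(0)

def proxy_gap(n_candidates: int) -> float:
    """Pick the candidate that scores best on a noisy proxy and return how
    much the proxy overstates the true value at that optimum."""
    true_value = rng.normal(size=n_candidates)          # the thing we actually care about
    proxy = true_value + rng.normal(size=n_candidates)  # imperfect but correlated measure
    best = int(np.argmax(proxy))                        # optimize the proxy, not the true value
    return float(proxy[best] - true_value[best])

# More candidates = more optimization pressure on the proxy; the overestimate grows.
for n in (10, 1_000, 100_000):
    gaps = [proxy_gap(n) for _ in range(200)]
    print(f"{n:>7} candidates: proxy overstates true value by {np.mean(gaps):.2f} on average")
```

Stronger selection makes the chosen point look better and better on the proxy while its true value regresses toward the mean, which is one concrete form of the Goodhart failure summarized above.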
Looks good to me! Thanks for planning to include this in the AN!