Nicholas’s summary for the Alignment Newsletter:
This post identifies two additional types of pseudo-alignment not mentioned in Risks from Learned Optimization. Corrigible pseudo-alignment is a new subtype of corrigible alignment. In corrigible alignment, the mesa-optimizer models the base objective and optimizes that model. Corrigible pseudo-alignment occurs when the model of the base objective is a non-robust proxy for the true base objective. Suboptimality deceptive alignment occurs when deception would help the mesa-optimizer achieve its objective, but it does not yet realize this. This is particularly concerning because even if AI developers check for and prevent deception during training, the agent might become deceptive after it has been deployed.
Nicholas’s opinion:
These two variants of pseudo-alignment seem useful to keep in mind, and I am optimistic that classifying risks from mesa-optimization (and AI more generally) will make them easier to understand and address.