I definitely agree that deceptive alignment seems likely to break black-box properties such as niceness by default, thanks to the simplicity prior and the fact that internal or corrigible alignment is harder to reach than deceptive alignment, at least once the model has a world-model.