I think I agree with everything in this comment, and that paragraph was mostly intended as the foundation for the second point I made (disagreeing with your assessment in “What if Alex has benevolent motivations?”).
Part of the disagreement here might come down to how I think “be honest and friendly” factorizes into many subgoals (“be polite”, “don’t hurt anyone”, “inform the human if a good-seeming plan is going to have bad results 3 days from now”, “tell Stalinists true facts about what Stalin actually did”). While Alex will surely learn to terminally value the wrong (not honest+helpful+harmless) outcomes along some of these goal-axes, it seems likely to learn to value the other axes robustly.