This line of reasoning is absurd: it assumes an agent knows in advance the precise effects of self-improvement — but that’s not how learning works! If you knew exactly how an alteration in your understanding of the world would impact you, you wouldn’t need the alteration: to be able to make that judgement, you’d have to be able to reason as though you had already undergone it.
There seems to be some major confusion going on here. It is, generally speaking, impossible to know the outcome of an arbitrary computation without actually running it, but that does not mean it’s impossible to design a specific computation in such a way that you know exactly what its effects will be. For example, one does not need to know the trillionth digit of pi in order to write a program that they could be very certain would compute that digit.
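To make the pi example concrete, here is a minimal sketch (the function names `arctan_inv` and `pi_digit` are my own, and it uses Machin’s formula rather than anything scaled to the trillionth digit): you can be very confident, before running it, that it computes the n-th digit of pi, even though you don’t know what that digit is.

```python
from decimal import Decimal, getcontext

def arctan_inv(x, prec):
    """arctan(1/x) via its Taylor series, to roughly `prec` digits."""
    getcontext().prec = prec + 5
    eps = Decimal(10) ** (-prec)
    x = Decimal(x)
    power = Decimal(1) / x        # tracks 1 / x**(2k+1)
    total = power                 # the k = 0 term
    k = 0
    x_squared = x * x
    while abs(power) > eps:
        k += 1
        power /= x_squared
        term = power / (2 * k + 1)
        total += term if k % 2 == 0 else -term
    return total

def pi_digit(n):
    """n-th decimal digit of pi after the point, via Machin's formula:
    pi = 16*arctan(1/5) - 4*arctan(1/239)."""
    prec = n + 10                 # guard digits against rounding error
    pi = 16 * arctan_inv(5, prec) - 4 * arctan_inv(239, prec)
    return int(str(pi)[n + 1])    # str(pi) is "3.14159..."; skip "3."

print(pi_digit(10))  # → 5
```

Knowing the program’s behavior here comes from the structure of the computation (a convergent series with enough guard digits), not from having run it, which is exactly the distinction at issue.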
You also seem too focused on minor modifications of a human-like mind, but focusing narrowly on minds at all is missing the point: focus on optimization programs instead.
For many different kinds of X, it should be possible to write a program that, given a particular robotic apparatus (just the electromechanical parts, without a specific control algorithm), predicts which electrical signals sent to the robot’s actuators would result in more X. You can then place that program inside the robot and wire the program’s output to the robot’s controls. The resulting robot does not “like” X, it’s just robotically optimizing for X.
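A toy sketch of the wiring, under invented assumptions (the names `plan`, `forward_model`, and `toy_model` are mine, and the “apparatus” is a one-dimensional arm): the planner treats the forward model as a black box and simply outputs whichever actuator-signal sequence the model predicts yields the most X.

```python
import itertools

def plan(forward_model, signals, horizon):
    """Brute-force search over actuator-signal sequences, returning the
    sequence the forward model predicts will produce the most X.
    The model is a black box: (state, x) = forward_model(state, signal, x)."""
    best_seq, best_x = None, float("-inf")
    for seq in itertools.product(signals, repeat=horizon):
        state, x = 0, 0
        for s in seq:
            state, x = forward_model(state, s, x)
        if x > best_x:
            best_seq, best_x = seq, x
    return best_seq, best_x

# Hypothetical dynamics for illustration: each signal shifts the arm's
# position, and one unit of X is produced whenever the arm sits at position 2.
def toy_model(pos, signal, x):
    pos += signal
    if pos == 2:
        x += 1
    return pos, x

seq, x = plan(toy_model, (-1, 0, 1), horizon=4)
print(seq, x)  # → (1, 1, 0, 0) 3
```

Nothing in `plan` refers to what X is or whether anyone wants it; swap in a different forward model and the same code optimizes for something else entirely.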
The orthogonality principle just says that there is nothing particularly special about human-aligned Xs that would make the X-robot more likely to work well for those Xs than for Xs that result in human extinction (thanks to convergent instrumental goals, X does not need to be specifically anti-human for that).