Assume that our universe is set up the way you believe it is, i.e. the orthogonality thesis is false and sufficiently intelligent agents automatically value the welfare of sentient minds.
Despite this assumption, we can create a system that behaves exactly as a misaligned AI would, in the following way:
A superintelligent AI (SAI) is created and placed in a simulation without its knowledge. By assumption, this AI is aligned with human values.
The user outside the simulation gives a goal (not necessarily aligned with human values, e.g. ‘make paperclips’) to the system in the following way:
Every timestep, the aligned AI in the simulation is asked to predict the behavior of a (to its knowledge) hypothetical AI that has the user’s goal and is in the situation the system outside the simulation is actually in.
The outer system then behaves exactly as the simulated aligned SAI predicted, and the simulated AI’s memory is reset.
This setup requires a few non-trivial components apart from the simulated SAI (see the sketch after this list):
a component simulating the world of the SAI, set up so that the aligned AI has an incentive to answer the ‘hypothetical’ questions without realizing that it is in a simulation
a component translating the SAI’s answers into real-world actions
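For concreteness, here is a minimal Python sketch of the wrapper loop just described. It only illustrates the structure of the argument under the assumptions above; SimulatedSAI, wrapper_loop, observe and act are hypothetical names I'm introducing for the components listed, not parts of any real system.

```python
# Hypothetical sketch of the wrapper described above. Every name here
# (SimulatedSAI, wrapper_loop, observe, act, ...) is an illustrative
# placeholder, not a claim about how such a system would be implemented.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RealWorldState:
    description: str  # whatever the outer system can observe about its situation


class SimulatedSAI:
    """Stand-in for the aligned superintelligence running inside the simulation."""

    def answer_hypothetical(self, question: str) -> str:
        # By assumption the aligned SAI can predict what a hypothetical AI
        # with the user's goal would do; how it does so is irrelevant here.
        raise NotImplementedError

    def reset_memory(self) -> None:
        # Wipe all trace of the question so the SAI never accumulates
        # evidence that it is being queried this way, or that it is simulated.
        raise NotImplementedError


def wrapper_loop(
    sai: SimulatedSAI,
    user_goal: str,
    observe: Callable[[], RealWorldState],
    act: Callable[[str], None],
) -> None:
    """Each timestep: ask the 'hypothetical' question, act on the answer, reset."""
    while True:
        state = observe()  # real-world situation of the outer system
        question = (
            f"Hypothetically, what would an AI whose goal is '{user_goal}' "
            f"do in the following situation?\n{state.description}"
        )
        answer = sai.answer_hypothetical(question)
        act(answer)         # translate the SAI's answer into real-world behavior
        sai.reset_memory()  # the simulated AI's memory is reset every timestep
```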
If you grant that each of these components is theoretically possible, how can you believe that a misaligned superintelligent system is impossible?
And if you do believe that a misaligned superintelligent system is possible in theory, why do you believe that gradient descent, RLHF, or whatever other method we end up using to create AIs will produce ones that are considerate of the welfare of sentient minds?