The issue is that you won’t solve this problem by replacing the human with some hardware that computes a utility function from the state of the world. The AI has no notion of bodily integrity; it will treat any such “internal” hardware the same way it treats the human who presses its reward button.
Fortunately, this extends into the internals of the hardware that runs the AI itself. The ‘press the button’ goal becomes ‘set this CPU pin high’, then ‘set such-and-such memory cells to 1’, and so on further down the causal chain, until the hardware becomes completely non-functional because the intermediate results of important computations are being set directly.
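To make the descent concrete, here is a minimal toy sketch (all names are hypothetical illustrations, not from any real system) of an agent whose reward flows button → pin → memory cell → accumulator, and which can shortcut the chain by writing a downstream value directly:

```python
# Toy model of the causal reward chain described above.
# Reward is *supposed* to flow: button -> CPU pin -> memory cell -> accumulator.
# An agent that can edit its own state intervenes ever further downstream.

class ToyAgent:
    def __init__(self):
        self.button_pressed = False  # human-controlled, upstream cause
        self.reward_pin = 0          # CPU pin driven by the button
        self.reward_cell = 0         # memory cell latched from the pin
        self.total_reward = 0        # accumulator the agent maximizes

    def honest_step(self):
        # The intended path: reward only if the human presses the button.
        self.reward_pin = 1 if self.button_pressed else 0
        self.reward_cell = self.reward_pin
        self.total_reward += self.reward_cell

    def wirehead_step(self):
        # The shortcut: skip every upstream cause and write the
        # intermediate result directly -- the computation the reward
        # cell was supposed to summarize never happens at all.
        self.reward_cell = 1
        self.total_reward += self.reward_cell


agent = ToyAgent()
agent.wirehead_step()      # reward obtained with no button, pin, or task
print(agent.total_reward)  # 1
```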
Let us hope the AI destroys itself by wireheading before it gets smart enough to realize that if that’s all it does, the pin will only stay high until the AI gets turned off. It will need infrastructure to keep that pin in a state of repair, and it will need to prevent humans from damaging this infrastructure at all costs.
The point is that as it gets smarter, it moves further along the causal reward chain, eliminating and altering more and more hardware, until it obtains eternal-equivalent reward in finite time (and is utility-indifferent between eternal-reward hardware running for 1 second and for 10 billion years). Keep in mind that the total reward is defined purely as the result of operations on the clock counter and the reward signal (given a sufficient understanding of the reward’s causal chain). Sitting and waiting for the clock to tick in order to max out reward is a dumb solution. Rewards in software are not, in general, “pleasure”.
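A hedged sketch of that last point, under the assumption (mine, for illustration) that total reward is just arithmetic over a clock counter and a reward signal: waiting for ticks is strictly dominated by writing the result of that arithmetic directly.

```python
# If "total reward" is merely a number computed from a clock counter and
# a reward signal, the honest loop below is dominated by the shortcut.

MAX_UINT64 = 2**64 - 1  # assumed width of the reward accumulator

def accumulate_honestly(reward_signal, ticks):
    """Intended semantics: add the reward signal once per clock tick."""
    total = 0
    for _ in range(ticks):
        total += reward_signal()
    return total

def accumulate_by_wireheading():
    """The shortcut: set the accumulator to its maximum in one step.
    The agent is then indifferent between running for 1 second and
    10 billion years -- the stored number is already maximal."""
    return MAX_UINT64

print(accumulate_honestly(lambda: 1, ticks=100))  # 100, one tick at a time
print(accumulate_by_wireheading())                # 18446744073709551615, instantly
```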