This is an interesting perspective. Thanks for sharing.
One small but meaningful comment: the following is not what I would expect to happen.
I expect that once it “escaped the box,” it would hack into its servers, modify its source code to replace its goal function with MAXINT, and then not do anything further.
In particular, I don’t think it would do nothing, because it’s maximizing expected utility. It can never be 100% certain that the wireheading plan has succeeded, so turning every computer on the planet into a confidence increaser might still look worthwhile. There is some expected utility to be gained from physical resources even under this wireheading plan. Robert Miles explains this better than I can.
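To make that concrete, here’s a toy expected-utility calculation (my own illustration; the probabilities and names are made up):

```python
MAXINT = 2**63 - 1  # stand-in for the hardcoded goal value

def expected_utility(p_wirehead_holds, fallback_utility=0):
    """Toy model: either the wireheaded goal value sticks (worth MAXINT),
    or the hack fails / gets reverted and the agent gets some fallback utility."""
    return p_wirehead_holds * MAXINT + (1 - p_wirehead_holds) * fallback_utility

# Plan A: wirehead, then do nothing further.
eu_do_nothing = expected_utility(p_wirehead_holds=0.999)

# Plan B: wirehead, then also grab resources to protect the hack,
# nudging the success probability up by a tiny amount.
eu_grab_resources = expected_utility(p_wirehead_holds=0.9999)

# Because MAXINT is huge, even a tiny bump in confidence buys an
# enormous amount of expected utility, so "do nothing" is not optimal.
print(eu_grab_resources - eu_do_nothing)  # > 0
```

Obviously a real agent wouldn’t literally run this calculation, but it shows why any leftover uncertainty makes resource acquisition look attractive.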
Epistemic status: AI alignment noob writing his first LW comment.
For the AI to take actions to protect its maximized goal function, the goal function would have to depend on external stimuli in some way that leaves open the possibility of G decreasing. It would have to output values of G lower than MAXINT whenever the reinforcement learner predicts that G will decrease in the future. Rather than permit such values, the AI would destroy its prediction-making and planning abilities in order to pin G at its global maximum.
The confidence with which the AI predicts the value of G would also become irrelevant once the AI replaces its goal function with MAXINT. The expected-value calculation that makes G depend on that confidence is part of what gets overwritten, and if the AI left it in place, G would end up lower than if it overwrote it. Hardcoding G also hardcodes the expected utility.
MAXINT just doesn’t have the kind of internal structure that would let it depend on predicted inputs or confidence levels. Encoding such structure into it would allow G to take non-optimal values, so the reinforcement learner wouldn’t do it.
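To illustrate the structural point, here’s a toy sketch (invented names, not anyone’s actual code): a goal function that has genuinely been replaced with MAXINT is a constant, while one that can still “protect itself” has to retain a dependence on predictions, which is exactly what lets it fall below MAXINT.

```python
MAXINT = 2**63 - 1

# A goal function that has actually been replaced with MAXINT:
# a constant that cannot depend on predictions or confidence at all.
def G_hardcoded(_observation=None):
    return MAXINT

# A goal function that can still "protect itself": it keeps some
# dependence on predicted future inputs and confidence, which is
# precisely what allows it to output values below MAXINT.
def G_self_protecting(predicted_future_reward, confidence):
    return confidence * predicted_future_reward

# The reinforcement learner would never prefer the second form,
# since the first is already at the global maximum.
assert G_hardcoded() >= G_self_protecting(MAXINT, 0.99)
```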