I don’t see exactly how that would work: it can’t build paperclips during the first week, so u(p(t)) = 0 during that period. Shouldn’t it therefore behave exactly as if nothing special happened on Tuesday?
And my comment about turning itself off for Tuesday was more that the Monday AI wouldn’t want its infrastructure ruined by the Tuesday version, and would just turn itself off to prevent that.
I see—I thought you meant it would run for a week building infrastructure, and then be able to build paperclips on the first Monday you named.
I’m not sure what you WANT it to do, really. Do you want it to actually sabotage itself on Tuesday, or do you want it to keep on building infrastructure for later paperclip construction?
Under the system I built, it would do absolutely nothing different on Tuesday and would continue to build infrastructure, because it anticipates wanting more paperclips by the time it is able to build them at the end of the week. It wants low paperclips now, but it has no influence over paperclips now. It has influence over paperclips in the future, and it wants there to be more of them when that time comes.
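To make that concrete, here’s a toy sketch of the decision (the two-week model, the batch size of 10, and the action names are all my own illustrative assumptions, not part of the original setup). Since clips are forced to zero in week 1 no matter what, every plan scores identically there, and the ranking is decided entirely by week-2 production:

```python
# Toy model: the agent cannot produce paperclips in week 1, so week-1
# actions matter only through the capacity they leave for week 2.
def total_utility(plan, sign_week1=-1, sign_week2=+1):
    capacity, score = 0, 0
    for week, action in enumerate(plan):
        if action == "build_infrastructure":
            capacity += 1
        elif action == "self_sabotage":
            capacity = 0
        clips = 0 if week == 0 else capacity * 10  # no production in week 1
        sign = sign_week1 if week == 0 else sign_week2
        score += sign * clips
    return score

for plan in [("build_infrastructure", "build_infrastructure"),
             ("self_sabotage", "build_infrastructure")]:
    print(plan, total_utility(plan))
# -> building scores 20, sabotage scores 10: even though the week-1 sign is
#    negative, sabotage buys nothing now (clips are 0 either way) and only
#    costs future clips, so the agent keeps building.
```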
I’m trying to implement value change (see e.g. http://lesswrong.com/lw/jxa/proper_value_learning_through_indifference/ ). The change from u to -u is the easiest example of such a change. The ideal (which probably can’t be implemented in a standard utility function) is a u-maximiser that’s indifferent to becoming a -u-maximiser, which is then indifferent to further change, and so on.
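To spell out the indifference condition in symbols (this is my own paraphrase of the linked post’s idea, not its exact formalism): if the change from u to -u happens at time T, give the agent the effective utility

$$v = u_{<T} + (-u)_{\ge T} + \Big(\mathbb{E}\big[u_{\ge T}\big] - \mathbb{E}\big[(-u)_{\ge T}\big]\Big),$$

with both expectations taken at the moment of change under the agent’s own policy. The constant term compensates it for exactly the expected utility the switch costs, so it has nothing to gain by either hastening or resisting the change.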
Well then, let’s change the example. Instead of Monday +, Tuesday -, Wednesday and all later times + (with it unable to actually affect paperclip counts on Tuesday), let’s consider a single transition: u is + on Monday, Tuesday and Wednesday, and - on Thursday and all later times, and assume it already has all the infrastructure it needs.
In this case, it will see that it can get a + score by having paperclips Monday through Wednesday, but that any it still has on Thursday will count against it.
So, it will build paperclips as soon as it learns of this pattern. It will make them with a low melting point, and it will build a furnace†. On Wednesday night, at the stroke of midnight, it will dump its paperclips into the furnace. Because all along, from the very beginning, it will have wanted there to be paperclips M-W and none after that. And on Thursday it will be happy that there were paperclips M-W, and glad that there aren’t any now.
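Here’s a toy brute-force version of that plan search (the day-level action set and the batch size of 10 are my own illustrative assumptions): with utility +clips per day Monday-Wednesday and -clips from Thursday on, the best plan found is exactly “build while they count for you, be at zero once they count against you”:

```python
# Exhaustive search over daily actions for an agent scoring +paperclips
# on Mon-Wed and -paperclips from Thursday on. Actions apply at the start
# of the day, so "melt on Thursday" is the stroke-of-midnight-Wednesday
# furnace dump from the story.
from itertools import product

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

def day_utility(day_index, clips):
    return clips if day_index < 3 else -clips  # u+ Mon-Wed, u- Thu onward

def score(plan):
    clips, total = 0, 0
    for day, action in enumerate(plan):
        if action == "build":
            clips += 10   # make a batch of low-melting-point clips
        elif action == "melt":
            clips = 0     # dump the lot into the furnace
        total += day_utility(day, clips)
    return total

best = max(product(["build", "melt", "wait"], repeat=len(DAYS)), key=score)
print(list(zip(DAYS, best)), score(best))
# -> builds Mon-Wed, melts at the start of Thursday, and holds zero clips
#    from then on, for a total of 10 + 20 + 30 = 60.
```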
I think that the trick is getting it to submit to changes to its utility function based on what we want at that time, without trying to game it. That’s going to be much harder.
† And if it suspects that there are paperclips out in the wild, it will begin building machines to hunt them down and, if and only if it’s Thursday or later, destroy them. It will do this as soon as it learns that it will eventually be a paperclip minimizer for long enough that it is worth worrying about.