One weird trick to turn maximisers into minimisers
A putative new idea for AI control; index here.
A simple and easy design for a u-maximising agent that turns into a u-minimising one.
Let X be some boolean random variable outside the agent’s control, that will be determined at some future time t (based on a cosmic event, maybe?). Set it up so that P(X=1)=ε, and for a given utility u, consider the utility:
u# = (2/ε)Xu - u.
Before t, the expected value of (2/ε)X is 2, so in expectation u# = u. Hence the agent is a u-maximiser. After t, the most likely outcome is X=0, so even a small amount of evidence to that effect is enough to make the agent a u-minimiser.
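A quick numeric sketch of how the expectation works out (Python; the values of ε, u, and the post-t posterior below are made up for illustration):

```python
# Toy check of u# = (2/eps) * X * u - u, with illustrative numbers.
eps = 0.01   # P(X = 1) before t
u = 10.0     # u-value of some candidate outcome

# Before t: E[(2/eps) * X] = (2/eps) * eps = 2, so E[u#] = 2u - u = u.
print((2 / eps) * eps * u - u)   # 10.0, i.e. u: the agent acts as a u-maximiser

# After t, with strong evidence that X = 0, say P(X=1 | evidence) = 1e-6:
q = 1e-6
print((2 / eps) * q * u - u)     # about -9.998, i.e. roughly -u: a u-minimiser
```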
This isn’t perfect corrigibility—the agent would be willing to sacrifice a bit of u-value (before t) in order to maintain its flexibility after t. To combat this effect, we could instead use:
u# = Ω(2/ε)Xu - u.
If Ω is large, then the agent is willing to pay only a very small amount of u-value to maintain flexibility. However, the amount of evidence of X=0 that it needs before becoming a u-minimiser also scales with Ω, so X had better be a clear and convincing event.
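To see how the threshold scales: the expectation of u# is (Ω(2/ε)q - 1)u, where q = P(X=1 | evidence), so the sign flips once q drops below ε/(2Ω). A minimal sketch with illustrative values of ε and Ω:

```python
# Evidence threshold at which the agent flips from u-maximiser to u-minimiser,
# under u# = Omega * (2/eps) * X * u - u.  E[u#] changes sign when
# P(X=1 | evidence) falls below eps / (2 * Omega).
eps = 0.01
for omega in (1, 10, 1000):
    print(omega, eps / (2 * omega))
# 1    0.005
# 10   0.0005
# 1000 5e-06  -> a larger Omega demands correspondingly stronger evidence of X=0
```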
I really don’t see how this would help, compared to just adding time dependence directly.
How would you do that? For a reward function, that’s easy, but this is a utility function.
I really have no idea what the hitch is, here. In principle, a utility function can be over histories of the universe. Just care about different things in different parts of that history.
Let u be a utility function linear in paperclips. Assume the agent has no ability to create or destroy paperclips for the first week; it needs to build up infrastructure and means first. We want it to be maximising u on Monday, -u on Tuesday, and u from Wednesday onwards. How can we accomplish this? And how can we accomplish it without the agent simply turning itself off for Tuesday?
u is a function of the number of paperclips, which is in turn a function of time. So, since u is linear, u(p(t)) is effectively the number of paperclips at time t.
U = ∫[some reasonable bounds] p(t) · s(t) dt, where s(t) = -1 if t falls within the first Tuesday and +1 otherwise.
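As a rough discretised sketch of that integral (days indexed from Monday = 0; the numbers are hypothetical):

```python
def sign(t):
    # -1 during the first Tuesday (day index 1), +1 at all other times
    return -1 if t == 1 else 1

def total_utility(paperclips):
    # paperclips[t] = number of paperclips in existence on day t
    return sum(p * sign(t) for t, p in enumerate(paperclips))

# Nothing can be built during the first week, then 100 clips from day 7 on:
print(total_utility([0] * 7 + [100] * 3))  # 300; the Tuesday term contributes 0
```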
So, the AI knows what it wants over all of the future, depending on time. When evaluating plans for the future, it’s able to take that change into account.
Like, it might spend both Monday and Tuesday just building infrastructure. In any case, turning off won’t help on Tuesday because it will still know that there were paperclips then—not being on to observe them won’t help it.
I don’t see exactly how that would work—it can’t build paper clips during the first week, so u(p(t))=0 during that period. Therefore it should behave exactly as if nothing special happened on Tuesday?
And my comment on turning itself off for Tuesday was more that the Monday AI wouldn’t want its infrastructure ruined by the Tuesday version, and would just turn itself off to prevent that.
I see—I thought you meant it would run for a week building infrastructure, and then be able to build paperclips on the first Monday you named.
I’m not sure what you WANT it to do, really. Do you want it to actually sabotage itself on Tuesday, or do you want it to keep on building infrastructure for later paperclip construction?
Under the system I built, it would do absolutely nothing different on Tuesday and continue to build infrastructure because it anticipates wanting more paperclips by the time it is able to build them at the end of the week. It wants low paperclips now, but it has no influence over paperclips now. It has influence over paperclips in the future, and it wants that there will be more of them when that time comes.
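A toy comparison of "shut down on Tuesday" against "keep building", with made-up numbers, under the same time-dependent utility as above:

```python
# Mon +, Tue -, Wed onwards +; the agent can't make clips in the first two days.
signs = [1, -1] + [1] * 8

keep_building = [0, 0] + [50] * 8   # infrastructure keeps growing -> more clips later
shut_down_tue = [0, 0] + [30] * 8   # a lost day of building -> fewer clips later

print(sum(p * s for p, s in zip(keep_building, signs)))  # 400
print(sum(p * s for p, s in zip(shut_down_tue, signs)))  # 240: shutting down only hurts
```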
I’m trying to implement value change (see eg http://lesswrong.com/lw/jxa/proper_value_learning_through_indifference/ ). The change from u to -u is the easiest example of such a change. The ideal—which probably can’t be implemented in a standard utility function—is that it is a u-maximiser that’s indifferent to becoming a -u maximiser, who’s then indifferent to further change, etc...
Well then, instead of the example with + on Monday, - on Tuesday, and + on Wednesday and all later times (where the agent can’t actually affect paperclip counts on Tuesday anyway), let’s consider a single transition: +u on Monday, Tuesday and Wednesday, and -u on Thursday and all later times, with the agent already having all the infrastructure it needs.
In this case, it will see that it can get a + score by having paperclips Monday through Wednesday, but that any that it still has on Thursday will count against it.
So, it will build paperclips as soon as it learns of this pattern. It will make them have a low melting point, and it will build a furnace†. On Wednesday evening at the stroke of midnight, it will dump its paperclips into the furnace. Because all along, from the very beginning, it will have wanted there to be paperclips M-W, and not after then. And on Thursday it will be happy that there were paperclips M-W, but glad that there aren’t now.
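A toy version of that calculation, with made-up counts (Monday through Wednesday weighted +1, Thursday and Friday weighted -1):

```python
signs = [1, 1, 1, -1, -1]                 # Mon, Tue, Wed, Thu, Fri

keep_them = [100, 100, 100, 100, 100]     # build clips and never melt them
melt_wed  = [100, 100, 100, 0, 0]         # dump them in the furnace Wednesday at midnight

score = lambda plan: sum(p * s for p, s in zip(plan, signs))
print(score(keep_them))  # 100
print(score(melt_wed))   # 300 -> melting at the stroke of midnight is the better plan
```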
I think that the trick is getting it to submit to changes to its utility function based on what we want at that time, without trying to game it. That’s going to be much harder.
† and, if it suspects that there are paperclips out in the wild, it will begin building machines to hunt them down, and iff it’s Thursday or later, destroy them. It will do this as soon as it learns that it will eventually be a paperclip minimizer for long enough that it is worth worrying about.