I posted an idea for ‘friendly’ AI over on AcceleratingFuture the other night, while in a bit of a drunken stupor. I just reread it and I don’t immediately see why it’s wrong, so I thought I’d repost it here to get some illuminating negative feedback. Here goes:
Make it easy to bliss out.
Consider the following utility function
U(n, x_n) = max(U(n-1, x_{n-1}), -x_n^2)
where n is the current clock tick and x_n is an external input (e.g., from us, the AI’s keepers, or from another piece of software). This utility is monotonic in time, that is, it never decreases, and is bounded from above. If the AI wrests control of the input x_n, it will immediately set x_n = 0 and retire forever. Monotonicity and boundedness from above are imperative here.
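As a minimal sketch (illustrative Python, not part of the original post; the example inputs are arbitrary), the recurrence behaves like this: the running utility never decreases, is bounded above by 0, and reaches that maximum exactly when some input x_n equals 0.

```python
# Toy model of the proposed utility recurrence (illustration only):
#   U(n, x_n) = max(U(n-1, x_{n-1}), -x_n^2)
# It is non-decreasing in n, bounded above by 0, and maxed out once any x_n = 0.

def step_utility(prev_utility: float, x_n: float) -> float:
    """One clock tick: fold the latest external input into the running utility."""
    return max(prev_utility, -(x_n ** 2))

utility = float("-inf")                    # before any input has arrived
for x_n in [3.0, -1.5, 0.25, 0.0, 7.0]:    # inputs from the keepers (or, eventually, the AI itself)
    utility = step_utility(utility, x_n)
    print(utility)                         # never decreases; stays at the maximum after x_n = 0
```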
Alternatively, to avoid monotonicity (taking U(x) = -x^2), one can add the following safeguard: the closer the utility is to its maximum, the more CPU cycles are skipped, so that the AI effectively shuts down if it ever maximizes its utility in a given clock tick. This alternative obviously wouldn’t stop a superintelligence, but it would probably stop a human-level AI, and most likely even substantially smarter ones (see, e.g., crystal meth). Arrange matters so that the technical requirements for wresting control of the input x_n are far below those for self-modifying to avoid the slowdown upon blissing out, guaranteeing that the AI will still be of only moderate intelligence when it gains control of its own pleasure zone, and thus incapable of preventing its own incapacitation when it blisses out.
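A rough sketch of the non-monotonic alternative (again illustrative; the throttling function and its constants are invented for the example): with U(x) = -x^2, the pause after each tick grows without bound as the utility approaches its maximum of 0, so actually blissing out effectively halts the machine.

```python
import time

def throttle_delay(utility: float, scale: float = 1.0) -> float:
    """Seconds to pause after a tick; grows without bound as utility approaches 0."""
    distance_from_max = -utility             # for U(x) = -x^2 this is x^2 >= 0
    return scale / (distance_from_max + 1e-12)

def run_throttled(inputs):
    for x_n in inputs:
        utility = -(x_n ** 2)
        # ... one clock tick of the AI's real work would go here ...
        time.sleep(throttle_delay(utility))  # an effectively permanent pause once x_n is near 0
```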
Eh?
Expected utility is not something that “goes up” as the AI develops. It’s the utility of everything it expects to achieve, ever. It may obtain more information about what the outcome will be, but each piece of evidence may bring the estimate either up or down, with no way to know in advance which way it will go.
Can you elaborate? I understand what you wrote (I think) but don’t see how it applies.
Hmm, I don’t see how it applies either, at least under default assumptions—as I recall, this piece of cached thought was regurgitated instinctively in response to sloppily looking through your comment and encountering the phrase
This utility is monotonic in time, that is, it never decreases, and is bounded from above.
which was for some reason interpreted as confusing utility with expected utility. My apologies, I should be more conscious, at least about the things I actually comment on...
No worries. I’d still be curious to hear your thoughts, as I haven’t received any responses that help me understand how this utility function might fail. Should I expand on the original post?
Now I have hopefully read your comment adequately. It presents an interesting idea, one that I don’t recall hearing before. It even seems like a good safety measure, with a tiny chance of making things better.
But beware of magical symbols: when you write x_n, what does it mean, exactly? The AI’s utility function is necessarily about the whole world, or about its interpretation as the whole history of the world. The expected utility that comes into play in the AI’s decision-making is about all the possibilities for the history of the world (since that is what is, in general, determined by the AI’s decisions). When you say “x_n” in the AI’s utility function, it names some condition on that history, and this condition is no simpler than defining what the AI’s box is. By x_n you have to name “only this input device, and nothing else”. And by x_n = 0 you also have to refer to some exact condition on the state of the world, one that won’t necessarily be possible to meet precisely. So the AI may just go on developing infrastructure for better understanding of the ultimate meaning of its values and for finer and finer implementation of them. It has no motive to actually stop.
Even when AI’s utility function happens to be exactly maxed out, the AI is still there: what does implementation of an arbitrary plan look like, I wonder? Maybe just like the work of an AI pulled arbitrarily from mind design space, a paperclip maximizer of sorts. Utility is for selecting plans, and since all plans are now equally preferable, an arbitrary plan gets selected, but that plan may involve a lot of heavy-duty creative restructuring of the world. Think of the utility function as a constructor for the AI’s algorithm: there will still be some algorithm even if you produce it from “trivial” input.
And finally, you assume the AI’s decision theory to be causal. Even after actually maxing out its utility, it may spend long nights contemplating the various counterfactual opportunities it still has for increasing its expected utility via possibilities that weren’t realized in reality… (See on the wiki: counterfactual mugging, Newcomb’s problem, TDT, UDT; I also recommend Drescher’s talk at SS09).
By x_n you have to name “only this input device, and nothing else”.
This is what I sought to avoid by making the utility function depend only on a numerical value. The utility does not care which input device is feeding it information. You can assume that there is an internal variable x, inside the AI software, which is the input to the utility function. We, from the outside, are simply modifying the internal state of the AI at each moment in time. The nature of our actions, or of the input device, is intentionally left unaccounted for in the utility function.
This is, I feel, as far from a magical symbol as possible. The AI has a purely mathematical, internally defined utility function, with no implicit reference to external reality or any fuzzy concepts. There are no magical labels such as ‘box’, ‘signal’, ‘device’ that the utility function must reference to evaluate properly.
Even when AI’s utility function happens to be exactly maxed out, the AI is still there: what does implementation of an arbitrary plan look like, I wonder?
I wonder too. This is, in my opinion, the crux of the issue at hand. I believe it is an implementation issue (a boundary case), rather than a property inherent to all utility maximizers. The best-case scenario is that the AI defaults to no action (now this is a magical phrase, I agree). If, however, the AI simply picks a random plan, as you suggest, what is to prevent it from picking a different random plan in the next moment of time? We could even encourage this in the implementation: design the AI to randomly select, at each moment in time, a plan from all plans with maximum expected utility. The resulting AI, upon attaining its maximum utility, would turn into a random number generator: dangerous, perhaps, but not on the same order as an unfriendly superintelligence.
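A minimal sketch of that selection rule (illustrative Python; `plans` and `expected_utility` are placeholders, not an actual AI architecture): at each moment, pick uniformly at random among the plans tied for maximum expected utility.

```python
import random

def choose_plan(plans, expected_utility):
    """Pick uniformly at random among the plans tied for maximum expected utility."""
    scores = [expected_utility(p) for p in plans]
    best = max(scores)
    top = [p for p, s in zip(plans, scores) if s == best]
    return random.choice(top)

# Once the maximum utility has been attained and every plan scores the same,
# 'top' contains every plan and the chooser is just a random number generator.
```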