Why is 0 a lower bound for disutility? Suppose I make a machine that makes one person one Standard Happiness Unit happier than they’d otherwise have been and then stops and self-destructs; isn’t that a disutility of −1 units?
If what you mean by minimizing disutility is that the machine tries not to cause harm on balance and doesn’t care about any good it does, then I agree with Lumifer (and don’t understand why he’s got all those downvotes for saying it): the trivial zero-risk solution is to shut down immediately without doing anything, and nothing else you do is going to improve on that.
So I guess you have something else in mind. E.g., are you proposing to decompose everyone’s experiences into good and bad, and try to minimize the amount of bad without regard for the amount of good? That doesn’t seem like it can work either (instantly and painlessly killing everyone guarantees zero disutility thereafter), so again it probably isn’t what you mean.
OK, third try. Perhaps you mean that you look at all consequences of what the AI does (compared, I guess, with a world where it doesn’t do anything), and split those into positive and negative consequences, and try to minimize the expected sum of all the negative ones. The problem with this (I think) is that it’s not clear how you should actually split things up; I don’t see that there’s a canonical way. And also that it seems unlikely that any nontrivial action has no negative consequences, in which case once again the optimum is going to be the trivial one of never doing anything.
I’m sorry my post was so ambiguous. I’ll try to put the idea in clearer words.
Disutility, like utility, is a learning machine’s internal representation of its valuation of the environment. The machine observes its environment (virtual or physical) and runs a disutility function on its observations to establish how “desirable” the environment is according to that function.
Example: A pest control drone patrols the area it is programmed to patrol. Its disutility function is “number of small insects that are neither butterflies nor spiders, plus (humans harmed by my actions) × 1,000,000”.
It decides how to act by modelling the possible worlds that would result from the things it could do and choosing the one with the lowest expected disutility, provided at least one option falls below some arbitrary “shutdown threshold”; if nothing the drone can “imagine” is good enough, it shuts down.
In the example: The drone might model what would happen if it pointed a mosquito-zapping laser at a bug that it sees. If the world with a zapped bug has less disutility, it’ll do that. If that wouldn’t work because, say, the bug is sitting on some human’s lap, it will not do that but instead try one of its other options, like wait until the bug presents a better target or move on to a different part of the area.
And if there is no disutility to reduce—because the calculated disutility is 0 and nothing the system could do would reduce it further—the system does nothing. This is the difference from a utility maximizer, because utility seems to always be (at least implicitly) unbounded.
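To make that concrete, here is a minimal Python sketch of the decision rule I have in mind. The disutility formula follows the pest-control example above; the world representation, the predict function, and the particular shutdown threshold value are illustrative placeholders, not part of any real design.

    # Sketch of a disutility-minimizing agent with a shutdown threshold.
    # All names and numbers below are hypothetical stand-ins.

    SHUTDOWN_THRESHOLD = 100.0  # assumed: above this, no imagined outcome is "good enough"

    def disutility(world):
        """Disutility of a modelled world, per the pest-control example:
        remaining pests, plus a huge penalty per human harmed."""
        return world["pest_count"] + 1_000_000 * world["humans_harmed"]

    def choose_action(current_world, actions, predict):
        """Pick the action whose predicted world has the lowest expected
        disutility. Return None (do nothing / shut down) if disutility is
        already 0 or every imagined outcome exceeds the threshold."""
        if disutility(current_world) == 0:
            return None  # nothing left to reduce: the system halts "satisfied"
        scored = [(disutility(predict(current_world, a)), a) for a in actions]
        best_score, best_action = min(scored, key=lambda pair: pair[0])
        if best_score > SHUTDOWN_THRESHOLD:
            return None  # nothing the drone can "imagine" is good enough
        return best_action

The key difference from a utility maximizer is that both halting branches return None: success and hopelessness alike end in inaction rather than in further optimization.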
Of course this still presents the obvious failure mode where the system is prone to hack itself and change its internal representation. In the bug-zapping example, the drone might find that the best way to reach 0 disutility is to turn off its cameras and simply not see any bugs. This remains a serious problem. But at least at that point the system shuts down “satisfied”, rather than turning its future light cone into computronium to represent ever-higher internal values of utility.
Suppose we are considering an agent with a more “positive” mission than that of your pest control drone (whose purpose is best expressed negatively: get rid of small pests). For instance, perhaps the agent is working for a hedge fund and trying to increase the value of its holdings, or perhaps it’s trying to improve human health and give people longer healthier lives.
How do you express that in terms of “disutility”?
I think what is doing the work here is not using “disutility” rather than “utility”, but having a utility function that’s (something like) bounded above and that can’t be driven sky-high by (what we would regard as) weird and counterproductive actions. (So, for the “positive” agents above, rather than forcing what they do into a “disutility” framework, one could give the hedge fund machine a utility function that stops increasing after the value of the fund reaches $100bn, and the health machine a utility function that stops increasing after 95% of people are getting 70 QALYs or more, or something like that.) And then some counterbalancing, not artificially bounded, negative term (“number of humans harmed” in your example; maybe more generally some measure of “amount of change” would do, though I suspect that would be hard to express rigorously) should ensure that the machine never has reason to do anything too drastic.
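Something like the following toy sketch is the shape I have in mind for the hedge-fund case; the cap, the harm measure, and the weight are numbers I’m making up purely for illustration.

    # Sketch of a "bounded positive term plus unbounded negative term" utility.
    # cap_bn and harm_weight are illustrative assumptions, not a real proposal.

    def bounded_utility(fund_value_bn, humans_harmed, cap_bn=100.0, harm_weight=1e6):
        # The positive term saturates at the cap, so there is nothing to gain
        # from driving the fund's value arbitrarily high...
        bounded_gain = min(fund_value_bn, cap_bn)
        # ...while the harm term is deliberately left unbounded.
        return bounded_gain - harm_weight * humans_harmed

Because the upside saturates and the downside doesn’t, drastic schemes that add a little more value at some risk of harm can only ever lose.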
So: yeah, I think this is far from crazy, but I don’t think it’s going to solve the Friendly AI problem, for a few reasons:
A system of this kind can only ever do a limited amount of good. I suppose you get around that by making a new one with a broadly similar utility function but larger bounds, once the first one has finished its job without destroying humanity. The overall effect is a kind of hill-climbing algorithm: improve the world as much as you like, but each step has to be not too large and human beings step in and take stock after each step.
You are at risk of being overtaken by another system with fewer scruples about large changes—in particular, by one that doesn’t require repeated human intervention.
Relatedly, this doesn’t seem like the kind of restriction that’s stable under self-modification; we aren’t going to bootstrap our way to a quick positive singularity this way without serious risk of disaster.
To be sure that a system of this kind really is safe, that “don’t do too much harm” term in its (dis)utility function really wants to be quite general. (Caricature of the kind of failure you want to avoid: your bug-killer figures out a new insecticide and a means of delivering it widely; it doesn’t harm anyone now alive, but it does have reproductive effects, with the eventual consequence that people two generations from now will be 20 IQ points stupider or something. But no particular person is worse off.) But (1) this is going to be really hard to specify and (2) it’s likely that everything the system can think of has some long-range consequences that might be bad, so very likely it ends up never doing anything.
I agree on all points. It seems “bounded utility” might be a better term than “disutility”. The main point is that a halting condition triggered by success, and a system that is essentially trying to find the conditions where it can shut itself off, seems less likely to go horribly wrong than an unbounded search for ever more utility.
This is not an attempt to solve Friendly AI. I just figure a simple hard-coded limit to how much of anything a learning machine could want chops off a couple of avenues for disaster.