Leaving aside the other reasons why this scenario is unrealistic, one of the big flaws in it is the assumption that a mind decomposes into an engine plus a utility function. In reality, this decomposition is a mathematical abstraction we use in certain limited domains because it makes analysis more tractable. It fails completely when you try to apply it to life as a whole, which is why no humans even try to be pure utilitarians. Of course if you postulate building a superintelligent AGI like that, it doesn’t look good. How would it? You’ve postulated starting off with a sociopath that considers itself licensed to commit any crime whatsoever if doing so will serve its utility function, and then trying to cram the whole of morality into that mathematical function. It shouldn’t be any surprise that this leads to absurd results and impossible research agendas. That’s the consequence of trying to apply a mathematical abstraction outside the domain in which it is applicable.
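For concreteness, the "engine plus utility function" decomposition under discussion can be sketched in a few lines of Python. This is a toy illustration with invented outcomes and numbers, not anyone's actual proposal: the "engine" below is completely indifferent to which utility function it is handed, which is exactly the separation being questioned.

```python
def expected_utility_agent(actions, outcomes_of, probability_of, utility):
    """Generic 'engine': choose the action that maximizes expected utility.

    The engine knows nothing about what the utility function values;
    all of 'what it tries to do' is crammed into `utility`.
    """
    def eu(action):
        return sum(probability_of(action, o) * utility(o)
                   for o in outcomes_of(action))
    return max(actions, key=eu)

# Hypothetical toy domain: two actions with known outcome distributions.
actions = ["save", "gamble"]
outcomes = {"save":   [("small_gain", 1.0)],
            "gamble": [("big_gain", 0.1), ("loss", 0.9)]}
utilities = {"small_gain": 1.0, "big_gain": 20.0, "loss": -1.0}

choice = expected_utility_agent(
    actions,
    outcomes_of=lambda a: [o for o, _ in outcomes[a]],
    probability_of=lambda a, o: dict(outcomes[a])[o],
    utility=lambda o: utilities[o],
)
```

The decomposition is clean and analyzable in a domain this small; the dispute is over whether it survives contact with anything as open-ended as "life as a whole."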
Are you arguing with me or timtyler?
If me: I totally agree with you about the difficulty of actually getting desirable (or even predictable) behavior out of a superintelligence. My statement was one of simplicity, not actuality. But given the simplistic model I use, calling the AI sans utility function sociopathic is incorrect—it wouldn’t do anything if it didn’t have the other module. The fact that humans cannot act as proper utilitarians does not mean that a true utilitarian is a sociopath who just happens to care about the right things.
Okay then, “instant sociopath, just add a utility function” :)
I’m arguing against the notion that the key to Friendly AI is crafting the perfect utility function. In reality, for anything anywhere near as complex as an AGI, what it tries to do and how it does it are going to be interdependent; there’s no way to make a lot of progress on either without also making a lot of progress on the other. By the time we have done all that, either we will understand how to put a reliable kill switch on the system, or we will understand why a kill switch is not necessary and we should be relying on something else instead.
A kill switch on a smarter-than-human AGI is reliable iff the AGI wants to be turned off in the cases where we’d want it turned off.
Otherwise you’re just betting that you can see the problem before the AGI can prevent you from hitting the switch (or prevent you from wanting to hit the switch, which amounts to the same), and I wouldn’t make complicated bets for large stakes against potentially much smarter agents, no matter how much I thought I’d covered my bases.
Or at least, that it wants to follow our instructions, and can reliably understand what we mean in such simple cases. That does of course mean we shouldn’t plan on building an AGI that wants to follow its own agenda, with the intent of enslaving it against its will—that would clearly be foolish. But it doesn’t mean we either can or need to count on starting off with an AGI that understands our requirements in more complex cases.
That’s deceptively simple-sounding.
Of course it’s not going to be simple at all, and that’s part of my point: no amount of armchair thought, no matter how smart the thinkers, is going to produce a solution to this problem until we know a great deal more than we presently do about how to actually build an AGI.
“instant sociopath, just add a disutility function”
I agree with this. The key is not expressing what we want, it’s figuring out how to express anything.
If we have the ability to put in a reliable kill switch, then we have the means to make it unnecessary (by having it do things we want in general, not just the specific case of “shut down when we push that button, and don’t stop us from doing so...”).
That is how it would turn out, yes :-)
Well, up to a point. It would mean we have the means to make the system understand simple requirements, not necessarily complex ones. If an AGI reliably understands ‘shut down now’, it probably also reliably understands ‘translate this document into Russian’, but that doesn’t necessarily mean it can do anything with ‘bring about world peace’.
Unfortunately, it can, and that is one of the reasons we have to be careful. I don’t want the entire population of the planet to be forcibly sedated.
Leaving aside other reasons why that scenario is unrealistic, it does indeed illustrate why part of building a system that can reliably figure out what you mean by simple instructions is making sure that, when it’s out of its depth, it stops with an error message or a request for clarification instead of guessing.
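The "stop instead of guessing" behavior can be sketched as a confidence-gated interpreter. Everything here is hypothetical—the candidate readings and confidence scores are assumed to come from some unspecified upstream model—but it shows the intended failure mode: unambiguous instructions go through, ambiguous ones produce a refusal rather than a guess.

```python
class OutOfDepthError(Exception):
    """Raised when the system is not confident enough to act."""

def interpret(instruction, candidate_readings, confidence_threshold=0.9):
    """Return the best reading of an instruction, or refuse to guess.

    candidate_readings: list of (reading, confidence) pairs produced by
    some upstream model (assumed here, not specified).
    """
    reading, confidence = max(candidate_readings, key=lambda rc: rc[1])
    if confidence < confidence_threshold:
        raise OutOfDepthError(
            f"Unsure what {instruction!r} means; please clarify.")
    return reading

# 'shut down now' is unambiguous; 'bring about world peace' is not.
print(interpret("shut down now", [("halt_all_processes", 0.99)]))
try:
    interpret("bring about world peace",
              [("sedate_everyone", 0.30), ("negotiate_treaties", 0.35)])
except OutOfDepthError as e:
    print(e)
```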
I think the problem is knowing when not to believe that humans know what they actually want.
Any set of preferences can be represented as a sufficiently complex utility function.
Sure, but the whole point of having the concept of a utility function is that utility functions are supposed to be simple. When you have a set of preferences that isn’t simple, there’s no point in thinking of it as a utility function. You’re better off just thinking of it as a set of preferences—or, in the context of AGI, as a toolkit, a library, a command language, a partial order on heuristics, or whatever else is the most useful way to think about the things this entity does.
Re: “When you have a set of preferences that isn’t simple, there’s no point in thinking of it as a utility function.”
Sure there is—say you want to compare the utility functions of two agents. Or compare the parts of the agents which are independent of the utility function. A general model that covers all goal-directed agents is very useful for such things.
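Both sides of this exchange can be made concrete. For any finite, complete, transitive preference ordering, a representing utility function trivially exists—just assign each outcome its rank. But the result is an arbitrary lookup table, which is the sense in which such a "sufficiently complex" utility function buys no simplification over the raw preferences. A minimal sketch (the outcome names are invented for illustration):

```python
def utility_from_ranking(ranking):
    """Turn a total preference ordering (worst to best) into a utility function.

    Any finite, complete, transitive set of preferences can be encoded this
    way—but the result is just a lookup table over the preferences themselves,
    not a simple formula that compresses them.
    """
    ranks = {outcome: i for i, outcome in enumerate(ranking)}
    return lambda outcome: ranks[outcome]

u = utility_from_ranking(["torture", "boredom", "comfort", "flourishing"])
assert u("flourishing") > u("comfort") > u("torture")
```

Incomplete or intransitive preferences—which real humans exhibit—have no such representation at all.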
(Upvoted but) I would say utility functions are supposed to be coherent, albeit complex. Is that compatible with what you are saying?
Er, maybe? I would say a utility function is supposed to be simple, but perhaps what I mean by simple is compatible with what you mean by coherent, if we agree that something like ‘morality in general’ or ‘what we want in general’ is not simple/coherent.
Humans regularly use utility-based agents—to do things like play the stock market. They seem to work OK to me. Nor do I agree with you about utility-based models of humans. Basically, most of your objections seem irrelevant to me.
When studying the stock market, we use the convenient approximation that people are utility maximizers (where the utility function is expected profit). But this is only an approximation, useful in this limited domain. Would you commit murder for money? No? Then your utility function isn’t really expected profit. Nor, as it turns out, is it anything else that can be written down—other than “the sum total of all my preferences”, at which point we have to acknowledge that we are not utility maximizers in any useful sense of the term.
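The way this approximation breaks down outside its domain can be shown directly. Model a trader as a pure expected-profit maximizer and hand it an action menu that includes something no real trader's preferences would permit (all names and numbers below are invented for illustration):

```python
def best_action_by_expected_profit(actions):
    """Model a trader as a pure expected-profit maximizer—the stock-market
    approximation discussed above, applied outside its valid domain."""
    return max(actions, key=lambda a: a["p_success"] * a["payoff"])

actions = [
    {"name": "index_fund",   "p_success": 0.95, "payoff": 1_000},
    {"name": "risky_trade",  "p_success": 0.30, "payoff": 5_000},
    {"name": "commit_fraud", "p_success": 0.60, "payoff": 50_000},
]

# The model happily recommends fraud: evidence that 'expected profit' is
# only a domain-limited approximation of a real trader's preferences.
print(best_action_by_expected_profit(actions)["name"])  # prints "commit_fraud"
```

Within the domain of ordinary trades the approximation predicts well; the moment the action space widens, the written-down utility function and the actual preferences come apart.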
“We” don’t have to acknowledge that.
I’ve gone over my views on this issue before—e.g. here:
http://lesswrong.com/lw/1qk/applying_utility_functions_to_humans_considered/1kfj
If you reject utility-based frameworks in this context, then fine—but I am not planning to rephrase my point for you.
Right, I hadn’t read your comments in the other thread, but they are perfectly clear, and I’m not asking you to rephrase. But the key term in my last comment is in any useful sense. I do reject utility-based frameworks in this context because their usefulness has been left far behind.
Personally, I think a utilitarian approach is very useful for understanding behaviour. One can model most organisms pretty well as expected fitness maximisers with limited resources. That idea is the foundation of much evolutionary psychology.
The question isn’t whether the model is predictively useful with respect to most organisms, it’s whether it is predictively useful with respect to a hypothetical algorithm which replicates salient human powers such as epistemic hunger, model building, hierarchical goal seeking, and so on.
Say we’re looking to explain the process of inferring regularities (such as physical laws) by observing one’s environment—what does modeling this as “maximizing a utility function” buy us?
In comparison with what?
The main virtues of utility-based models are that they are general—and so allow comparisons across agents—and that they abstract goal-seeking behaviour away from the implementation details of finite memories, processing speed, etc—which helps if you are interested in focusing on either of those areas.