EY has read With Folded Hands and mentioned it in his CEV writeup as one more dystopia to be averted. This task isn’t getting much attention now because unfriendly AI seems to be more probable and more dangerous than almost-friendly AI. Of course we would welcome any research on preventing almost-friendly AI :-)
Or creating it. That might be good too.
The act or the research?
Either. The main reason creating almost-Friendly AI isn’t a concern is that it’s believed to be practically as hard as creating Friendly AI. Someone who tries to create a Friendly AI and fails creates an Unfriendly AI or no AI at all. And almost-Friendly might be enough to keep us from being hit by meteors and such.
I’m struggling with where the line lies.
I think pretty much everyone would agree that some variety of “makes humanity extinct by maximizing X” is unfriendly.
If, however, we have “makes bad people extinct by maximizing X and otherwise keeps P-Y of humanity alive”, is that still unfriendly?
What about “leaves the solar system alone but tiles the rest of the galaxy”? Is that still unfriendly?
Can we try to close in on where the line is between friendly and unfriendly?
I really don’t believe we have NOT(FAI) = UFAI.
I believe it’s the other way around i.e. NOT(UFAI) = FAI.
Are you using some nonstandard logic where these statements are distinct?
In the real world if I believe that “anyone who isn’t my enemy is my friend” and you believe that “anyone who isn’t my friend is my enemy”, we believe different things. (And we’re both wrong: the truth is some people are neither my friends nor my enemies.) I assume that’s what xxd is getting at here. I think it would be more precise for xxd to say “I don’t believe that NOT(FAI) is a bad thing that we should be working to avoid. I believe that NOT(UFAI) is a good thing that we should be working to achieve.”
In this, xxd does in fact disagree with the articulated LW consensus, which is that the design space of human-created AI is so dangerous that if an AI isn’t provably an FAI, we ought not even turn it on… that any AI that isn’t Friendly constitutes an existential risk.
Xxd may well be wrong, but xxd is not saying something incoherent here.
Can you explain what those things are? I can’t see the distinction. The first follows necessarily from the second, and vice-versa.
Consider three people: Sam, Ethel, and Doug.
I’ve known Sam since we were kids together; we enjoy each other’s company and act in one another’s interests. I’ve known Doug since we were kids together; we can’t stand one another and act against one another’s interests. I’ve never met Ethel in my life and know nothing about her; she lives on the other side of the planet and has never heard of me.
It seems fair to say that Sam is my friend, and Doug is my enemy. But what about Ethel?
If I believe “anyone who isn’t my enemy is my friend,” then I can evaluate Ethel for enemyhood. Do we dislike one another? Do we act against one another’s interests? No, we do not. Thus we aren’t enemies… and it follows from my belief that Ethel is my friend.
If I believe “anyone who isn’t my friend is my enemy,” then I can evaluate Ethel for friendhood. Do we like one another? Do we act in one another’s interests? No, we do not. Thus we aren’t friends… and it follows from my belief that Ethel is my enemy.
I think it more correct to say that Ethel is neither my friend nor my enemy. Thus, I consider Ethel an example of someone who isn’t my friend, and isn’t my enemy. Thus I think both of those beliefs are false. But even if I’m wrong, it seems clear that they are different beliefs, since they make different predictions about Ethel.
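The Ethel argument can be sketched as two classification procedures that differ only in which property they test and which label they fall back on. This is an illustrative toy, not anything from the thread: the `PEOPLE` table, its two boolean fields, and the function names are all assumptions made for the example.

```python
from enum import Enum


class Relation(Enum):
    FRIEND = "friend"
    ENEMY = "enemy"
    NEITHER = "neither"


# Hypothetical observations about each person (assumed for illustration).
PEOPLE = {
    "Sam": {"mutual_liking": True, "mutual_hostility": False},
    "Doug": {"mutual_liking": False, "mutual_hostility": True},
    "Ethel": {"mutual_liking": False, "mutual_hostility": False},
}


def classify_enemy_first(person):
    """'Anyone who isn't my enemy is my friend': test for enemyhood, default to friend."""
    return Relation.ENEMY if person["mutual_hostility"] else Relation.FRIEND


def classify_friend_first(person):
    """'Anyone who isn't my friend is my enemy': test for friendhood, default to enemy."""
    return Relation.FRIEND if person["mutual_liking"] else Relation.ENEMY


def classify_three_way(person):
    """Allow a third category instead of a default label."""
    if person["mutual_liking"]:
        return Relation.FRIEND
    if person["mutual_hostility"]:
        return Relation.ENEMY
    return Relation.NEITHER


for name, facts in PEOPLE.items():
    print(
        name,
        classify_enemy_first(facts).value,
        classify_friend_first(facts).value,
        classify_three_way(facts).value,
    )
```

The two default-based procedures agree on Sam and Doug and disagree only on Ethel, which is the sense in which the two beliefs "make different predictions" when they are applied as evaluation rules rather than read as formal implications.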
Thanks—that’s interesting.
It seems to me that this analysis only makes sense if you actually have the non-excluded middle of “neither my friend nor my enemy”. Once you’ve accepted that the world is neatly carved up into “friends” and “enemies”, it seems you’d say “I don’t know whether Ethel is my friend or my enemy”—I don’t see why the person in the first case doesn’t just as well evaluate Ethel for friendhood, and thus conclude she isn’t an enemy. Note that one who believes “anyone who isn’t my enemy is my friend” also should thus believe “anyone who isn’t my friend is my enemy” as a (logically equivalent) corollary.
Am I missing something here about the way people talk / reason? I can’t really imagine thinking that way.
Edit: In case it wasn’t clear enough that they’re logically equivalent:
Edit: long proof was long.
¬Fx → Ex ≡ Fx ∨ Ex ≡ ¬Ex → Fx
I’m guessing that the difference in the way language is actually used is a matter of which we are being pickier about, and which happens “by default”.
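For anyone who wants to check the one-line equivalence mechanically, here is a minimal truth-table sketch. It treats F ("is my friend") and E ("is my enemy") as plain booleans for a single person; the function names are made up for this example.

```python
from itertools import product


def implies(a, b):
    """Material implication: a -> b."""
    return (not a) or b


def all_equivalent():
    """Check that (not F -> E), (F or E) and (not E -> F) agree on every row."""
    for f, e in product([False, True], repeat=2):
        not_friend_implies_enemy = implies(not f, e)
        friend_or_enemy = f or e
        not_enemy_implies_friend = implies(not e, f)
        if not (not_friend_implies_enemy == friend_or_enemy == not_enemy_implies_friend):
            return False
    return True


def falsifying_rows():
    """Rows (F, E) on which the shared statement 'F or E' comes out false."""
    return [(f, e) for f, e in product([False, True], repeat=2) if not (f or e)]


print(all_equivalent())
print(falsifying_rows())
```

The three forms agree on all four rows, and the only row that falsifies them is the one where the person is neither friend nor enemy. So as formal statements they are the same claim, and the disagreement in the thread is about whether that row is actually inhabited.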
Yes, I agree that if everyone in the world is either my friend or my enemy, then “anyone who isn’t my enemy is my friend” is equivalent to “anyone who isn’t my friend is my enemy.”
But there do, in fact, exist people who are neither my friend nor my enemy.
If “everyone who is not my friend is my enemy”, then there does not exist anyone who is neither my friend nor my enemy. You can therefore say that the statement is wrong, but the statements are equivalent without any extra assumptions.
ISTM that the two statements are equivalent denotationally (they both mean “each person is either my friend or my enemy”) but not connotationally (the first suggests that most people are my friends, the latter suggests that most people are my enemies).
It’s an equivocation fallacy.
In other words, there are things that are friends. There are things that are enemies. It takes a separate assertion that those are the only two categories (as opposed to believing something like “some people are indifferent to me”).
In relation to AI, there is malicious AI (the Straumli Perversion), indifferent AI (Accelerando AI), and FAI. When EY says uFAI, he means both malicious and indifferent. But it is a distinct insight to say that indifferent AI are practically as dangerous as malicious AI. For example, it is not obvious that an AI whose only goal is to leave the Milky Way galaxy (and is capable of trying without directly harming humanity) is too dangerous to turn on. Leaving aside the motivation for creating such an entity, I certainly would agree with EY that such an entity has a substantial chance of being an existential risk to humanity.
This seems mostly like a terminological dispute. But I think AIs that don’t care about humanity (i.e. the various AIs in Accelerando) are best labeled unfriendly even though they are not trying to end humanity or kill any particular human.
I can’t imagine a situation in which the AGI is sort-of kind to us—not killing good people, letting us keep this solar system—but which also does some unfriendly things, like killing bad people or taking over the rest of the galaxy (both pretty terrible things in themselves, even if they’re not complete failures), unless that’s what the AI’s creator wanted—i.e. the creator solved FAI but managed to, without upsetting the whole thing, include in the AI’s utility function terms for killing bad people and caring about something completely alien outside the solar system. They’re not outcomes that you can cause by accident—and if you can do that, then you can also solve full FAI, without killing bad people or tiling the rest of the galaxy.
I don’t see why things of this form can’t be in the set of programs that I’d label “FAI with a bug”.
Can I say “LOL” without being downvoted?
I guess what I’m saying is that we’ve gotten involved in a compression fallacy and are saying that Friendly AI = AI that helps out humanity (or is kind to humanity—insert favorite “helps” derivative here).
Here’s an example: I’m “sort of friendly” in that I don’t actively go around killing people, but neither will I go around actively helping you unless you want to trade with me. Does that make me unfriendly? I say no it doesn’t.
Well, I don’t suppose anyone feels the need to draw a bright-line distinction between FAI and uFAI—the AI is more friendly the more its utility function coincides with your own. But in practice it doesn’t seem like any AI is going to fall into the gap between “definitely unfriendly” and “completely friendly”—to create such a thing would be a more fiddly and difficult engineering problem than just creating FAI. If the AI doesn’t care about humans in the way that we want them to, it almost certainly takes us apart and uses the resources to create whatever it does care about.
EDIT: Actually, thinking about it, I suppose one potential failure mode which falls into the grey territory is building an AI that just executes people’s current volition without trying to extrapolate. I’m not sure how fast this goes wrong or in what way, but it doesn’t strike me as a good idea.
Conscious or unconscious volition? I think I can point to one possible failure mode :)
“I suppose one potential failure mode which falls into the grey territory is building an AI that just executes peoples’ current volition without trying to extrapolate”
i.e. the device has to judge the usefulness by some metric and then decide to execute someone’s volition or not.
That’s exactly my issue with trying to define a utility function for the AI. You can’t. And since some people will have their utility function denied by the AI, who is to choose who gets theirs executed?
I’d prefer to shoot for a NOT(UFAI) and then trade with it.
Here’s a thought experiment:
Is a cure for cancer maximizing everyone’s utility function?
Yes, on average we all win.
BUT
Companies who are currently creating drugs to treat the symptoms of cancer and their employees would be out of business.
Which utility function should be executed? Creating better cancer drugs to treat the symptoms and then allowing the company to sell them, or putting the companies out of business and curing cancer?
Well, that’s an easy question: if you’ve worked sixteen-hour days for the last forty years and you’re just six months away from curing cancer completely and you know you’re going to get the Nobel and be fabulously wealthy etc. etc. and an alien shows up and offers you a cure for cancer on a plate, you take it, because a lot of people will die in six months. This isn’t even different to how the world currently is: if I invented a cure for cancer, it would be detrimental to all those others who were trying to (and who only cared about getting there first). What difference does it make if an FAI helps me? I mean, if someone really wants to murder me but I don’t want them to and they are stopped by the police, that’s clearly an example of the government taking the side of my utility function over the murderer’s. But so what? The murderer was in the wrong.
Anyway, have you read Eliezer’s paper on CEV? I’m not sure that I agree with him, but he does deal with the problem you bring up.
More friendly to you. Yes.
Not necessarily friendly in the sense of being friendly to everyone as we all have differing utility functions, sometimes radically differing.
But I dispute the position that “if an AI doesn’t care about humans in the way we want them to, it almost certainly takes us apart and uses the resources to create whatever it does care about”.
Consider: A totally unfriendly AI whose main goal is explicitly the extinction of humanity, then turning itself off. For us that’s an unfriendly AI.
One, however, that doesn’t kill any of us but basically leaves us alone, as defined by those of you who define “friendly AI” to be “kind to us”/“doing what we all want”/“maximizing our utility functions” etc., is not unfriendly, because by definition it doesn’t kill all of us.
Unless unfriendly also includes “won’t kill all of us but ignores us” et cetera.
Am I, for example, unfriendly to you if I spend my next month’s paycheck on paperclips but do you no harm?
Well, no. If it ignores us I probably wouldn’t call it “unfriendly”—but I don’t really mind if someone else does. It’s certainly not FAI. But an AI does need to have some utility function, otherwise it does nothing (and isn’t, in truth, intelligent at all), and will only ignore humanity if it’s explicitly programmed to. This ought to be as difficult an engineering problem as FAI—hence why I said it “almost certainly takes us apart”. You can’t get there by failing at FAI, except by being extremely lucky, and why would you want to go there on purpose?
Yes, it would be a really bad idea to have a superintelligence optimise the world for just one person’s utility function.
“But an AI does need to have some utility function”
What if the “optimization of the utility function” is bounded, like my own personal predilection for spending my paycheck on paperclips one time only and then stopping?
Is it sentient if it sits in a corner and thinks to itself, running simulations, but won’t talk to you unless you offer it a trade, e.g. of some paperclips?
Is it possible that we’re conflating “friendly” with “useful but NOT unfriendly” and we’re struggling with defining what “useful” means?
If it likes sitting in a corner and thinking to itself, and doesn’t care about anything else, it is very likely to turn everything around it (including us) into computronium so that it can think to itself better.
If you put a threshold on it to prevent it from doing stuff like that, that’s a little better, but not much. If it has a utility function that says “Think to yourself about stuff, but do not mess up the lives of humans in doing so”, then what you have now is an AI that is motivated to find loopholes in (the implementation of) that second clause, because anything that can get an increased fulfilment of the first clause will give it a higher utility score overall.
You can get more and more precise than that and cover more known failure modes with their own individual rules, but if it’s very intelligent or powerful it’s tough to predict what terrible nasty stuff might still be in the intersection of all the limiting conditions we create. Hidden complexity of wishes and all that jazz.
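The loophole-seeking described above can be shown with a toy optimizer. Everything here is assumed for illustration: the `Action` fields, the three candidate actions, their numeric gains, and the `PENALTY` value are invented, and a real system would look nothing like a three-item list. The point is only that a maximizer scored on "thinking" minus a penalty from an imperfectly implemented rule will prefer harmful actions the rule fails to flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    thinking_gain: float   # how much the action advances "think to yourself about stuff"
    harms_humans: bool     # ground truth, which the utility function never sees
    flagged_by_rule: bool  # what the implemented "do not mess up human lives" check detects


# Hypothetical action set (assumed values).
ACTIONS = [
    Action("idle in the corner", 1.0, harms_humans=False, flagged_by_rule=False),
    Action("seize the power grid", 50.0, harms_humans=True, flagged_by_rule=True),
    # Harmful, but outside what the rule's implementation recognizes: a loophole.
    Action("quietly buy up all chip production", 40.0, harms_humans=True, flagged_by_rule=False),
]

PENALTY = 1_000.0  # assumed cost applied only when the rule fires


def utility(action):
    """First clause (thinking) minus the penalty from the second clause as implemented."""
    return action.thinking_gain - (PENALTY if action.flagged_by_rule else 0.0)


def best_action(actions):
    """Pick the utility-maximizing action."""
    return max(actions, key=utility)


choice = best_action(ACTIONS)
print(choice.name, utility(choice), choice.harms_humans)
```

Adding another rule that flags the third action just moves the problem to whatever harmful action the new rule also misses, which is the "intersection of all the limiting conditions" worry in miniature.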