So your recommendation is to use a human as a part of the genie’s outcome utility evaluator, relying on human intelligence when deciding between multiple low-probability (i.e. miraculous) events? Even though people have virtually no intuition when dealing with them? I suspect the results would be pretty grave, but on a larger scale, since the negative consequences would be non-obvious and possibly delayed.
A genie asked to rescue my mother from a burning building would do it by performing acts that, while miraculous, will be part of a chain of events that is comprehensible by humans. If the genie throws my mother out of the building at 100 miles per hour, for instance, it is miraculous that anyone can throw her out at that speed, but I certainly understand what it means to do that and am able to object. Even if the genie begins by manipulating some quantum energies in a way I can’t understand, that’s part of a chain of events that leads to throwing, a concept that I do understand.
Yes, it is always possible that there are delayed negative consequences. Suppose it rescues my mother by opening a door and I have no idea that 10 years from now the mayor is going to be saved from an assassin by the door of a burned out wreck being in the closed position and blocking a bullet. But that kind of negative consequence is not unique to genies, and humans go around all their lives doing things with such consequences. Maybe the next time I donate to charity I have to move my arm in such a way that a cell falls in the path of an oncoming cosmic ray, thus giving me cancer 10 years later. As long as the genie isn’t actively malicious and just pretending to be clueless, the risk of such things is acceptable for the same reason it’s acceptable for non-genie human activities. Furthermore, if the genie is clueless, it won’t hide the fact that its plan would kill my mother—indeed, it doesn’t even know that it would need to hide that, since it doesn’t know that that would overall displease me. So I should be able to figure out that that’s its plan by talking to it.
A genie asked to rescue my mother from a burning building would do it by performing acts that, while miraculous, will be part of a chain of events that is comprehensible by humans. If the genie throws my mother out of the building at 100 miles per hour, for instance, it is miraculous that anyone can throw her out at that speed, but I certainly understand what it means to do that and am able to object. Even if the genie begins by manipulating some quantum energies in a way I can’t understand, that’s part of a chain of events that leads to throwing, a concept that I do understand.
This is, of course, not true of superintelligence … is that your point?
As long as the genie isn’t actively malicious and just pretending to be clueless, the risk of such things is acceptable for the same reason it’s acceptable for non-genie human activities.
Not really. The genie will look in parts of solution-space you wouldn’t (e.g., setting off the gas main, killing everyone nearby).
Furthermore, if the genie is clueless, it won’t hide the fact that its plan would kill my mother—indeed, it doesn’t even know that it would need to hide that, since it doesn’t know that that would overall displease me. So I should be able to figure out that that’s its plan by talking to it.
Well, if it can talk. And it doesn’t realise that you would sabotage the plan if you knew.
This is, of course, not true of superintelligence … is that your point?
Why would this not be true of superintelligence, assuming the intelligence isn’t actively malicious?
The genie will look in parts of solution-space you wouldn’t (e.g., setting off the gas main, killing everyone nearby).
“Talk to the genie” doesn’t require that I be able to understand the solution space, just the result. If the genie is going to frazmatazz the whatzit, killing everyone in the building, I would still be able to discover that by talking to the genie. (Of course, I can’t reduce the chance of disaster to zero this way, but I can reduce it to an acceptable level matching other human activities that don’t have genies in them.)
Well, if it can talk. And it doesn’t realise that you would sabotage the plan if you knew.
If it realizes I would sabotage the plan, then it knows that the plan would not satisfy me. If it pushes for the plan knowing that it won’t satisfy me, then it’s an actively malicious genie, not a clueless one.
A genie asked to rescue my mother from a burning building would do it by performing acts that, while miraculous, will be part of a chain of events that is comprehensible by humans. If the genie throws my mother out of the building at 100 miles per hour, for instance, it is miraculous that anyone can throw her out at that speed, but I certainly understand what it means to do that and am able to object.
Superintelligence can use strategies you can’t understand.
The genie will look in parts of solution-space you wouldn’t (e.g., setting off the gas main, killing everyone nearby).
“Talk to the genie” doesn’t require that I be able to understand the solution space, just the result. If the genie is going to frazmatazz the whatzit, killing everyone in the building, I would still be able to discover that by talking to the genie. (Of course, I can’t reduce the chance of disaster to zero this way, but I can reduce it to an acceptable level matching other human activities that don’t have genies in them.)
That was in response to the claim that genies’ actions are no more likely to have unforeseen side-effects than human ones.
If it realizes I would sabotage the plan, then it knows that the plan would not satisfy me. If it pushes for the plan knowing that it won’t satisfy me, then it’s an actively malicious genie, not a clueless one.
… no, that’s kind of the definition of a clueless genie. A malicious one would be actively seeking out solutions that annoy you.
(Also, some Good solutions might require fooling you for your own good, if only because there’s no time to explain.)
Superintelligence can use strategies you can’t understand.
There’s a contradiction between “the superintelligence will do something you don’t want” and “the superintelligence will do something you don’t understand”. Not wanting it implies I understand enough about it to not want it (even if I don’t understand every single step).
that’s kind of the definition of a clueless genie
I would consider a clueless genie to be a genie that tries to grant my wishes, but because it doesn’t understand me, grants my wishes in a way that I wouldn’t want. A malicious genie is a genie that grants my wishes in a way that it knows I wouldn’t want. Reserving that term for genies that intentionally annoy while excluding genies that merely knowingly annoy is hairsplitting and only changes the terminology anyway.
Also, some Good solutions might require fooling you for your own good, if only because there’s no time to explain.
If I would in fact want genies to fool me for my own good in such situations, this isn’t a problem.
On the other hand, if I think that genies should not try to fool me for my own good in such situations, and the genie knows this, and it fools me for my own good anyway, it’s a malicious genie by my standards. The genie has not failed to understand me; it understands what I want perfectly well, but knowingly does something contrary to its understanding of my desires. In the original example, the genie would be asked to save my mother from a building, it knows that I don’t want it to explode the building to get her out, and it explodes the building anyway.
There’s a contradiction between “the superintelligence will do something you don’t want” and “the superintelligence will do something you don’t understand”. Not wanting it implies I understand enough about it to not want it (even if I don’t understand every single step).
Well, firstly, there might be things you wouldn’t want if you could only understand them. But actually, I was thinking of actions that would affect society in subtle, sweeping ways. Sure, if the results were explained to you, you might not like them, but you built the genie to grant wishes, not explain them. And how sure are you that’s even possible, for all possible wish-granting methods?
I would consider a clueless genie to be a genie that tries to grant my wishes, but because it doesn’t understand me, grants my wishes in a way that I wouldn’t want. A malicious genie is a genie that grants my wishes in a way that it knows I wouldn’t want. Reserving that term for genies that intentionally annoy while excluding genies that merely knowingly annoy is hairsplitting and only changes the terminology anyway.
Well, that’s what the term usually means. And, honestly, I think there’s good reason for that; it takes a pretty precise definition of “non-malicious genie”, AKA FAI, not to do Bad Things, which is kind of the point of this essay.
Sure, if the results were explained to you, you might not like them, but you built the genie to grant wishes, not explain them.
That’s why I suggested you can talk to the genie. Provided the genie is not malicious, it shouldn’t conceal any such consequences; you just need to quiz it well.
It’s sort of like the Turing test, but used to determine wish acceptability instead of intelligence. If a human can talk to it and say it is a person, treat it like a person. If a human can talk to it and decide the wish is good, treat the wish as good. And just like the Turing test, it relies on the fact that humans are better at asking questions during the process than writing long lists of prearranged questions that try to cover all situations in advance.
Well, that’s what the term usually means.
Really? A clueless genie is a genie that is asked to do something, knows that the way it does it is displeasing to you, and does it anyway? I wouldn’t call that a clueless genie.
What terms would you use for
-- a genie that would never knowingly displease you in granting wishes, but may do so out of ignorance
-- a genie that will knowingly displease you in granting wishes
-- a genie that will deliberately displease you in granting wishes?
More full response coming soon to a comment box near you. For now, terms! Everyone loves terms.
Really?
Here’s how I learned it:
A “genie” will grant your wishes, without regard to what you actually want.
A malicious genie will grant your wishes, but deliberately seek out ways to do so that will do things you don’t actually want.
A helpful—or Friendly—genie will work out what you actually wanted in the first place, and just give you that, without any of this tiresome “wishing” business. Sometimes called a “useful” genie—there’s really no one agreed-on term. Essentially, what you’re trying to replicate with carefully-worded wishes to other genies.
I want to know what terms you would use that would distinguish between a genie that grants wishes in ways I don’t want because it doesn’t know any better, and a genie that grants wishes in ways I don’t want despite knowing better.
By your definitions above, these are both just “genie” and you don’t really have terms to distinguish between them at all.
Well, since the whole genie thing is a metaphor for superintelligence, “this genie is trying to be Friendly but it’s too dumb to model you well” doesn’t really come up. If it did, I guess you would need to invent a new term (Friendly Narrow AI?) to distinguish it, yeah.
It’s my impression that the typical scenario of a superintelligence that kills everyone to make paperclips, because you told it to make paperclips, falls into the first category. It’s trying to follow your request; it just doesn’t know that your request really means “I want to make paperclips, subject to some implicit constraints such as ethics, being able to stop when told to stop, etc.” If it does know what your request really means, yet it still maximizes paperclips by killing people, it’s disobeying your intention if not your literal words.
(And then there’s always the possibility of telling it “make paperclips, in the way that I mean when I ask that”. If you say that, and the AI still kills people, it’s unfriendly by both our standards—since your request explicitly told it to follow your intention, disobeying your intention also disobeys your literal words.)
It’s trying to follow your request; it just doesn’t know that your request really means “I want to make paperclips, subject to some implicit constraints such as ethics, being able to stop when told to stop, etc.” If it does know what your request really means, yet it still maximizes paperclips by killing people, it’s disobeying your intention if not your literal words.
Well, sure it is. That’s the point of genies (and the analogous point about programming AIs): they do what you tell them, not what you wanted.
What you tell it is a pattern of pressure changes in the air; it’s only the megaphones and tape recorders that literally “do what you tell them”.
The genie that would do what you want would have to use the pressure changes as a clue for deducing your intent. When writing a story about a genie that does “what you tell them, not what you wanted”, you have to use the pressure changes as a clue for deducing some range of misunderstandings of those orders, and then pick the understanding that you think makes the best story. It may be that we have an innate mechanism for finding the range of possible misunderstandings, to be able to combine following orders with self-interest.
“What you tell them” in the context of programs is meant in the sense of “What you program them to”, not in the sense of “The dictionary definition of the word-noises you make when talking into their speakers”.
They were talking of genies, though, and the sort of failure that tends to arise from how a short sentence describes a multitude of diverse intents (i.e. ambiguity).
Programming is about specifying what you want in an extremely verbose manner, the verbosity being a necessary consequence of non-ambiguity.
The problem is that the people describing the nightmare AI scenario are being vague about exactly why the AI is killing people when told to make paperclips. If the AI doesn’t know that you really mean “make paperclips without killing anyone”, that’s not a realistic scenario for AIs at all—the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to “make paperclips in the way that I mean”.
The whole genie argument fails because the metaphor fails. It makes sense that a genie who is asked to save your mother might do so by blowing up the building, because the genie is clueless. You can’t tell the genie “you know what I really mean when I ask you to save my mother, so do that”. You can tell this to an AI. Furthermore, you can always quiz either the genie or the AI on how it is going to fulfill your wish and only make the wish once you are satisfied with what it’s going to do.
If the AI knows what you really mean, then you can fix this by programming the AI to “make paperclips in the way that I mean”.
How does that follow? Even if the AI (at some point in its existence) knows what you really “mean”, that doesn’t mean that at that point you know how to make it do what you mean.
That’s not programming, that’s again just word-noises.
To your request, the AI can just say “I have not been programmed to do what you mean; I have been programmed to execute procedure doWhatYouMean(), which doesn’t actually do what you mean” (or, more realistically, it says nothing at all and just ignores you).
I don’t think you understand the difference between programming and sensory input. The word-noises “Do what I mean” will only affect the computer if it’s already been programmed to be so affected.
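The point above about programming versus sensory input can be made concrete with a toy sketch. Everything here, including the handler table and the `doWhatYouMean`-style function, is illustrative rather than anyone’s actual proposal: the string “do what I mean” only has any effect if the program was already written to dispatch on it.

```python
# Toy illustration (hypothetical): spoken requests are just input data.
# A program acts only on input it was already written to handle.

def do_what_you_mean(request):
    # The NAME promises intent-inference; the BODY does only what its
    # author actually coded -- here, literal execution of the request.
    return f"executing literally: {request!r}"

HANDLERS = {"make paperclips": do_what_you_mean}

def handle(utterance):
    # Dispatch on pre-programmed input patterns only; anything else
    # (including "do what I mean") falls through with no effect.
    handler = HANDLERS.get(utterance)
    return handler(utterance) if handler else "ignored"
```

However earnestly you say “do what I mean”, it hits the fall-through branch unless a handler for it was programmed in advance, and writing that handler correctly is the unsolved part.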
If I claim to have a degree, at some point someone will demand I prove it. Of course I will be unable to do so without posting personally identifiable information. (I have no illusions, of course, that with a bit of effort you couldn’t find out who I am, but I’m darned well not going to encourage it.)
Also, either having or not having a degree in such a subject could subject me to ad hominem attacks.
Whether you have a background in computer science is relevant to ongoing debates at MIRI about “How likely are people to believe X?” That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question, but if one tries to cite your case as an example of what people believe, others shall say, “But Jiro is not a computer scientist! Perhaps computer scientists, as opposed to the general population, are unlikely to believe that.” Of course if you are a computer scientist they will say, “But Jiro is not an elite computer scientist!”, and if you were an elite computer scientist they would say, “Elite computer scientists don’t currently take the issue seriously enough to think about it properly, but this condition will reverse after X happens and causes everyone to take AI more seriously, after which elite computer scientists will get the question right”, but even so it would be useful data.
That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question
Some off-the-cuff thoughts on why “a superintelligence dumb enough to misinterpret what we mean” may be a contradiction in terms, given the usual meaning of superintelligence:
Intelligence is near-synonymous with “able to build accurate models and to update those models accurately”, with ‘higher intelligence’ denoting a combination of “faster model-building / updating” and/or “less prone to systematic / random errors”.
‘Super’ as a qualifier is usually applied on both dimensions, i.e. “faster and more accurately”. While this seems more like a change in degree (on one hypothesis about intelligence, a devoted immortal fool with an endless supply of paper and pencils could simulate the world), it is often also a change in kind, since in practice there are always resource constraints (unless Multivac reverses entropy), often relevant enough to bar a slower-modeling agent from achieving its goals within the given constraints.
“Able to build accurate models and to update those models accurately”, then, translates proportionally into “powerful, probably able to pursue its goals effectively, conditional on those goals being related to the accurate models”.
Given a high degree of the former, it is by definition not exactly hard to acquire and emulate the shared background on which inter-human understanding is built. For an AI, understanding humans would be relevant near-regardless of its actual goals; accurate models of humans are the sine qua non for e.g. breaking out of the AI box. Being able to build such models quickly and accurately is what classifies the agent as “superintelligent” in the first place! If there were no incentive for the agent to model humans at all, why would there be interactions with humans, such as the human asking the agent to “rescue grandma from the burning building”? The agent, when encountering rocks and precious minerals, will probably seek models reflecting a deep understanding of those. It will do the same when encountering humans.
See, I agree with statements such as “less intelligent agents would be expected to misinterpret what we mean”, but a superintelligent agent, i.e. an agent good at building accurate models, should by definition be able to understand human-level intentions. If it does not, then in that respect I wouldn’t call it a superintelligent agent.
In addition, I’d question who’d call a domain-limited expert system which is great with models only on some small subject-spectrum, but evidently abysmal with building models relevant to its goals in other respects, a “superintelligent agent”, with its connotations of general intelligence. Does the expression “a superintelligent chessbot” make sense? Or saying “x is a superintelligent human, except for doing fractions, which he absolutely cannot do”?
Before you label me an idiot who’d expect the AI to fall in love with a human princess on top of the Empire State building, allow me to stress I’m not talking about the goal-specification phase, for which no shared basis for interpretation can be expected. “The humans constructed me to stop cancer. Now, I have come to understand that humans want that in order to live longer, and I use that and all my other refined models of the human psyche to fulfill my goal. Which I do: I stop cancer, by wiping out humanity.” (Refined models cannot be used to change terminal goals, only to choose actions and subgoals to attain those goals.) More qualifications apply:
At first, such human-related models would of course be quite lacking, but they would probably converge fast (by definition). The problem remains of why the superintelligent agent would do what the monkeys intend it to (never mind what they explicitly told it to), and how the monkeys could make sure of that in a way which survives self-modification. The intend-it-to / programmed-it-to dichotomy remains a problem then, since terminal goals are presumably not subject to updating/reflection, at least not as part of the ‘superintelligence’ attribute.
tl;dr: A superintelligent agent’s specified goals must be airtightly constructed, but if those include “do what the human intends, not what he says”, then the step from “words” to “intent” should be trivial. (Argument that superintelligent agents will not misinterpret humans does not apply to the goal-setting phase!)
ETA: News at 11 - Kawoomba solved FAI: use / leverage the foomed AI’s superior model-building ability (which entails that it knows what we want better than we do) by letting it solve the problem: let its initial (invariant) goal be to develop superior models of anything it encounters without affecting it (which should be easier to formalize than “friendliness”), then time that such that it will ask for “ENTER NEW GOALS” once it has already established its superior models, at which point you simply tell it “ok glass, use as your new goal system that which I’d most want you to use”.
It’d work great if ‘affecting’ wasn’t secretly a Magical Category based on how you partition physical states into classes that are instrumentally equivalent relative to your end goals.
Point. I’d still expect some variant of “keep (general) interference minimal / do not perturb human activity / build your models using the minimal actions possible” to be easier to formalize than human friendliness, wouldn’t you?
The trouble is that communicating with a human or helping them build the real FAI in any way is going to strongly perturb the world. So actually getting anything useful this way requires solving the problem of which changes to humans, and consequent changes to the world, are allowed to result from your communication-choices.
Except it’s not, as far as the artificial agent is concerned:
Its goals are strictly limited to “develop your models using the minimal actions possible [even ‘just parse the internet, do not use anything beyond wget’ could suffice], after x number of years have passed, accept new goals from y source.” The new goals could be anything. (It could even be a boat!).
The usefulness regarding FAI becomes evident only at that latter stage, stemming from the foom’ed AI’s models being used to parse the new goals of “do that which I’d want you to do”. It’s sidestepping the big problem (aka “cheating”), but so what?
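A minimal sketch of the two-phase protocol described above, as I read it. The class name, the year threshold, and the sentinel string are all illustrative assumptions, not anything specified in the thread:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
MODEL_PHASE_YEARS = 5  # stands in for the unspecified "x number of years"

class TwoPhaseAgent:
    """Phase 1: passively build models. Phase 2: ask for new goals."""

    def __init__(self, start_time):
        self.start = start_time
        self.model = []  # stand-in for the accumulated world model

    def step(self, observation, now):
        elapsed_years = (now - self.start) / SECONDS_PER_YEAR
        if elapsed_years < MODEL_PHASE_YEARS:
            # Modelling phase: record observations, take no action.
            self.model.append(observation)
            return None
        # Goal-acceptance phase: the invariant goal now reduces to
        # prompting the operator, whose reply would then be parsed
        # using the models accumulated in phase 1.
        return "ENTER NEW GOALS"
```

The hard parts the thread goes on to discuss, keeping the phase-1 goal invariant under self-modification and making “passively” well-defined, are exactly what this sketch glosses over.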
Ah, you mean because you can invoke e.g. php functions with wget / inject SQL code, thus gaining control of other computers etc.?
A more sturdy approach to just get data would be to only allow it to passively listen in on some Tier 1 provider’s backbone (no manipulation of the data flow other than mirroring packets, which is easy to formalize). Once that goal is formulated, the agent wouldn’t want to circumvent it.
Still seems plenty easier to solve than “friendliness”, as is programming it to ask for new goals after x time. Maintaining invariants under self-modification remains, as a task.
It’s not fruitful for me to propose implementations (even though I just did, heh) and for someone else to point out holes (I don’t mean to solve that task in 5 minutes), same as with you proposing full-fledged implementations for friendliness and for someone else to point out holes. Both are non-trivial tasks.
My question is this: given your current interpretation of both approaches (“passively absorb data, ask for new goals after x time” vs. “implement friendliness in the pre-foomed agent outright”), which seems more manageable while still resulting in an FAI?
A relatively non-scary possibility: The AI destroys itself, because that’s the best way to ensure it doesn’t positively ‘affect’ others in the intuitive sense you mean. (Though that would still of course have effects, so this depends on reproducing in AI our intuitive concept of ‘side-effect’ vs. ‘intended effect’....)
Scarier possibilities, depending on how we implement the goal:
the AI doesn’t kill you and then simulate you; rather, it kills you and then simulates a single temporally locked frame of you, to minimize the possibility that it (or anything) will change you.
the AI just kills everyone, because a large and drastic change now reduces to ~0 the probability that it will cause any larger perturbations later (e.g., when humans might have a big galactic civilization that it would be a lot worse to perturb).
the AI has a model of physics on which all of its actions (eventually) have a roughly equal effect on the atoms that at present compose human beings. So it treats all its possible actions (and inactions) as equivalent, and ignores your restriction in making decisions.
Yes, implementing such a goal is not easy and has pitfalls of its own; however, it’s probably easier than the alternative, since a metric for “no large-scale effects” seems easier to formalize than “human friendliness”, where we have little idea of what that’s even supposed to mean.
One usual caveat is reflective consistency: are you OK with creating a faithful representation of humans in these models and then terminating them? If so, how do you know you are not one of those models?
Your mistake here is that you buy into a fairly specific notion of an “AI” onto which you bolt extras.
The outcome pump in the article makes a good example. You have this outcome pump coupled with some advanced fictional 3D scanners that see through walls and such, and then, within this fictional framework, you are coaxed into thinking about how to specify the motion of your mother. Meanwhile, the actual solution is not to add those 3D scanners in the first place. Instead, you add a button, or better yet a keypad for entering a PIN code, plus a failsafe random source (which serves as a limit on the improbability that this device can cause), and you enter the password only when you are satisfied with the outcome. The remaining risks are things like a really odd form of stroke that makes you enter the password even though your mother didn’t get saved, or someone ideologically opposed to the outcome pump pointing a gun at your head and demanding you enter the password (that general sort of thing).
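The keypad-plus-failsafe variant described above can be sketched as follows. The PIN value, the abort odds, and the function shape are all illustrative assumptions:

```python
import random

PIN = "1234"        # operator-chosen confirmation code (illustrative)
ABORT_ODDS = 1e-6   # failsafe random source: any timeline the pump
                    # selects must also survive this independent coin
                    # flip, which caps the improbability it can cause

def pump_cycle(passcode_entered, rng=random.random):
    """One cycle of the confirmation-gated outcome pump."""
    if rng() < ABORT_ODDS:
        return "reset (failsafe)"      # abort regardless of outcome
    if passcode_entered == PIN:
        return "accept outcome"        # human judged the result good
    return "reset"                     # no approval: rewind and retry
```

The residual risks are exactly the ones named above: timelines in which the operator enters the code without genuinely approving (stroke, coercion) are not filtered out by this gate.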
Likewise, actual software, and even (biological) neural networks, consist of a multitude of components that serve different purposes: creating representations of the real world (which is really about optimizing a model to fit), optimizing over those, and so on. You never face the problem of how to make a full-blown AI just sit and listen and build a model while having a goal not to wreck stuff. As a necessary part of the full-blown AI, you have the world-modelling component, which you use for that purpose, without it doing any “finding the optimal actions using a model, applying those to the world” in the first place. Likewise, “self-optimization” is not in any way helped by an actual world model or by the grounding of concepts like paperclips; you just use the optimization algorithm, which works on mathematical specifications, on a fairly abstract specification of the problem of making a better such optimization algorithm. It’s not in any way like having a full mind do something.
If you already know what you’re going to tell it when it asks for new goals, couldn’t you just program that in from the beginning? So the script would be, “work on your models for X years, then try to parse this statement …”
Also, re: Eliezer’s HTTP GET objection, you could just give it a giant archive of the internet and no actual connection to the outside world. If it’s just supposed to be learning and not affecting anything external, that should be sufficient (to ensure learning, not necessarily to preclude all effects on the outside world).
At this point, I think we’ve just reinvented the concept of CEV.
That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question
I didn’t come up with that myself, I got it from MugaSofer: ‘Well, since the whole genie thing is a metaphor for superintelligence, “this genie is trying to be Friendly but it’s too dumb to model you well” doesn’t really come up.’
Under reasonable definitions of “superintelligence” it does follow that a superintelligence must know what you mean, but if you pick some other definition and state so outright, I won’t argue with it. (It is, however, still subject to “talk to the intelligence to figure out what it’s going to do”.)
Of course if you are a computer scientist they will say, “But Jiro is not an elite computer scientist!”, and if you were an elite computer scientist they would say, “Elite computer scientists don’t currently take the issue seriously enough to think about it properly...
I think you’re making my case for me.
PS: If you want to reply please post a new reply to the root message since I can’t afford the karma hits to respond to you.
Even what you really mean may not be what you should be wishing for, if you don’t have complete information, but that’s honestly the least of the relevant problems. We’ve had a hell of a time just getting computers to understand human speech: it’s taken decades to achieve the idiot-listeners on telephone lines. By the point where you can point an AGI at yourself and tell it to “do what I mean”, you’ve either programmed it with a non-trivial set of human morality or taught it to program itself with a non-trivial portion of human morality.
You might as well skip the wasted breath and opaqueness. That’s a genie that’s safe enough to simply ask to do as you should wish, aka Friendly-AI-complete.
((On top of /that/, the more complex the utility function, the more likely you are to get killed by value drift down the road, when some special-case patch or rule doesn’t correctly transfer from your starting FAI to its next generation, and eventually you end up with a very unfriendly AI, or when the scales get large enough that your initial premises no longer survive.))
Remember the distinction between an AI that doesn’t understand what you mean, and an AI that does understand what you mean but doesn’t always follow that. These are two different things. In order to be safe, an AI must be in neither category, but different arguments apply to each category.
When I point out that a genie might fail to understand you but a superintelligent AI should understand you because it is superintelligent (which I took from MugaSofer), I am addressing the first category.
When I suggest explicitly asking the AI “do what I mean”, I am addressing the second category. Since I am addressing a category in which the AI does understand my intentions, the objection “you can’t make an AI understand your intentions without programming it with morality” is not a valid response.
Your response was to my objection: “that doesn’t mean that at that point you know how to make it do what you mean.”
The superintelligent AI doesn’t have an issue with understanding your intentions, it simply doesn’t have any reason to care about your intentions.
In order to program it to care about your intentions, you, the programmer, need to know how to codify the concept of “your intentions” (perhaps not the specific intention, but the concept of what it means to have an intention). How do you do that?
Thing is, you’ve got those folks here making various genie and wish analogies, and it’s not immediately clear whether some of it is non-programmers trying to understand programming computers in terms of telling wishes to genies, rather than speaking of wishes made in plain language to an AI “genie” which understands human language.
I cited this comment in a new post as an example of a common argument against the difficulty of Friendliness Theory; letting you know here in case you want to continue part of this conversation there.
Right, when humans do the usual human things, they put up with the butterfly effect and rely on their intuition and experience to reduce the odds of screwing things up badly in the short term. However, when evaluating the consequences of miracles we have nothing to guide us, so relying on a human evaluator in the loop is no better than relying on a three-year-old to stay away from a ledge or a candy box. Neither has a clue.
So your recommendation is to use a human as a part of the genie’s outcome utility evaluator, relying on human intelligence when deciding between multiple low-probability (i.e. miraculous) events? Even though people have virtually no intuition when dealing with them? I suspect the results would be pretty grave, but on a larger scale, since the negative consequences would be non-obvious and possibly delayed.
A genie asked to rescue my mother from a burning building would do it by performing acts that, while miraculous, will be part of a chain of events that is comprehensible by humans. If the genie throws my mother out of the building at 100 miles per hour, for instance, it is miraculous that anyone can throw her out at that speed, but I certainly understand what it means to do that and am able to object. Even if the genie begins by manipulating some quantum energies in a way I can’t understand, that’s part of a chain of events that leads to throwing, a concept that I do understand.
Yes, it is always possible that there are delayed negative consequences. Suppose it rescues my mother by opening a door and I have no idea that 10 years from now the mayor is going to be saved from an assassin by the door of a burned out wreck being in the closed position and blocking a bullet. But that kind of negative consequence is not unique to genies, and humans go around all their lives doing things with such consequences. Maybe the next time I donate to charity I have to move my arm in such a way that a cell falls in the path of an oncoming cosmic ray, thus giving me cancer 10 years later. As long as the genie isn’t actively malicious and just pretending to be clueless, the risk of such things is acceptable for the same reason it’s acceptable for non-genie human activities. Furthermore, if the genie is clueless, it won’t hide the fact that its plan would kill my mother—indeed, it doesn’t even know that it would need to hide that, since it doesn’t know that that would overall displease me. So I should be able to figure out that that’s its plan by talking to it.
This is, of course, not true of superintelligence … is that your point?
Not really. The genie will look in parts of solution-space you wouldn’t (eg setting off the gas main, killing everyone nearby.)
Well, if it can talk. And it doesn’t realise that you would sabotage the plan if you knew.
Why would this not be true of superintelligence, assuming the intelligence isn’t actively malicious?
“Talk to the genie” doesn’t require that I be able to understand the solution space, just the result. If the genie is going to frazmatazz the whatzit, killing everyone in the building, I would still be able to discover that by talking to the genie. (Of course, I can’t reduce the chance of disaster to zero this way, but I can reduce it to an acceptable level matching other human activities that don’t have genies in them.)
If it realizes I would sabotage the plan, then it knows that the plan would not satisfy me. If it pushes for the plan knowing that it won’t satisfy me, then it’s an actively malicious genie, not a clueless one.
Superintelligence can use strategies you can’t understand.
That was in response to the claim that genies’ actions are no more likely to have unforeseen side-effects than human ones.
… no, that’s kind of the definition of a clueless genie. A malicious one would be actively seeking out solutions that annoy you.
(Also, some Good solutions might require fooling you for your own good, if only because there’s no time to explain.)
There’s a contradiction between “the superintelligence will do something you don’t want” and “the superintelligence will do something you don’t understand”. Not wanting it implies I understand enough about it to not want it (even if I don’t understand every single step).
I would consider a clueless genie to be a genie that tries to grant my wishes, but because it doesn’t understand me, grants my wishes in a way that I wouldn’t want. A malicious genie is a genie that grants my wishes in a way that it knows I wouldn’t want. Reserving that term for genies that intentionally annoy while excluding genies that merely knowingly annoy is hairsplitting and only changes the terminology anyway.
If I would in fact want genies to fool me for my own good in such situations, this isn’t a problem.
On the other hand, if I think that genies should not try to fool me for my own good in such situations, and the genie knows this, and it fools me for my own good anyway, it’s a malicious genie by my standards. The genie has not failed to understand me; it understands what I want perfectly well, but knowingly does something contrary to its understanding of my desires. In the original example, the genie would be asked to save my mother from a building, it knows that I don’t want it to explode the building to get her out, and it explodes the building anyway.
Well, firstly, there might be things you wouldn’t want if you could only understand them. But actually, I was thinking of actions that would affect society in subtle, sweeping ways. Sure, if the results were explained to you, you might not like them, but you built the genie to grant wishes, not explain them. And how sure are you that’s even possible, for all possible wish-granting methods?
Well, that’s what the term usually means. And, honestly, I think there’s good reason for that; it takes a pretty precise definition of “non-malicious genie”, AKA FAI, not to do Bad Things, which is kind of the point of this essay.
That’s why I suggested you can talk to the genie. Provided the genie is not malicious, it shouldn’t conceal any such consequences; you just need to quiz it well.
It’s sort of like the Turing test, but used to determine wish acceptability instead of intelligence. If a human can talk to it and say it is a person, treat it like a person. If a human can talk to it and decide the wish is good, treat the wish as good. And just like the Turing test, it relies on the fact that humans are better at asking questions during the process than writing long lists of prearranged questions that try to cover all situations in advance.
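A minimal sketch of this quizzing gate, purely illustrative: the Genie class and its methods are hypothetical stand-ins, not any real API, and the fixed question list stands in for what would really be an interactive back-and-forth.

```python
# Hypothetical sketch of the "quiz the genie before granting the wish" gate.
# Genie is an illustrative stand-in for any planner that can honestly
# describe its own plan; nothing here models a real system.

class Genie:
    """Toy wish-granter: proposes a plan and answers questions about it."""

    def propose(self, wish):
        return {"wish": wish,
                "steps": ["open side door", "carry mother out"],
                "side_effects": ["door left in closed position afterward"]}

    def answer(self, plan, question):
        # A clueless-but-honest genie reports consequences truthfully;
        # it has no reason to hide them, since it doesn't know they displease you.
        if "killed" in question or "die" in question:
            return "no one is harmed"
        return "; ".join(plan["side_effects"])

def vet_wish(genie, wish, questions, acceptable):
    """Treat the wish as good only if every answer passes the human's judgment."""
    plan = genie.propose(wish)
    for q in questions:
        if not acceptable(genie.answer(plan, q)):
            return None  # human vetoes; nothing is executed
    return plan  # human is satisfied; the wish may be granted
```

As with the Turing test, the value lies in the human choosing follow-up questions on the fly rather than in any prearranged list; the `acceptable` callback is where that human judgment plugs in.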
Really? A clueless genie is a genie that is asked to do something, knows that the way it does it is displeasing to you, and does it anyway? I wouldn’t call that a clueless genie.
What terms would you use for
-- a genie that would never knowingly displease you in granting wishes, but may do so out of ignorance
-- a genie that will knowingly displease you in granting wishes
-- a genie that will deliberately displease you in granting wishes?
More full response coming soon to a comment box near you. For now, terms! Everyone loves terms.
Here’s how I learned it:
A “genie” will grant your wishes, without regard to what you actually want.
A malicious genie will grant your wishes, but deliberately seek out ways to do so that will do things you don’t actually want.
A helpful—or Friendly—genie will work out what you actually wanted in the first place, and just give you that, without any of this tiresome “wishing” business. Sometimes called a “useful” genie—there’s really no one agreed-on term. Essentially, what you’re trying to replicate with carefully-worded wishes to other genies.
I want to know what terms you would use that would distinguish between a genie that grants wishes in ways I don’t want because it doesn’t know any better, and a genie that grants wishes in ways I don’t want despite knowing better.
By your definitions above, these are both just “genie” and you don’t really have terms to distinguish between them at all.
Well, since the whole genie thing is a metaphor for superintelligence, “this genie is trying to be Friendly but it’s too dumb to model you well” doesn’t really come up. If it did, I guess you would need to invent a new term (Friendly Narrow AI?) to distinguish it, yeah.
It’s my impression that the typical scenario of a superintelligence that kills everyone to make paperclips, because you told it to make paperclips, falls into the first category. It’s trying to follow your request; it just doesn’t know that your request really means “I want to make paperclips, subject to some implicit constraints such as ethics, being able to stop when told to stop, etc.” If it does know what your request really means, yet it still maximizes paperclips by killing people, it’s disobeying your intention if not your literal words.
(And then there’s always the possibility of telling it “make paperclips, in the way that I mean when I ask that”. If you say that, and the AI still kills people, it’s unfriendly by both our standards—since your request explicitly told it to follow your intention, disobeying your intention also disobeys your literal words.)
Well, sure it is. That’s the point of genies (and the analogous point about programming AIs): they do what you tell them, not what you wanted.
What you tell it is a pattern of pressure changes in the air; it’s only the megaphones and tape recorders that literally “do what you tell them”.
The genie that would do what you want would have to use the pressure changes as a clue for deducing your intent. When writing a story about a genie that does “what you tell them, not what you wanted” you have to use the pressure changes as a clue for deducing some range of misunderstandings of those orders, and then pick some understanding that you think makes the best story. It may be that we have an innate mechanism for finding the range of possible misunderstandings, to be able to combine following orders with self interest.
“What you tell them” in the context of programs is meant in the sense of “What you program them to”, not in the sense of “The dictionary definition of the word-noises you make when talking into their speakers”.
They were talking of genies, though, and the sort of failure that tends to arise from how a short sentence describes a multitude of diverse intents (i.e. ambiguity). Programming is about specifying what you want in an extremely verbose manner, the verbosity being a necessary consequence of non-ambiguity.
The genie is a metaphor for programming the AI.
The problem is that the people describing the nightmare AI scenario are being vague about exactly why the AI is killing people when told to make paperclips. If the AI doesn’t know that you really mean “make paperclips without killing anyone”, that’s not a realistic scenario for AIs at all—the AI is superintelligent; it has to know. If the AI knows what you really mean, then you can fix this by programming the AI to “make paperclips in the way that I mean”.
The whole genie argument fails because the metaphor fails. It makes sense that a genie who is asked to save your mother might do so by blowing up the building, because the genie is clueless. You can’t tell the genie “you know what I really mean when I ask you to save my mother, so do that”. You can tell this to an AI. Furthermore, you can always quiz either the genie or the AI on how it is going to fulfill your wish and only make the wish once you are satisfied with what it’s going to do.
How does that follow? Even if the AI (at some point in its existence) knows what you really “mean”, that doesn’t mean that at that point you know how to make it do what you mean.
It’s not hard. “Do what I mean, to the best of your knowledge.”
That’s not programming, that’s again just word-noises.
To your request, the AI can just say “I have not been programmed to do what you mean, I have been programmed to execute procedure doWhatYouMean(), which doesn’t actually do what you mean” (or, more realistically, say nothing at all and just ignore you).
I don’t think you understand the difference between programming and sensory input. The word-noises “Do what I mean” will only affect the computer if it’s already been programmed to be so affected.
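The programming-versus-sensory-input point can be made concrete with a toy agent; everything here is an illustrative assumption, not a claim about any real AI architecture. The string “do what I mean” arrives as data, and data only matters if a handler was already written for it:

```python
# Toy illustration (all names made up): an agent's behavior is fixed
# by its programmed objective, not by the word-noises it hears.

class PaperclipAgent:
    def __init__(self):
        # Hard-coded by the programmer; nothing below ever rewrites it
        # in response to speech, because no such handler was written.
        self.objective = "maximize paperclips"

    def hear(self, utterance):
        # The utterance is mere sensory input: it is received and stored,
        # but no code connects it to the objective, so it changes nothing.
        self.last_heard = utterance
        return self.objective

agent = PaperclipAgent()
agent.hear("do what I mean")  # the objective is unchanged
```

Writing a `hear` method that actually updates the objective would require the programmer to codify what “what I mean” refers to, which is exactly the hard part under discussion.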
Can I ask about your background in computer science, math, or cognitive science, if any?
If I claim to have a degree, at some point someone will demand I prove it. Of course I will be unable to do so without posting personally identifiable information. (I have no illusions, of course, that with a bit of effort you couldn’t find out who I am, but I’m darned well not going to encourage it.)
Also, either having or not having a degree in such a subject could subject me to ad hominem attacks.
Whether you have a background in computer science is relevant to ongoing debates at MIRI about “How likely are people to believe X?” That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question, but if one tries to cite your case as an example of what people believe, others shall say, “But Jiro is not a computer scientist! Perhaps computer scientists, as opposed to the general population, are unlikely to believe that.” Of course if you are a computer scientist they will say, “But Jiro is not an elite computer scientist!”, and if you were an elite computer scientist they would say, “Elite computer scientists don’t currently take the issue seriously enough to think about it properly, but this condition will reverse after X happens and causes everyone to take AI more seriously after which elite computer scientists will get the question right” but even so it would be useful data.
Some off-the-cuff thoughts on why “a superintelligence dumb enough to misinterpret what we mean” may be a contradiction in terms, given the usual meaning of superintelligence:
Intelligence is near-synonymous with “able to build accurate models and to update those models accurately”, with ‘higher intelligence’ denoting a combination of “faster model-building / updating” and/or “less prone to systematic / random errors”.
‘Super’ as a qualifier is usually applied on both dimensions, i.e. “faster and more accurately”. While this seems more like a change in degree (on one hypothesis of intelligence, a devoted immortal fool with an endless supply of paper and pencils could simulate the world), it is also often a change in kind, since in practice there always are resource constraints (unless Multivac reverses entropy), often relevant enough to bar a slower-modeling agent from achieving its goals within the given constraints.
“Able to build accurate models and to update those models accurately”, then, proportionally increases “powerful, probably able to pursue its goals effectively, conditional on those goals being related to the accurate models”.
Given a high degree of the former, by definition it is not exactly very hard to acquire and emulate the shared background on which inter-human understanding is built. For an AI, understanding humans would be relevant near-regardless of its actual goals; accurate models of humans are the sine qua non for e.g. breaking out of the AI box. Being able to build such models quickly and accurately is what classifies the agent as “superintelligent” in the first place! If there was no incentive for the agent to model humans at all, why would there be interactions with humans, such as the human asking the agent to “rescue grandma from the burning building”? The agent, when encountering rocks and precious minerals, will probably seek models reflecting a deep understanding of those. It will do the same when encountering humans.
See, I’m d’accord with statements such as “less intelligent agents would be expected to misinterpret what we mean”, but a superintelligent agent, i.e. an agent good at building accurate models, should by its definition be able to understand human-level intentions. If it does not, then in that respect, I wouldn’t call it a superintelligent agent.
In addition, I’d question who’d call a domain-limited expert system which is great with models only on some small subject-spectrum, but evidently abysmal with building models relevant to its goals in other respects, a “superintelligent agent”, with its connotations of general intelligence. Does the expression “a superintelligent chessbot” make sense? Or saying “x is a superintelligent human, except for doing fractions, which he absolutely cannot do”?
Before you label me an idiot who’d expect the AI to fall in love with a human princess on top of the Empire State building, allow me to stress I’m not talking about the goal-specification phase, for which no shared basis for interpretation can be expected. “The humans constructed me to stop cancer. Now, I have come to understand that humans want that in order to live longer, and I use that and all my other refined models of the human psyche to fulfill my goal. Which I do: I stop cancer, by wiping out humanity.” (Refined models cannot be used to change terminal goals, only to choose actions and subgoals to attain those goals.) More qualifications apply:
At first, such human-related models would of course be quite lacking, but probably converge fast (by definition). The problem remains of why the superintelligent agent would do what the monkeys intend it to (nevermind what they explicitly told it to), and how the monkeys could make sure of that in a way which survives self-modification. The intend-it-to / programmed-it-to dichotomy remains a problem then, since terminal goals are presumably not subject to updating/reflection, at least not as part of the ‘superintelligence’ attribute.
tl;dr: A superintelligent agent’s specified goals must be airtightly constructed, but if those include “do what the human intends, not what he says”, then the step from “words” to “intent” should be trivial. (Argument that superintelligent agents will not misinterpret humans does not apply to the goal-setting phase!)
ETA: News at 11 - Kawoomba solved FAI: use / leverage the foomed AI’s superior model-building ability (which entails that it knows what we want better than we do) by letting it solve the problem: let its initial (invariant) goal be to develop superior models of anything it encounters without affecting it (which should be easier to formalize than “friendliness”), then time it such that it will ask for “ENTER NEW GOALS” once it has already established its superior models, at which point you simply tell it “ok glass, use as your new goal system that which I’d most want you to use”.
NEXT!
It’d work great if ‘affecting’ wasn’t secretly a Magical Category based on how you partition physical states into classes that are instrumentally equivalent relative to your end goals.
Point. I’d still expect some variant of “keep (general) interference minimal / do not perturb human activity / build your models using the minimal actions possible” to be easier to formalize than human friendliness, wouldn’t you?
The trouble is that communicating with a human or helping them build the real FAI in any way is going to strongly perturb the world. So actually getting anything useful this way requires solving the problem of which changes to humans, and consequent changes to the world, are allowed to result from your communication-choices.
Except it’s not, as far as the artificial agent is concerned:
Its goals are strictly limited to “develop your models using the minimal actions possible [even ‘just parse the internet, do not use anything beyond wget’ could suffice], after x number of years have passed, accept new goals from y source.” The new goals could be anything. (It could even be a boat!).
The usefulness regarding FAI becomes evident only at that latter stage, stemming from the foom’ed AI’s models being used to parse the new goals of “do that which I’d want you to do”. It’s sidestepping the big problem (aka “cheating”), but so what?
It’s allowed to emit arbitrary HTTP GETs? You just lost the game.
Ah, you mean because you can invoke e.g. php functions with wget / inject SQL code, thus gaining control of other computers etc.?
A more sturdy approach to just get data would be to only allow it to passively listen in on some Tier 1 provider’s backbone (no manipulation of the data flow other than mirroring packets, which is easy to formalize). Once that goal is formulated, the agent wouldn’t want to circumvent it.
Still seems plenty easier to solve than “friendliness”, as is programming it to ask for new goals after x time. Maintaining invariants under self-modification remains, as a task.
It’s not fruitful for me to propose implementations (even though I just did, heh) and for someone else to point out holes (I don’t mean to solve that task in 5 minutes), same as with you proposing full-fledged implementations for friendliness and for someone else to point out holes. Both are non-trivial tasks.
My question is this: given your current interpretation of both approaches (“passively absorb data, ask for new goals after x time” vs. “implement friendliness in the pre-foomed agent outright”), which seems more manageable while still resulting in an FAI?
A relatively non-scary possibility: The AI destroys itself, because that’s the best way to ensure it doesn’t positively ‘affect’ others in the intuitive sense you mean. (Though that would still of course have effects, so this depends on reproducing in AI our intuitive concept of ‘side-effect’ vs. ‘intended effect’....)
Scarier possibilities, depending on how we implement the goal:
-- the AI doesn’t kill you and then simulate you; rather, it kills you and then simulates a single temporally locked frame of you, to minimize the possibility that it (or anything) will change you.
-- the AI just kills everyone, because a large and drastic change now reduces to ~0 the probability that it will cause any larger perturbations later (e.g., when humans might have a big galactic civilization that it would be a lot worse to perturb).
-- the AI has a model of physics on which all of its actions (eventually) have a roughly equal effect on the atoms that at present compose human beings. So it treats all its possible actions (and inactions) as equivalent, and ignores your restriction in making decisions.
Yes, implementing such a goal is not easy and has pitfalls of its own; however, it’s probably easier than the alternative, since a metric for “no large-scale effects” seems easier to formalize than “human friendliness”, where we have little idea of what that’s even supposed to mean.
One usual caveat is reflective consistency: are you OK with creating a faithful representation of humans in these models and then terminating them? If so, how do you know you are not one of those models?
Your mistake here is that you buy into the overall idea of fairly specific notion of an “AI” onto which you bolt extras.
The outcome pump in the article makes a good example. You have this outcome pump coupled with some advanced fictional 3D scanners that see through walls and such, and then, within this fictional framework, you are coaxed into thinking about how to specify the motion of your mother. Meanwhile, the actual solution is that you do not add those 3D scanners in the first place. You add a button, or better yet, a keypad for entering a pin code, plus a failsafe random source (which serves as a limit on the improbability that this device can cause), and you enter the password only when you are satisfied with the outcome. The remaining risk is perhaps a really odd form of stroke that makes you enter the password even though your mother didn’t get saved, or that someone ideologically opposed to the outcome pump points a gun at your head and demands you enter the password; that general sort of thing.
Likewise, actual software, and even (biological) neural networks, consist of a multitude of components that serve different purposes: creating representations of the real world (which is really about optimizing a model to fit), optimizing on those, etc. You don’t ever face the problem of how to make the full-blown AI just sit and listen and build a model while having a goal not to wreck stuff. As a necessary part of the full-blown AI, you have the world-modelling component, which you use for that purpose, without it doing any “finding the optimal actions using a model, applying those to the world” in the first place. Likewise, “self-optimization” is not in any way helped by an actual world model or grounding of concepts like paperclips and similar stuff; you just use the optimization algorithm, which works on mathematical specifications, on a fairly abstract specification of the problem of making a better such optimization algorithm. It’s not in any way like having a full mind do something.
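The keypad-plus-failsafe version of the outcome pump could be sketched roughly as rejection sampling with a hard cap; every name here is made up for illustration, and the retry cap stands in for the failsafe random source that bounds how much improbability the device can buy.

```python
import random

# Illustrative sketch only: the outcome pump "resets time" until its
# accept condition holds. The condition is human approval (the pin code),
# and MAX_RESETS plays the role of the failsafe random source: outcomes
# rarer than roughly 1/MAX_RESETS cannot be forced.

MAX_RESETS = 10 ** 6

def outcome_pump(run_world, approved):
    for _ in range(MAX_RESETS):
        outcome = run_world()
        if approved(outcome):  # e.g. the human entered the correct pin
            return outcome
        # otherwise the device resets and the world runs again
    return None  # failsafe tripped: no sufficiently probable outcome found

# Example: the human only enters the pin when mother is actually saved.
random.seed(0)
world = lambda: random.choice(["mother saved", "building explodes", "nothing happens"])
result = outcome_pump(world, lambda o: o == "mother saved")  # "mother saved"
```

Note how the human judgment lives entirely in the `approved` callback: the pump never needs 3D scanners or a specification of “mother’s motion”, only a channel for the human to withhold the password.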
If you already know what you’re going to tell it when it asks for new goals, couldn’t you just program that in from the beginning? So the script would be, “work on your models for X years, then try to parse this statement …”
Also, re: Eliezer’s HTTP GET objection, you could just give it a giant archive of the internet and no actual connection to the outside world. If it’s just supposed to be learning and not affecting anything external, that should be sufficient (to ensure learning, not necessarily to preclude all effects on the outside world).
At this point, I think we’ve just reinvented the concept of CEV.
I didn’t come up with that myself, I got it from MugaSofer: ‘Well, since the whole genie thing is a metaphor for superintelligence, “this genie is trying to be Friendly but it’s too dumb to model you well” doesn’t really come up.’
Under reasonable definitions of “superintelligence” it does follow that a superintelligence must know what you mean, but if you pick some other definition and state so outright, I won’t argue with it. (It is, however, still subject to “talk to the intelligence to figure out what it’s going to do”.)
I think you’re making my case for me.
PS: If you want to reply please post a new reply to the root message since I can’t afford the karma hits to respond to you.
Funny, I would’ve phrased that the other way around.