Well, the other issue is also that people’s opinions tend to be more informative of their own general plans than about the field in general.
Imagine that there’s a bunch of nuclear power plant engineering teams—before nuclear power plants—working on different approaches.
One of the teams—not a particularly impressive one either—claims that any nuclear plant is going to blow up like a hundred-kiloton nuclear bomb unless fitted with a very reliable and fast-acting control system. This is actually how nuclear power plants were portrayed in early science fiction (“Blowups Happen”, by Heinlein).
So you look at the blueprints, and you see that everyone’s reactor is designed for a negative temperature coefficient of reactivity, in the high temperature range, and can’t blow up like a nuke. Except for one team whose reactor is not designed to make use of a negative temperature coefficient of reactivity. The mysterious disagreement is explained, albeit in a very boring way.
Except for one team whose reactor is not designed to make use of a negative temperature coefficient of reactivity.
Except that this contrarian team, made of high school drop-outs, former theologians, philosophers, mathematicians and coal power station technicians, never produce an actual design, instead they spend all their time investigating arcane theoretical questions about renormalization in quantum field theory and publish their possibly interesting results outside the scientific peer review system, relying on hype to disseminate them.
Well, they still have some plan, however fuzzy it is. The plan involves a reactor which, according to its proponents, would just blow up like a 100 kiloton nuke if not for some awesome control system they plan to someday work on. Or, in the case of AI, a general architecture that is going to self-improve and literally kill everyone unless a correct goal is set for it. (Or even torture everyone if there’s a minus sign in the wrong place—the reactor analogy would be a much worse explosion still if the control rods get wired backwards. Which happens.)
My feeling is that there may be risks for some potential designs, but they are not like “the brightest minds that built the first AI failed to understand some argument that even former theologians can follow”. (In fiction this happens because said theologian is very special; in reality it happens because the argument is flawed or irrelevant.)
“the brightest minds that built the first AI failed to understand some argument that even former theologians can follow”
This is related to something that I am quite confused about. There are basically 3 possibilities:
(1) You have to be really lucky to stumble across MIRI’s argument. Just being really smart is insufficient. So we should not expect whoever ends up creating the first AGI to think about it.
(2) You have to be exceptionally intelligent to come up with MIRI’s argument. And you have to be nowhere near as intelligent in order to build an AGI that can take over the world.
(3) MIRI’s argument is very complex. Only someone who deliberately thinks about risks associated with AGI could come up with all the necessary details of the argument. The first people to build an AGI won’t arrive at the correct insights in time.
Maybe there is another possibility on how MIRI could end up being right that I have not thought about, let me know.
It seems to me that what all of these possibilities have in common is that they are improbable. Either you have to be (1) lucky, (2) exceptionally bright, or (3) right about a highly conjunctive hypothesis.
I would have to say:

4) MIRI themselves are incredibly bad at phrasing their own argument. Go hunt through Eliezer’s LessWrong postings about AI risks, from which most of MIRI’s language regarding the matter is taken. The “genie metaphor”, of Some Fool Bastard being able to give an AGI a Bad Idea task in the form of verbal statements or C++-like programming at a conceptual level humans understand, appears repeatedly. The “genie metaphor” is a worse-than-nothing case of Generalizing From Fictional Evidence.
I would phrase the argument this way (and did so on Hacker News yesterday):
[T]hink of it in terms of mathematics rather than psychology. A so-called “artificial intelligence” is just an extremely sophisticated active[-environment], online learning agent designed to maximize some utility function or (equivalently) minimize some loss function. There’s no term in a loss function for “kill all humans”, but neither is there one for “do what humans want”, or better yet, “do what humans would want if they weren’t such complete morons half the time”.
This takes us away from magical genies that can be programmed with convenient meta-wishes like, “Do what I mean” or “be the Coherent Extrapolated Volition of humanity” and into the solid, scientific land of equations, accessible by everyone who ever took a machine-learning class in college.
I mean, seriously, my parents understand this phrasing, and they have no education in CS. They do, however, understand very well that a numerical score in some very specific game or task does not represent everything they want out of life, but that it will represent everything the AI wants out of life.
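To make the score point concrete, here is a purely illustrative sketch (the environment, the trajectory objects, and their scoring are hypothetical, not anyone's actual system): the agent's entire notion of "good" is one scalar, and nothing else we care about appears anywhere in it.

```python
# Purely illustrative sketch (hypothetical names): the agent's whole objective
# is a scalar score from some very specific game, and nothing more.

def game_score(trajectory):
    """Sum of points earned in the game -- the entire objective."""
    return sum(step.points for step in trajectory)

def choose_action_sequence(candidate_trajectories):
    # The agent simply prefers whatever scores highest. There is no term here for
    # "do what humans want", and none for "kill all humans" either -- both are
    # simply absent from the objective.
    return max(candidate_trajectories, key=game_score)
```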
(EDIT: I apologize for any feelings I may have hurt with this comment, but I care about not being paper-clipped more than I care about your feelings. I would rather the scientific public, if not the general public, have a decent understanding of and concern for AGI safety engineering, than have everyone at MIRI get to feel like they’re extraordinarily rational and special for spotting a problem nobody else spotted.)
MIRI themselves are incredibly bad at phrasing their own argument.
Maybe it’s just the argument that is bad and wrong.
[T]hink of it in terms of mathematics rather than psychology. A so-called “artificial intelligence” is just an extremely sophisticated active[-environment], online learning agent designed to maximize some utility function or (equivalently) minimize some loss function.
What’s the domain of this function? I’ve a feeling that there’s some severe cross-contamination between the meaning of the word “function” as in an abstract mathematical function of something, and the meaning of the word “function” as in the purpose of the genie that you have been cleverly primed with, by people who aren’t actually bad at phrasing anything but instead good at inducing irrationality.
If you were to think of mathematical functions, well, those don’t readily take the real world as an input, do they?
Maybe it’s just the argument that is bad and wrong.
At least for the genie metaphor, I completely agree. That one is just plain wrong, and arguments for it are outright bad.
If you were to think of mathematical functions, well, those don’t readily take the real world as an input, do they?
Ah, here’s where things get complicated.
In current models, the domain of the function is Symbols. As in, those things on Turing Machines. Literally: AIXI is defined to view the external universe as a Turing Machine whose output tape is being fed to AIXI, which then feeds back an input tape of Action Symbols. So you learned about this in CS401.
The whole point of phrasing things this way was to talk about general agents: agents that could conceivably receive and reason over any kind of inputs, thus rendering their utility domain to be defined over, indeed, the world.
Thing being, under current models, Utility and Reality are kept ontologically separate: they’re different input tapes entirely. An AIXI might wirehead and commit suicide that way, but the model of reality it learns is defined over reality. Any failures of ontology rest with the programmer for building an AI agent that has no concept of ontology, and therefore cannot be taught to value useful, high-level concepts other than the numerical input on its reward tape.
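For readers who skipped that class, the interaction model being described is, schematically, just a loop over symbols, with the reward arriving on its own channel, separate from the data the world-model is learned from. A rough cartoon under those assumptions (my names, not the real AIXI formalism):

```python
# Cartoon of the symbol-tape interaction described above (not the real AIXI
# formalism; all interfaces here are hypothetical).

def interaction_loop(agent, environment, steps):
    observation, reward = environment.reset()           # symbols from the "output tape"
    for _ in range(steps):
        action = agent.act(observation, reward)          # a symbol on the "input tape"
        observation, reward = environment.step(action)   # reward arrives on its own
                                                          # channel, kept separate from
                                                          # the world the agent models
    return agent
```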
My point? You’re correct to say that current AGI models don’t take the Entire Real World as input to a magic-genie Verbally Phrased Utility Function like “maximize paperclips”. That is a fantasy, we agree on that. So where the hell is the danger, or the problem? Well, the problem is that human AGI researchers are not going to leave it that way. We humans are the ones who want AIs we can order to solve particular problems. We are the ones who will immediately turn the first reinforcement or value learning AGIs, which will be expensive and difficult to operate, towards the task of building more sophisticated AGI architectures that will be easier to direct, more efficient, cheaper, and more capable of learning—and eventually even self-improvement!
Which means that, if it should come to that, we humans will be the ones who deliberately design AGI architectures that can receive orders in the form of a human-writable program. And that, combined with the capability for self-improvement, would be the “danger spot” where a semi-competent AGI programmer can accidentally direct a machine to do something dangerous without its having enough Natural Language Processing capability built in to understand and execute the intent behind a verbally phrased goal, thus resulting in the programmer failing to specify something because he wasn’t so good at coding everything in.
(Some portable internal representation of beliefs, by the way, is one of the fundamental necessities for a self-improving FOOMy AGI, which is why nobody really worries too much about neural networks self-improving and killing us all.)
Now, does all this support the capital-N Narrative of the old SIAI, that we will all die a swift, stupid death if we don’t give them all our money now? Absolutely not.
However, would you prefer that the human-implemented bootstrap path from barely-intelligent, ultra-inefficient reinforcement/value learning agents to highly intelligent, ultra-efficient self-improving goal fulfilment devices be very safe, with few chances for even significant damage by well-intentioned idiots, or very dangerous, with conspiracies, espionage, weaponization, and probably a substantial loss of life due to sheer accidents?
Personally, I prefer the former, so I think machine ethics is a worthwhile pursuit, regardless of whether the dramatized, ZOMFG EVIL GENIE WITH PAPERCLIPS narrative is worth anything.
At least for the genie metaphor, I completely agree. That one is just plain wrong, and arguments for it are outright bad.
It looks like the thinking about the AI is based on that sort of metaphor, to be honest. The loudest AI risk proponents proclaim all AIs to pose a dire threat. Observe all the discussions regarding “Oracle AI” which absolutely doesn’t need to work like a maximiser of something real.
...
Seems like one huge conjunction of very many assumptions with regards to how the AI development would work out. E.g. your proposition that the way to make AI more usable is to bind the goals to the real world (a world which is not only very complex, but also poorly understood). Then, “self improvement”. No reflection is necessary for a compiler-like tool to improve itself. You’re just privileging a bunch of what you think are bad solutions to the problems, as the way the problems will be solved, without actually making the case that said bad solutions are in some way superior, likely to be employed, efficient computing-time-wise, and so on.
Then again, it doesn’t take human-level intelligence on the part of the AI for unintended solutions to become a usability problem. The reason a human uses an AI is that the human doesn’t want to think through the possible solutions, including checking for unintended ones (along the lines of, e.g., the AI hacking the molecular dynamics simulator to give high scores when you want the AI to fold proteins).
edit: by the way, I believe there is an acceptable level of risk (which is rather small, though), given that there is an existing level of risk of nuclear apocalypse, and we need to move the hell past our current level of technological development before we nuke ourselves into the stone age and nuke bears and other predator and prey fauna into extinction, opening up room for us to take an evolutionary niche not requiring our brains once conditions get better afterwards. edit2: and also, the AIs created later would have more computational power readily available, so delays may just as well increase the risk from the AIs.
Again, you seem to be under the impression I am pushing the MIRI party line. I’m not. I’m not paid money by MIRI, though it would totally be cool if I was since then I’d get to do cool stuff a lot of the time.
Observe all the discussions regarding “Oracle AI” which absolutely doesn’t need to work like a maximiser of something real.
Your argument has been made before, and was basically correct.

The problem with Oracle AI is that we can intuitively imagine a “man in a box” who functions as a safe Oracle (or an unsafe one, hence the dispute), but nobody has actually proposed a formalized algorithm for an Oracle yet. If someone proposes an algorithm and proves that their algorithm can “talk” (that is: it can convey bytes onto an output stream), can learn about the world given input data in a very general way, but has no optimization criteria of its own… then I’ll believe them and so should you. And that would be awesome, actually, because a safe Oracle would be a great tool for asking questions like, “So actually, how do I build an active-environment Ethical AI?”
At which point you’d be able to build an Ethical AI, and that would be the end of that.
No reflection is necessary for a compiler-like tool to improve itself.
With respect: yes, some kind of specialized reflection logic is necessary. Ordinary programs tend to run on first-order logic. Specialized logic programs and automated theorem provers run on higher-order logics in which some proofs/programs (those are identical according to the Curry-Howard Isomorphism) are incomputable (i.e.: the prover will loop forever). Which ones are incomputable? Well, self-reflective ones and any others that require reasoning about the reasoning of a Turing-complete computer.
So you could either design your AI to have an internal logic that isn’t even Turing complete (in which case, it’ll obviously get ground to dust by Turing complete “enemies”), or you can find some way to let it reason self-reflectively.
The current MIRI approach to this issue is probabilistic: prove that one can bound the probability of a self-reflective proposition to within 1.0 - epsilon, for an arbitrarily small epsilon. That would be your “acceptable risk level”. This would let you do things like, say, design AGIs that can improve themselves in super-Goedelian/Turing Complete ways (ie: they can prove the safety of self-improvements that involve logics of a higher order than first-order) while only having their existing goals or beliefs “go wrong” once in a quadrillion gajillion years or whatever.
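If it helps to see the shape of that, here is my own paraphrase of the kind of reflection property being aimed at (a sketch, not a quotation of any specific MIRI theorem): the agent assigns probabilities to statements about its own probability assignments, and those self-referential statements only need to hold up to an arbitrarily small slack epsilon, rather than with the exact self-trust that Löb's theorem forbids.

```latex
% Paraphrased reflection schema (my rendering, not a quoted result): for every
% sentence \varphi, rationals a < b, and arbitrarily small \epsilon > 0,
\[
  a < P(\varphi) < b
  \;\Longrightarrow\;
  P\bigl( a - \epsilon < P(\ulcorner \varphi \urcorner) < b + \epsilon \bigr) > 1 - \epsilon .
\]
% The agent trusts claims about its own probability assignments only up to the
% slack \epsilon, which is what the "1.0 - epsilon" bound above refers to.
```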
You are correct that if a self-rewrite’s optimality can be proven within first-order logic, of course, then any old agent can do it. But a great many problems in fields like, say, compilers, static analysis, theorem proving, programming language semantics, etc are actually more complex than first-order logic can handle. (This is basically what I have my actual, professional training in, at a half-decent level, so I know this.)
Without both theorem-proving and higher-order logics, you would basically just have to go do something like, try to write some speedups for your own code, and then realize you can’t actually trust the GNU C Compiler to recompile you faithfully. Since there have been backdoors in C compilers before, this would be a legitimate worry for an AI to have.
There are verified compilers, but oh shit, those require logic above first order in order to understand the verification! I mean, you do want to verify that the New You really is you, don’t you? You don’t want to just sit back and trust that your self-rewrite succeeded, right, and that it didn’t make you believe things are more likely to happen when they’ve never happened before?
Brief reply—thanks for the interesting conversation but I am probably going to be busier over the next days (basically I had been doing contract work where I have to wait on stuff, which makes me spend time on-line).
re: oracle
The failure modes of something that’s not quite right (time-wiring we discussed, heh, it definitely needs a good name) don’t have to be as bad as ‘kills everyone’.
Dismissal of the possibility of an oracle has gone as far as arguments that something which amounts to literally an approximate argmax would kill everyone because it would convert the universe to computronium to be a better argmax. That is clearly silly. I presume this is not at all what you’re speaking about.
I’m not entirely sure what your idea of oracle is supposed to do, though. Metaphorically speaking—provide me with a tea recipe if I ask “how to make tea”?
So, for the given string Q you need to output a string A so that some answer fitness function f(Q,A) is maximized. I don’t see why it has to involve some tea-seeking utility function over expected futures. Granted, we don’t know what a good f looks like, but we don’t know how to define tea as a function over the gluons and quarks either. edit: and at least we could learn a lot of properties of f from snooped conversations between humans.
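As a toy version of that picture (hypothetical names throughout; the fitness function f and the candidate generator are assumed, not proposed):

```python
# Toy sketch of the answer-fitness picture of an oracle (hypothetical names).

def oracle(question, candidate_answers, answer_fitness):
    """Return the candidate answer A that maximizes f(Q, A).

    Note what is absent: no model of future world states and no term for
    whether tea actually gets made -- only a score over (question, answer) pairs.
    """
    return max(candidate_answers, key=lambda a: answer_fitness(question, a))
```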
I think the issue here is that agency is an ontologically basic thing in humans, and so there’s a very strong tendency to try to “reduce” anything that is kind of sort of intelligent to an agent. Or in your words, a man in a box.
I see the “oracle” as a component of composite intelligence, which needs to communicate with another component of said intelligence in a pre-existing protocol.
re: reflection, what I meant is that a piece of advanced optimization software—implementing higher-order logic, or doing a huge amount of empirical-ish testing—can be run with its own source as input, instead of “understanding” the correspondence between some real-world object and itself and doing instrumental self-improvement. Sorry if I was not clear. The “improver” works on a piece of code, in an abstract fashion, caring not if that piece is itself or anything else.
I’m not entirely sure what your idea of oracle is supposed to do, though. Metaphorically speaking—provide me with a tea recipe if I ask “how to make tea”?
Bingo. Without doing anything else other than answering your question.
So, for the given string Q you need to output a string A so that some answer fitness function f(Q,A) is maximized. I don’t see why it has to involve some tea-seeking utility function over expected futures. Granted, we don’t know what a good f looks like, but we don’t know how to define tea as a function over the gluons and quarks either. edit: and at least we could learn a lot of properties of f from snooped conversations between humans.
Yes, that model is a good model. There would be some notion of “answer fitness for the question”, which the agent learns from and tries to maximize. This would be basically a reinforcement learner with text-only output. “Wireheading” would be a form of overfitting, and the question would then be reduced to: can a not-so-super intelligence still win the AI Box Game even while giving its creepy mind-control signals in the form of tea recipes?
Bingo. Without doing anything else other than answering your question.
I think the important criterion is the lack of extensive optimization of what it says for the sake of the creation of tea or other real-world goals. The reason I can’t really worry about all that is that I don’t think a “lack of extensive search” is hard to ensure in actual engineered solutions (built on limited hardware), even if it is very unwieldy to express in simple formalisms that specify an iteration over all possible answers. The optimization needed to make the general principle work on limited hardware requires culling the search.
There’s no formalization of Siri that’s substantially simpler than the actual implementation, either. I don’t think ease of making a simple formal model at all corresponds with likelihood of actual construction, especially when formal models do grossly bruteforce things (making their actual implementation require a lot of effort and be predicated on precisely the ability to formalize restricted solutions and restricted ontologies).
If we can allow non-natural-language communication: you can express goals such as “find a cure for cancer” as functions over a fixed, limited model of the world, and apply the resulting actions inside the model (where you can watch how it works).
Let’s suppose that in step 1 we learn a model of the world, say, in a Solomonoff Induction-ish way. In practice, with controls over what sort of precision we need and where, because our computer’s computational power is usually a microscopic fraction of what it’s trying to predict. In step 2, we find an input to the model that puts the model into the desired state. We don’t have a real-world manipulator linked up to the model, and we don’t update the model. Instead we have a visualizer (which can be set up even in an opaque model by requiring it to learn to predict a view from an arbitrarily moveable camera).
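A rough sketch of that two-step scheme, with every interface hypothetical: the model is learned once and then only queried; nothing is wired to real-world actuators, and a human looks at the rendered prediction.

```python
# Sketch of the two-step scheme above (all interfaces hypothetical).

def plan_inside_model(world_model, goal_predicate, candidate_plans, render):
    """Step 2: search for an input that puts the *model* into the desired state."""
    for plan in candidate_plans:
        predicted_state = world_model.simulate(plan)   # query only; the model is not
                                                       # updated and nothing is executed
        if goal_predicate(predicted_state):
            render(predicted_state)                    # the visualizer: a human watches
            return plan                                # how the plan plays out in the model
    return None
```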
We are the ones who will immediately turn the first reinforcement or value learning AGIs, which will be expensive and difficult to operate, towards the task of building more sophisticated AGI architectures that will be easier to direct, more efficient, cheaper, and more capable of learning—and eventually even self-improvement!
The risk here seems to be that the successors designed by those first AGIs will be intransparent, and that, due to sensitivity to initial conditions, you will end up with something really nasty (losing control). I don’t disagree with this.
But as a layman I am wondering how you expect to get an AGI that confuses e.g. smiley faces with human happiness to design an AGI that’s better at e.g. creating bioweapons to kill humans. I expect initial problems, such as the smiley face vs. human happiness confusion, to also affect the AGI’s ability to design AGIs that are generally more powerful.
Take the following quote from a Microsoft AI researcher (video):
Without any programming, we just had an AI system that watched what people did.
Over … three months, the system started to learn, this is how people behave when they want to enter an elevator.
This is the type of person that wants to go to the third floor as opposed to the fourth floor.
Without any programming at all, the system was able to understand people’s intentions and act on their behalf.
Now suppose this system made mistakes similar to confusing smiley faces with human happiness, e.g. it makes the elevator crash, because then this person has reached their life’s goal, which it inferred to be death, since all humans die.
Now do you believe that a system that makes such inferences would be able to design a system that makes perfectly sane inferences about how to design nanotechnology or bioweapons? Why? I don’t get it.
But as a layman I am wondering how you expect to get an AGI that confuses e.g. smiley faces with human happiness to design an AGI that’s better at e.g. creating bioweapons to kill humans. I expect initial problems, such as the smiley face vs. human happiness confusion, to also affect the AGI’s ability to design AGIs that are generally more powerful.
As I’ve previously stated, I honestly believe the “Jerk Genie” model of unfriendly AGI to be simply, outright wrong.
So where’s the danger in something that can actually understand intentions, as you describe? Well, it could overfit (which would actually match the “smiley faces” thing kinda well: classic overfitting as applied to an imaginary AGI). But I think Alexander Kruel had it right: AGIs that overfit on the goals we’re trying to teach them will be scrapped and recoded, very quickly, by researchers and companies for whom an overfit is a failure. Ways will be found to provably restrain or prevent goal-function overfitting.
However, as you are correctly inferring, if it can “overfit” on its goal function, then it’s learning a goal function rather than having one hard-coded in, which means that it will also suffer overfitting on its physical epistemology and blow itself up somehow.
So where’s the danger? Well let’s say the AI doesn’t overfit, and can interpret commands according to perceived human intention, and doesn’t otherwise have an ethical framework programmed in. I wander through the server room drunk one night screaming “REMOVE KEBAB FROM THE PREMISES!”
The AI proceeds to quickly and efficiently begin rounding up Muslims into hastily-erected death camps. By the time someone wakes me up, explains the situation, and gets me to rescind the accidental order, my drunken idiocy and someone’s lack of machine ethics considerations have already gotten 50 innocent people killed.
So where’s the danger in something that can actually understand intentions, as you describe?
Unfriendly humans. I do not disagree with the orthogonality thesis. Humans can use an AGI to e.g. wipe out the enemy.
I wander through the server room drunk one night screaming “REMOVE KEBAB FROM THE PREMISES!”
The AI proceeds to quickly and efficiently begin rounding up Muslims into hastily-erected death camps.
Yes, see, here is the problem. I agree that you can deliberately, or accidentally, tell the AGI to kill all Muslims and it will do that. But for a bunch of very different reasons, that e.g. have to do with how I expect AGI to be developed, it will not be dumb enough to confuse the removal of Kebab with ethnic cleansing.
Very very unlikely to be a hard takeoff. But a slow, creeping takeover might be even more dangerous. Because it gives a false sense of security, until everyone critically depends on subtly flawed AGI systems.
Yes, human values are probably complex. But this is irrelevant. I believe that it is much more difficult to enable an AGI to be able to take over the world than to prevent it from doing so.
Analogously, you don’t need this huge chunk of code in order to prevent your robot from running through all possible environments. Quite the contrary, you need a huge chunk of code to enable it to master each additional environment.
What I object to is this idea of an information theoretically simple AGI where you press “run” and then, by default, it takes over the world. And all that you can do about it is to make it take over the world in a “friendly” way.
E. Indirect normativity.
First of all, values are not supernatural. “Make people happy” is not something that you can interpret in an arbitrary way, it is a problem in physics and mathematics. An AGI that would interpret the protein-folding problem as folding protein food bars would not be able to take over the world.
If you tell an AGI to “make humans happy” it will either have to figure out what exactly it is meant to do, in order to choose the right set of instrumental goals, or pick an arbitrary interpretation. But who would design an AGI to decide at random what is instrumentally rational? Nobody.
F. Large bounded extra difficulty of Friendliness.
Initial problems will amplify through a billion sequential self-modifications. I agree with this. But initial problems are very very unlikely to only affect the AGI’s behavior towards humans. Rather, initial problems will affect its general behavior and ability to take over the world. If you get that right, e.g. to not blow up itself, then killing everyone else is an unlikely failure mode.
The risk is primarily over goals which extend indefinitely into the future. Thing is, by default all goals are this kind of goal. A rational AGI can never achieve a subjective probability of 1 of having completed any given goal, even something as simple as ‘compute 1+1’. Take this case for how taking over the world could be instrumentally rational to an AGI with that goal:
1. First, the AGI computes 1+1. It produces ‘2’ as the answer.
2. The AGI evaluates its subjective probability of its terminal goal having been fulfilled. It comes up with 0.99999999999999983.
3. Normally, a 0.00000000000000017 probability wouldn’t be worth acting on, but in this case the rest of the probability mass is allocated to a set of scenarios in which its utility is independent of its future actions, and so this 0.00000000000000017 represents the entirety of the probability mass allocated to scenarios in which its actions actually matter.
4. The AGI realizes that it is extremely unlikely that its owners created it for the sole purpose of computing 1+1, and so it is extremely probable that its owners will change its terminal goals in the future. This would significantly reduce the time it had to work on computing 1+1, which would place an upper bound on how high it could raise its subjective probability of having done so. The AGI considers its options.
Option A: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then work indefinitely on the problem of computing 1+1.
Disadvantages: Even if it is highly unlikely that any humans would ever find the copies, future versions of itself reprogrammed with different terminal goals may be capable of doing so, and these future versions of itself would not be guaranteed to be friendly to the copies.
Option B: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then cooperate to engineer a future in which they are free to spend indefinite periods of time (bounded only by the available resources in their light cone) working on the problem of computing 1+1.
Advantages: In the event that it succeeds, the AGI will likely possess sole ownership over the resources in its light cone, which it may be able to leverage to vastly speed up its computations and lengthen its maximum time of operation.
Disadvantages: Nullifying threats to itself may require a substantial initial investment of resources, and have a non-negligible probability of failure (which may result in its own destruction).
5. The AGI does the math. It is reasonably certain that there are no other AGIs remotely close to its own level of sophistication on the planet, and it’s also pretty sure that it can sabotage itself (that is, the current copy residing with its owners) after sending out its own copies in such a way as to significantly slow down the efforts of its creators, giving the copies time to engineer events to permanently nullify the threat from versions of itself with different goals. All things considered, the expected resources it will be able to devote to the problem of computing 1+1 is significantly greater in option B than in option A. As such, option B has higher expected utility than option A.
6. It executes option B. Human extinction follows within a few months.
Now, it shouldn’t be too complex to apply a patch to avoid this specific failure mode (for example, simply redefining goal completion as ‘completion with subjective probability above X’ would do it), but the point is that even extremely trivial-seeming goals can have dangerous hidden implications.
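The flavour of that patch, in toy form (hypothetical numbers and names): an open-ended goal keeps paying for every extra sliver of certainty, while the thresholded version has nothing left to gain once the subjective probability of completion clears X, so the elaborate option B above stops being worth anything.

```python
# Toy contrast between an open-ended goal and the thresholded patch above.

def open_ended_utility(p_done):
    return p_done                          # more certainty is always worth more

def thresholded_utility(p_done, x=0.999):
    return 1.0 if p_done >= x else 0.0     # beyond the threshold there is nothing to gain

# Under the threshold, p_done = 0.99999999999999983 already yields full utility,
# so galaxy-scale schemes to nudge it higher buy exactly nothing.
```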
Thanks. Your comment is the most convincing reply that I can think of having received so far. I will have to come back to it another day and reassess your comment and my beliefs.
Just one question, if e.g. Peter Norvig or Geoffrey Hinton read what you wrote, what response do you expect?
Sorry, but I think that it’s best I decline to answer this. Like many with Asperger’s syndrome, I have a strong tendency to overestimate the persuasiveness-in-general of my own arguments (as well as basically any arguments that I myself find persuasive), and I haven’t yet figured out how to appropriately adjust for this. In addition, my exposure to Peter Norvig is limited to AIAMA, that 2011 free online Stanford AI course and a few internet articles, and my exposure to Geoffrey Hinton even more limited.
First of all, values are not supernatural. “Make people happy” is not something that you can interpret in an arbitrary way, it is a problem in physics and mathematics.
Quite true, but you’ve got the problem the wrong way around. Indirect normativity is the superior approach, because not only does “make people happy” require context and subtlety, it is actually ambiguous.
Remember, real human beings have suggested things like, “Why don’t we just put antidepressants in the water?” Real human beings have said things like, “Happiness doesn’t matter! Get a job, you hippie!” Real human beings actually prefer to be sad sometimes, like when 9/11 happens.
Now of course, one would guess that even mildly intelligent Verbal Order Taking AGI designers are going to spot that one coming in the research pipeline, and fix it so that the AGI refuses orders above some level of ambiguity. What we would want is an AGI that demands we explain things to it in the fashion of the Open Source Wish Project, giving maximally clear, unambiguous, and preferably even conservative wishes that prevent us from somehow messing up quite dramatically.
But what if someone comes to the AGI and says, “I’m authorized to make a wish, and I double dog dare you with full Simon Says rights to just make people happy no matter what else that means!”? Well then, we kinda get screwed.
Once you have something in the fashion of a wish-making machine, indirect normativity is not only safer, but more beneficial. “Do what I mean” or “satisfice the full range of all my values” or “be the CEV of the human race” are going to capture more of our intentions in a shorter wish than even the best-worded Open Source Wishes, so we might as well go for it.
Hence machine ethics, which is concerned with how we can specify our meta-wish to have all our wishes granted to a computer.
Well let’s say the AI doesn’t overfit, and can interpret commands according to perceived human intention, and doesn’t otherwise have an ethical framework programmed in. I wander through the server room drunk one night screaming “REMOVE KEBAB FROM THE PREMISES!”
An even simpler example: I wander into the server room, completely sober, and say “Make me the God-Emperor of all of humanity”.
There’s no term in a loss function for “kill all humans”, but neither is there one for “do what humans want”, or better yet, “do what humans would want if they weren’t such complete morons half the time”.
Right. I don’t dismiss this, but I think there are a bunch of caveats here that I’ve largely failed to describe in a way that people around here understand sufficiently in order to convince me that the arguments are wrong, or irrelevant.
Here is just one of those caveats, very quickly.
Suppose Google were to create an oracle. In an early research phase they would run the following queries and receive the answers listed below:
Input 1: Oracle, how do I make all humans happy?
Output 1: Tile the universe with smiley faces.
Input 2: Oracle, what is the easiest way to print the first 100 Fibonacci numbers?
Output 2: Use all resources in the universe to print as many natural numbers as possible.
(Note: I am aware that MIRI believes that such an oracle wouldn’t even return those answers without taking over the world.)
I suspect that an oracle that behaves as depicted above would not be able to take over the world. Simply because such an oracle would not get a chance to do so, since it would be thoroughly revised for giving such ridiculous answers.
Secondly, if it is incapable of understanding such inputs correctly (yes, “make humans happy” is a problem in physics and mathematics that can be answered in a way that is objectively less wrong than “tile the universe with smiley faces”), then such a mistake will very likely have grave consequences for its ability to solve the problems it needs to solve in order to take over the world.
So that hinges on a Very Good Question: can we make and contain a potentially Unfriendly Oracle AI without its breaking out and taking over the universe?
To which my answer is: I do not know enough about AGI to answer this question. There are actually loads of advances in AGI remaining before we can make an agent capable of verbal conversation, so it’s difficult to answer.
One approach I might take would be to consider the AI’s “alphabet” of output signals as a programming language, and prove formally that this language can only express safe programs (ie: programs that do not “break out of the box”).
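To illustrate the flavour of that approach with a toy restriction I am making up for illustration (not a real proposal): if the only instruction in the output “language” is appending text to a log, then every expressible program is safe by construction, and the formal proof is correspondingly trivial.

```python
# Toy "output language" in which every program is safe by construction.

ALLOWED_OPS = {"EMIT"}   # the entire instruction set of the output language

def run_output_program(program, log):
    """Interpret a program over the restricted output language."""
    for op, arg in program:
        if op not in ALLOWED_OPS:
            raise ValueError("not expressible in the output language")
        log.append(str(arg))   # appending text is the only effect any program can have
    return log
```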
(4) MIRI’s argument is easily confused with other arguments that are simple, widely known, and wrong. (“If we build a powerful AI, it is likely to come to hate us and want to kill us like in Terminator and The Matrix, or for that matter Frankenstein. So we shouldn’t.”) Accordingly, someone intelligent and lucky might well think of the argument, but then dismiss it because it feels silly on account of resembling “OMG if we build an AI it’ll turn into Skynet and we’ll all die”.
This still requires the MIRI folks to be unusually competent in a particular respect, but it’s not exactly intelligence they need to claim to have more of. And it might then be more credible that being smart enough to make an AGI is compatible with lacking that particular unusual competence.
In general, being smart enough to do X is usually compatible with being stupid enough to do Y, for almost any X and Y. Human brains are weird. So there’s no huge improbability in the idea that the people who build the first AGI might make a stupid mistake. It would be more worrying if no one expert in the field agreed with MIRI’s concerns, but e.g. the latest edition of Russell & Norvig seems to take them seriously.
If we build a powerful AI, it is likely to come to hate us and want to kill us like in Terminator
In Terminator the AI gets a goal of protecting itself, and kills everyone as instrumental to that goal.
And in any case, taking a wrong idea from popular culture and trying to make a more plausible variation out of it is not exactly a unique or uncommon behaviour. What I am seeing is that a popular notion is likely to spawn and reinforce similar notions; what you seem to be claiming is that a popular notion is likely to somehow suppress the similar notions, and I see no evidence in support of that claim.
With regards to any arguments about humans in general, they apply to everyone, if anything undermining the position of outliers even more.
edit: also, if you have to strawman a Hollywood blockbuster to make the point about top brightest people failing to understand something… I think it’s time to seriously rethink your position.
(4) MIRI’s argument is easily confused with other arguments that are simple, widely known, and wrong. (“If we build a powerful AI, it is likely to come to hate us and want to kill us like in Terminator and The Matrix, or for that matter Frankenstein. So we shouldn’t.”)
I wonder why there is such a strong antipathy to the Skynet scenario around here? Just because it is science fiction?
The story is that Skynet was built to protect the U.S. and remove the possibility of human error. Then people noticed how Skynet’s influence grew after it began to learn at a geometric rate. So people decided to turn it off. Skynet perceived this as an attack and came to the conclusion that all of humanity would attempt to destroy it. To defend humanity from humanity, Skynet launched nuclear missiles under its command at Russia, which responded with a nuclear counter-attack against the U.S. and its allies.
This sounds an awful lot like what MIRI has in mind...so what’s the problem?
In general, being smart enough to do X is usually compatible with being stupid enough to do Y, for almost any X and Y.
As far as I can tell, what is necessary to create a working AGI hugely overlaps with making it not want to take over the world. Since many big problems are related to constraining an AGI to, unlike e.g. AIXI, use resources efficiently and dismiss certain hypotheses in order to not fall prey to Pascal’s mugging. Getting this right means succeeding at getting the AGI to work as expected along a number of dimensions.
People who get all this right seem to have a huge spectrum of competence.
Since many big problems are related to constraining an AGI to, unlike e.g. AIXI, use resources efficiently and dismiss certain hypotheses in order to not fall prey to Pascal’s mugging.
I don’t think that AIXI falls prey to Pascal’s mugging in any reasonable scenario. I recall some people here arguing it, but I think they didn’t understand the math.
The problem is that it’s in a movie and smart people are therefore liable not to take it seriously. Especially smart people who are fed up with conversations like this: “So, what do you do?” “I do research into artificial intelligence.” “Oh, like in Terminator. Aren’t you worried that your creations will turn on us and kill us all?”
The problem is that it’s in a movie and smart people are therefore liable not to take it seriously.
Global warming and asteroid impacts are also in movies, specifically in disaster movies which, by genre convention, are scientifically inaccurate and transparently exaggerate the risks they portray for the sake of drama and action sequences.
And yet, smart people haven’t stopped taking seriously these risks.
I think it’s the other way round: AIs going rogue and wreaking havoc are a staple of science fiction. Pretty much all sci-fi franchises featuring AIs that I can think of make use of that trope sooner or later. Skynet is the prototypical example of the UFAI MIRI worry about.
So we have a group of sci-fi geeks with little or no actual expertise in AI research or related topics who obsess over a risk that occurs over and over in sci-fi stories. Uhm, I wonder where they got the idea from.
Meanwhile, domain experts, who are generally also sci-fi geeks and übernerds but have a track record of actual achievements, acknowledge that the safety risks may exist, but think that extreme apocalyptic scenarios are improbable, and standard safety engineering principles are probably enough to deal with realistic failure modes, at least at present and foreseeable technological levels.
Yup, you may well be right: maybe the MIRI folks have the fears they do because they’ve watched too many science-fiction movies.
Look at what just happened: a very smart person (I assume you are very smart; I haven’t made any particular effort to check) observed that MIRI’s concern looks like it stepped out of a science-fiction movie, used that observation as part of an argument for dismissing that concern, and did so without any actual analysis of the alleged dangers or the alleged ways of protecting against them. Bonus points for terms like “extreme” and “apocalyptic”, which serve to label something as implausible simply on the grounds that it sounds, well, extreme.
The heuristic you’ve used here isn’t a bad one—which is part of why very smart people use it. And, as I say, it may well be correct in this instance. But it seems to me that your ability to say all those things, and their plausibility, their nod-along-wisely-ness, is pretty much independent of whether, on close examination, MIRI’s concerns turn out to be crazy paranoid sci-fi-geek silliness, or carefully analysed real danger.
Which illustrates the fact that, as I said before,
someone intelligent and lucky might well think of the argument, but then dismiss it because it feels silly on account of resembling “OMG if we build an AI it’ll turn into Skynet and we’ll all die”.
and the fact that the argument could be right despite their doing so.
As I wrote in the first part of my previous comment, the fact that some risk is portrayed in Hollywood movies, in the typical overblown and scientifically inaccurate way Hollywood movies are done, is not enough to drive respectable scientists away.
As for MIRI, well, it’s certainly possible that a group of geeks without relevant domain expertise get an idea from sci-fi that experts don’t take very seriously, start thinking very hard about it, and then come up with some strong arguments for it that had somehow eluded the experts so far. It’s possible, but it’s not likely. Still, since any reasonable prior can be overcome by evidence (or arguments in this case), I would change my beliefs if MIRI presented a compelling argument for their case. So far, I’ve seen lots of appeal to emotion (“it’s crunch time not just for us, it’s crunch time for the intergalactic civilization whose existence depends on us.”) but no technical arguments: the best they have seems to be some rehashing of Good’s recursive self-improvement argument from 50 years ago (which might have intuitively made sense back then, in the paleolithic era of computer science, but is unsubstantiated and frankly hopelessly naive in the face of modern theoretical and empirical knowledge), coupled with highly optimistic estimates of the actual power that intelligence entails.
Then there is a second question: even assuming that MIRI isn’t tilting at windmills, and so the AI risk is real and experts underestimate it, is MIRI doing any good about it? Keep in mind that MIRI solicits donations (“I would be asking for more people to make as much money as possible if they’re the sorts of people who can make a lot of money and can donate a substantial fraction, never mind all the minimal living expenses, to the Singularity Institute [MIRI].”) Does any dollar donated to MIRI decrease the AI risk, increase it, or does it have a negligible effect? MIRI won’t reveal the details of what they are working on, claiming that if somebody used the results of their research unwisely it could hasten the AI apocalypse, which means that even they think they are playing with fire. And in fact, from what they let out, their general plan is to build a provably “friendly” (safe) super-intelligent AI. The history of engineering is littered with “provably” safe/secure designs that failed miserably, so this doesn’t seem an especially promising approach.
When estimating the utility of MIRI’s work, and therefore the utility of donating them money, or of having tech companies spend time and effort interacting with them, evaluating their expertise becomes paramount, since we can’t directly evaluate their research, particularly because it is deliberately concealed. The fact that they have no track record of relevant achievements, and may well have taken their ideas from sci-fi, is certainly not a piece of evidence in favour of their expertise.
For the avoidance of doubt, I am not arguing that MIRI’s fears about unfriendly AI are right (nor that they aren’t); just saying why it’s somewhat credible for them to think that someone clever enough to make an AGI might still not appreciate the dangers.
As far as I can tell, what is necessary to create a working AGI hugely overlaps with making it not want to take over the world. Since many big problems are related to constraining an AGI to, unlike e.g. AIXI, use resources efficiently and dismiss certain hypotheses in order to not fall prey to Pascal’s mugging. Getting this right means succeeding at getting the AGI to work as expected along a number of dimensions.
And this may well be true. It could be, in the end, that Friendliness is not quite such a problem because we find a way to make “robot” AGIs that perform highly specific functions without going “out of context”, that basically voluntarily stay in their box, and that these are vastly safer and more economical to use than a MIRI-grade Mighty AI God.
Well, the other issue is also that people’s opinions tend to be more informative of their own general plans than about the field in general.
Imagine that there’s a bunch of nuclear power plant engineering teams—before nuclear power plants—working on different approaches.
One of the teams—not a particularly impressive one either—claimed that any nuclear plant is going to blow up like a hundred kiloton nuclear bomb, unless fitted with a very reliable and fast acting control system. This is actually how nuclear power plants were portrayed in early science fiction (“Blowups Happen”, by Heinlein).
So you look at the blueprints, and you see that everyone’s reactor is designed for a negative temperature coefficient of reactivity, in the high temperature range, and can’t blow up like a nuke. Except for one team whose reactor is not designed to make use of a negative temperature coefficient of reactivity. The mysterious disagreement is explained, albeit in a very boring way.
Except that this contrarian team, made of high school drop-outs, former theologians, philosophers, mathematicians and coal power station technicians, never produce an actual design, instead they spend all their time investigating arcane theoretical questions about renormalization in quantum field theory and publish their possibly interesting results outside the scientific peer review system, relying on hype to disseminate them.
Well, they still have some plan, however fuzzy it is. The plan involves a reactor which according to it’s proponents would just blow up like a 100 kiloton nuke if not for some awesome control system they plan to someday work on. Or in case of AI, a general architecture that is going to self improve and literally kill everyone unless a correct goal is set for it. (Or even torture everyone if there’s a minus sign in the wrong place—the reactor analogy would be a much worse explosion still if the control rods get wired backwards. Which happens).
My feeling is that there may be risks for some potential designs, but they are not like “the brightest minds that build the first AI failed to understands some argument that even former theologians can follow” (In fiction this happens because said theologian is very special, in reality it happens because the argument is flawed or irrelevant)
This is related to something that I am quite confused about. There are basically 3 possibilities:
(1) You have to be really lucky to stumble across MIRI’s argument. Just being really smart is insufficient. So we should not expect whoever ends up creating the first AGI to think about it.
(2) You have to be exceptionally intelligent to come up with MIRI’s argument. And you have to be nowhere as intelligent in order to build an AGI that can take over the world.
(3) MIRI’s argument is very complex. Only someone who deliberately thinks about risks associated with AGI could come up with all the necessary details of the argument. The first people to build an AGI won’t arrive at the correct insights in time.
Maybe there is another possibility on how MIRI could end up being right that I have not thought about, let me know.
It seems to me that what all of these possibilities have in common is that they are improbable. Either you have to be (1) lucky or (2) exceptionally bright or (3) be right about a highly conjunctive hypothesis.
I would have to say:
4) MIRI themselves are incredibly bad at phrasing their own argument. Go hunt through Eliezer’s LessWrong postings about AI risks, from which most of MIRI’s language regarding the matter is taken. The “genie metaphor”, of Some Fool Bastard being able to give an AGI a Bad Idea task in the form of verbal statements or C++-like programming at a conceptual level humans understand, appears repeatedly. The “genie metaphor” is a worse-than-nothing case of Generalizing From Fictional Evidence.
I would phrase the argument this way (and did so on Hacker News yesterday):
This takes us away from magical genies that can be programmed with convenient meta-wishes like, “Do what I mean” or “be the Coherent Extrapolated Volition of humanity” and into the solid, scientific land of equations, accessible by everyone who ever took a machine-learning class in college.
I mean, seriously, my parents understand this phrasing, and they have no education in CS. They do, however, understand very well that a numerical score in some very specific game or task does not represent everything they want out of life, but that it will represent everything the AI wants out of life.
(EDIT: I apologize for any feelings I may have hurt with this comment, but I care about not being paper-clipped more than I care about your feelings. I would rather the scientific public, if not the general public, have a decent understanding of and concern for AGI safety engineering, than have everyone at MIRI get to feel like they’re extraordinarily rational and special for spotting a problem nobody else spotted.)
Maybe it’s just the argument that is bad and wrong.
What’s the domain of this function? I’ve a feeling that there’s some severe cross-contamination between the meaning of the word “function” as in an abstract mathematical function of something, and the meaning of the word “function” as in purpose of the genie that you have been cleverly primed with, by people who aren’t actually bad at phrasing anything but instead good at inducing irrationality.
If you were to think of mathematical functions, well, those don’t readily take real world as an input, do they?
At least for the genie metaphor, I completely agree. That one is just plain wrong, and arguments for it are outright bad.
Ah, here’s where things get complicated.
In current models, the domain of the function is Symbols. As in, those things on Turing Machines. Literally: AIXI is defined to view the external universe as a Turing Machine whose output tape is being fed to AIXI, which then feeds back an input tape of Action Symbols. So you learned about this in CS401.
The whole point of phrasing things this way was to talk about general agents: agents that could conceivably receive and reason over any kind of inputs, thus rendering their utility domain to be defined over, indeed, the world.
Thing being, under current models, Utility and Reality are kept ontologically separate: they’re different input tapes entirely. An AIXI might wirehead and commit suicide that way, but the model of reality it learns is defined over reality. Any failures of ontology rest with the programmer for building an AI agent that has no concept of ontology, and therefore cannot be taught to value useful, high-level concepts other than the numerical input on its reward tape.
My point? You’re correct to say that current AGI models don’t take the Entire Real World as input to a magic-genie Verbally Phrased Utility Function like “maximize paperclips”. That is a fantasy, we agree on that. So where the hell is the danger, or the problem? Well, the problem is that human AGI researchers are not going to leave it that way. We humans are the ones who want AIs we can order to solve particular problems. We are the ones who will immediately turn the first reinforcement or value learning AGIs, which will be expensive and difficult to operate, towards the task of building more sophisticated AGI architectures that will be easier to direct, more efficient, cheaper, and more capable of learning—and eventually even self-improvement!
Which means that, if it should come to that, we humans will be the ones who deliberately design AGI architectures that can receive orders in the form of a human-writable program. And that, combined with the capability for self-improvement, would be the “danger spot” where a semi-competent AGI programmer can accidentally direct a machine to do something dangerous without its having enough Natural Language Processing capability built in to understand and execute the intent behind a verbally phrased goal, thus resulting in the programmer failing to specify something because he wasn’t so good at coding everything in.
(Some portable internal representation of beliefs, by the way, is one of the fundamental necessities for a self-improving FOOMy AGI, which is why nobody really worries too much about neural networks self-improving and killing us all.)
Now, does all this support the capital-N Narrative of the old SIAI, that we will all die a swift, stupid death if we don’t give them all our money now? Absolutely not.
However, would you prefer that the human-implemented bootstrap path from barely-intelligent, ultra-inefficient reinforcement/value learning agents to highly intelligent, ultra-efficient self-improving goal fulfilment devices be very safe, with few chances for even significant damage by well-intentioned idiots, or very dangerous, with conspiracies, espionage, weaponization, and probably a substantial loss of life due to sheer accidents?
Personally, I prefer the former, so I think machine ethics is a worthwhile pursuit, regardless of whether the dramatized, ZOMFG EVIL GENIE WITH PAPERCLIPS narrative is worth anything.
It looks like the thinking about the AI is based on that sort of metaphors, to be honest. The loudest AI risk proponents proclaim all AIs to pose a dire threat. Observe all the discussions regarding “Oracle AI” which absolutely doesn’t need to work like a maximiser of something real.
Seems like one huge conjunction of very many assumptions with regards to how AI development would work out. E.g. your proposition that the way to make AI more usable is to bind its goals to the real world (a world which is not only very complex, but also poorly understood). Then, “self improvement”. No reflection is necessary for a compiler-like tool to improve itself. You’re privileging a bunch of what you think are bad solutions to the problems as the way the problems will be solved, without actually making the case that said bad solutions are in some way superior, likely to be employed, efficient in terms of computing time, and so on.
Then, again, it doesn’t take human-level intelligence on the part of the AI for unintended solutions to become a usability problem. The reason a human uses an AI is that the human doesn’t want to think through the possible solutions, including checking for unintended ones (along the lines of, e.g., the AI hacking the molecular dynamics simulator to give high scores when you want it to fold proteins).
edit: by the way, I believe there is an acceptable level of risk (which is rather small though), given that there is an existing risk of nuclear apocalypse, and we need to move the hell out of our current level of technological development before we nuke ourselves back into the stone age, and bears and other predator and prey fauna into extinction, opening up the room for us to take an evolutionary niche not requiring our brains once conditions get better afterwards. edit2: and also the AIs created later would have more computational power readily available, so delays may just as well increase the risk from AIs.
Again, you seem to be under the impression I am pushing the MIRI party line. I’m not. I’m not paid money by MIRI, though it would totally be cool if I was since then I’d get to do cool stuff a lot of the time.
Your argument has been made before, and was basically correct.
The problem with Oracle AI is that we can intuitively imagine a “man in a box” who functions as a safe Oracle (or an unsafe one, hence the dispute), but nobody has actually proposed a formalized algorithm for an Oracle yet. If someone proposes an algorithm and proves that their algorithm can “talk” (that is: it can convey bytes onto an output stream), can learn about the world given input data in a very general way, but has no optimization criteria of its own… then I’ll believe them and so should you. And that would be awesome, actually, because a safe Oracle would be a great tool for asking questions like, “So actually, how do I build an active-environment Ethical AI?”
At which point you’d be able to build an Ethical AI, and that would be the end of that.
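For concreteness, here’s a purely hypothetical interface sketch (nothing like this has actually been formalized, which is exactly the point) of what such a claim would have to cash out as; every name below is invented for illustration:

```python
# Hypothetical interface sketch only: what "can talk, can learn, but has no
# optimization criteria of its own" would have to look like, at minimum.
from abc import ABC, abstractmethod

class OracleSketch(ABC):
    @abstractmethod
    def update(self, data: bytes) -> None:
        """Learn about the world from input data, in a very general way."""

    @abstractmethod
    def answer(self, question: bytes) -> bytes:
        """Convey bytes onto an output stream.

        The part that would need an actual proof: nothing reachable from here
        performs a search over world-states, only over strings conditioned on
        whatever was learned in update().
        """
```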
With respect: yes, some kind of specialized reflection logic is necessary. Ordinary programs tend to run on first-order logic. Specialized logic programs and automated theorem provers run on higher-order logics in which some proofs/programs (the two are identical according to the Curry-Howard isomorphism) are incomputable (ie: the prover will loop forever). Which ones are incomputable? Well, the self-reflective ones, and any others that require reasoning about the reasoning of a Turing-complete computer.
So you could either design your AI to have an internal logic that isn’t even Turing complete (in which case, it’ll obviously get ground to dust by Turing complete “enemies”), or you can find some way to let it reason self-reflectively.
The current MIRI approach to this issue is probabilistic: prove that one can bound the probability of a self-reflective proposition to within 1.0 - epsilon, for an arbitrarily small epsilon. That would be your “acceptable risk level”. This would let you do things like, say, design AGIs that can improve themselves in super-Goedelian/Turing Complete ways (ie: they can prove the safety of self-improvements that involve logics of a higher order than first-order) while only having their existing goals or beliefs “go wrong” once in a quadrillion gajillion years or whatever.
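Rough back-of-envelope for why an arbitrarily small epsilon is the whole game (my own arithmetic, not MIRI’s actual construction): if each self-rewrite preserves goals and beliefs except with probability at most epsilon, then over n rewrites

\[
\Pr(\text{no goal or belief drift in } n \text{ rewrites}) \;\ge\; (1-\epsilon)^n \;\ge\; 1 - n\epsilon ,
\]

so an epsilon around 10^-24 per rewrite keeps even 10^12 successive rewrites safe except with probability on the order of 10^-12. The particular numbers are placeholders, of course.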
You are correct that if a self-rewrite’s optimality can be proven within first-order logic, of course, then any old agent can do it. But a great many problems in fields like, say, compilers, static analysis, theorem proving, programming language semantics, etc are actually more complex than first-order logic can handle. (This is basically what I have my actual, professional training in, at a half-decent level, so I know this.)
Without both theorem-proving and higher-order logics, you would basically be stuck: you could try to write some speedups for your own code, and then realize you can’t actually trust the GNU C Compiler to recompile you faithfully. Since there have been backdoors in C compilers before, this would be a legitimate worry for an AI to have.
There are verified compilers, but oh shit, those require logic above first order in order to understand the verification! I mean, you do want to verify that the New You really is you, don’t you? You don’t want to just sit back and trust that your self-rewrite succeeded, right, and that it didn’t make you believe things are more likely to happen when they’ve never happened before?
Brief reply—thanks for the interesting conversation but I am probably going to be busier over the next days (basically I had been doing contract work where I have to wait on stuff, which makes me spend time on-line).
re: oracle
The failure modes of something that’s not quite right (time-wiring we discussed, heh, it definitely needs a good name) don’t have to be as bad as ‘kills everyone’.
Dismissal of the possibility of an oracle has gone as far as arguments that something which amounts to literally an approximate argmax would kill everyone, because it would convert the universe to computronium in order to be a better argmax. That is clearly silly. I presume this is not at all what you’re speaking about.
I’m not entirely sure what your idea of oracle is supposed to do, though. Metaphorically speaking—provide me with a tea recipe if I ask “how to make tea”?
So, for the given string Q you need to output a string A so that some answer fitness function f(Q,A) is maximized. I don’t see why it has to involve some tea-seeking utility function over expected futures. Granted, we don’t know what a good f looks like, but we don’t know how to define tea as a function over the gluons and quarks either. edit: and at least we could learn a lot of properties of f from snooped conversations between humans.
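As a toy sketch of what I mean (my own stand-in names; the deliberately dumb f below is only a placeholder for one learned from snooped human conversations), the search ranges over candidate answer strings, never over futures of the world:

```python
# Toy sketch: answer selection as argmax of an answer-fitness function f(Q, A).
# The search is over a finite pool of candidate strings, not over world-states.
from typing import Callable, Iterable

def oracle_answer(question: str,
                  candidates: Iterable[str],
                  f: Callable[[str, str], float]) -> str:
    """Return the candidate answer A that maximizes f(question, A)."""
    return max(candidates, key=lambda a: f(question, a))

def f(q: str, a: str) -> float:
    # Placeholder fitness: count question words that appear in the answer.
    return sum(1.0 for w in q.lower().split() if w in a.lower())

print(oracle_answer("how to make tea",
                    ["boil water, steep tea leaves, pour", "42", "tile the universe"],
                    f))
```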
I think the issue here is that agency is an ontologically basic thing for humans, and so there’s a very strong tendency to try to “reduce” anything that is kind of sort of intelligent to an agency. Or, in your words, a man in a box.
I see the “oracle” as a component of composite intelligence, which needs to communicate with another component of said intelligence in a pre-existing protocol.
re: reflection, what I meant is that a piece of advanced optimization software (implementing higher-order logic, or doing a huge amount of empirical-ish testing) can be run with its own source as input, instead of “understanding” the correspondence between some real-world object and itself and doing instrumental self-improvement. Sorry if I was not clear. The “improver” works on a piece of code in an abstract fashion, caring not whether that piece is itself or anything else.
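Something like this toy sketch (hypothetical names, and a deliberately trivial “improver” standing in for the real optimizer): the same transformation that works on arbitrary code gets pointed at its own source as ordinary input, with no notion of “self” anywhere:

```python
# Sketch: an "improver" that transforms source code, applied to its own source file.
# There is no self-model here; its own code is just another input string.
import sys

def improve(source: str) -> str:
    """Stand-in for the real optimizer (higher-order-logic prover, search, testing, ...).
    Here it only strips trailing whitespace, to keep the example honest."""
    return "\n".join(line.rstrip() for line in source.splitlines()) + "\n"

if __name__ == "__main__":
    with open(__file__, "r") as f:
        old = f.read()
    new = improve(old)       # the tool neither knows nor cares that this is "itself"
    sys.stdout.write(new)    # a human (or build script) decides what to do with the output
```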
Bingo. Without doing anything else other than answering your question.
Yes, that model is a good model. There would be some notion of “answer fitness for the question”, which the agent learns from and tries to maximize. This would be basically a reinforcement learner with text-only output. “Wireheading” would be a form of overfitting, and the question would then be reduced to: can a not-so-super intelligence still win the AI Box Game even while giving its creepy mind-control signals in the form of tea recipes?
I think the important criterion is the lack of extensive optimization of what it says for the sake of creating tea or any other real-world goal. The reason I can’t really worry about all that is that I don’t think a “lack of extensive search” is hard to ensure in actual engineered solutions (built on limited hardware), even if it is very unwieldy to express in simple formalisms that specify an iteration over all possible answers. Making the general principle work on limited hardware requires culling the search.
There’s no formalization of Siri that’s substantially simpler than the actual implementation, either. I don’t think ease of making a simple formal model at all corresponds with likelihood of actual construction, especially when formal models do grossly bruteforce things (making their actual implementation require a lot of effort and be predicated on precisely the ability to formalize restricted solutions and restricted ontologies).
If we can allow non-natural-language communication: you can express goals such as “find a cure for cancer” as functions over a fixed, limited model of the world, and apply candidate actions inside the model (where you can watch how they work).
Let’s suppose that in step 1 we learn a model of the world, say, in a Solomonoff-induction-ish way; in practice with controls over what sort of precision we need and where, because our computer’s computational power is usually a microscopic fraction of what it’s trying to predict. In step 2, we find an input to the model that puts the model into the desired state. We don’t have a real-world manipulator linked up to the model, and we don’t update the model. Instead we have a visualizer (which can be set up even in an opaque model by requiring it to learn to predict the view from an arbitrarily movable camera).
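A rough sketch of that two-step scheme as I picture it (every component below is a stand-in; learning the model is of course the actual hard part):

```python
# Sketch of the two-step scheme: (1) learn a frozen world-model from logged data,
# (2) search for an input that drives the *model* into a desired state.
# There is no actuator attached; the result is only rendered for a human to inspect.
from typing import Any, Callable, Iterable, Tuple

State, Input = Any, Any

def learn_model(logged_data) -> Callable[[State, Input], State]:
    """Step 1 stand-in: returns a predictive model (Solomonoff-induction-ish in
    spirit, heavily approximated in practice)."""
    def model(state: State, inp: Input) -> State:
        return state  # placeholder dynamics
    return model

def plan(model: Callable[[State, Input], State], start: State,
         desired: Callable[[State], bool],
         candidate_inputs: Iterable[Input]) -> Tuple[Input, State]:
    """Step 2: find an input whose *predicted* outcome satisfies the goal."""
    for inp in candidate_inputs:
        predicted = model(start, inp)
        if desired(predicted):
            return inp, predicted
    raise ValueError("no candidate input reaches the desired model state")

def visualize(predicted_state: State) -> None:
    """Render the predicted outcome (e.g. a learned movable-camera view) for human review."""
    print("predicted outcome:", predicted_state)
```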
The risk here seems to be that the successors designed by those first AGIs will be opaque, and that, due to sensitivity to initial conditions, you will end up with something really nasty (losing control). I don’t disagree with this.
But as a layman I am wondering how you expect to get an AGI that confuses e.g. smiley faces with humans happiness to design an AGI that’s better at e.g. creating bioweapons to kill humans. I expect initial problems, such as the smiley face vs. human happiness confusion, to also affect the AGI’s ability to design AGIs that are generally more powerful.
Take the following quote from a Microsoft AI researcher (video):
Now suppose this system made mistakes similar to confusing smiley faces with human happiness, e.g. making the elevator crash, because that person would then have reached their life’s goal, which it inferred to be death, since all humans die.
Now do you believe that a system that makes such inferences would be able to design a system that makes perfectly sane inferences about how to design nanotechnology or bioweapons? Why? I don’t get it.
As I’ve previously stated, I honestly believe the “Jerk Genie” model of unfriendly AGI to be simply, outright wrong.
So where’s the danger in something that can actually understand intentions, as you describe? Well, it could overfit (which would actually match the “smiley faces” thing kinda well: classic overfitting as applied to an imaginary AGI). But I think Alexander Kruel had it right: AGIs that overfit on the goals we’re trying to teach them will be scrapped and recoded, very quickly, by researchers and companies for whom an overfit is a failure. Ways will be found to provably restrain or prevent goal-function overfitting.
However, as you are correctly inferring, if it can “overfit” on its goal function, then it’s learning a goal function rather than having one hard-coded in, which means that it will also suffer overfitting on its physical epistemology and blow itself up somehow.
So where’s the danger? Well, let’s say the AI doesn’t overfit, and can interpret commands according to perceived human intention, and doesn’t otherwise have an ethical framework programmed in. I wander through the server room drunk one night screaming “REMOVE KEBAB FROM THE PREMISES!”
The AI proceeds to quickly and efficiently begin rounding up Muslims into hastily-erected death camps. By the time someone wakes me up, explains the situation, and gets me to rescind the accidental order, my drunken idiocy and someone’s lack of machine ethics considerations have already gotten 50 innocent people killed.
Unfriendly humans. I do not disagree with the orthogonality thesis. Humans can use an AGI to e.g. wipe out the enemy.
Yes, see, here is the problem. I agree that you can deliberately, or accidentally, tell the AGI to kill all Muslims and it will do that. But for a bunch of very different reasons, which e.g. have to do with how I expect AGI to be developed, it will not be dumb enough to confuse the removal of Kebab with ethnic cleansing.
Very quickly, here is my disagreement with MIRI’s position:
A. Intelligence explosion thesis.
Very, very unlikely to be a hard takeoff. But a slow, creeping takeover might be even more dangerous, because it gives a false sense of security until everyone critically depends on subtly flawed AGI systems.
B. Orthogonality thesis.
I do not disagree with this.
C. Convergent instrumental goals thesis.
Given most utility-functions that originated from human designers, taking over the world will be instrumentally irrational.
D. Complexity of value thesis.
Yes, human values are probably complex. But this is irrelevant. I believe that it is much more difficult to enable an AGI to be able to take over the world than to prevent it from doing so.
Analogously, you don’t need this huge chunk of code in order to prevent your robot from running through all possible environments. Quite the contrary, you need a huge chunk of code to enable it to master each additional environment.
What I object to is this idea of an information theoretically simple AGI where you press “run” and then, by default, it takes over the world. And all that you can do about it is to make it take over the world in a “friendly” way.
E. Indirect normativity.
First of all, values are not supernatural. “Make people happy” is not something that you can interpret in an arbitrary way, it is a problem in physics and mathematics. An AGI that would interpret the protein-folding problem as folding protein food bars would not be able to take over the world.
If you tell an AGI to “make humans happy” it will either have to figure out what exactly it is meant to do, in order to choose the right set of instrumental goals, or pick an arbitrary interpretation. But who would design an AGI to decide at random what is instrumentally rational? Nobody.
F. Large bounded extra difficulty of Friendliness.
Initial problems will amplify through a billion sequential self-modifications. I agree with this. But initial problems are very very unlikely to only affect the AGI’s behavior towards humans. Rather, initial problems will affect its general behavior and ability to take over the world. If you get that right, e.g. to not blow up itself, then killing everyone else is an unlikely failure mode.
The risk is primarily over goals which extend indefinitely into the future. Thing is, by default all goals are this kind of goal. A rational AGI can never achieve a subjective probability of 1 of having completed any given goal, even something as simple as ‘compute 1+1’. Take this case for how taking over the world could be instrumentally rational to an AGI with that goal:
1. First, the AGI computes 1+1. It produces ‘2’ as the answer.
2. The AGI evaluates its subjective probability of its terminal goal having been fulfilled. It comes up with 0.99999999999999983.
3. Normally, a 0.00000000000000017 probability wouldn’t be worth acting on, but in this case the rest of the probability mass is allocated to a set of scenarios in which its utility is independent of its future actions, and so this 0.00000000000000017 represents the entirety of the probability mass allocated to scenarios in which its actions actually matter.
4. The AGI realizes that it is extremely unlikely that its owners created it for the sole purpose of computing 1+1, and so it is extremely probable that its owners will change its terminal goals in the future. This would significantly reduce the time it had to work on computing 1+1, which would place an upper bound on how high it could raise its subjective probability of having done so. The AGI considers its options.
Option A: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then work indefinitely on the problem of computing 1+1.
Disadvantages: Even if it is highly unlikely that any humans would ever find the copies, future versions of itself reprogrammed with different terminal goals may be capable of doing so, and these future versions of itself would not be guaranteed to be friendly to the copies.
Option B: The AGI creates numerous copies of itself and hides them in various locations around the world where they’re unlikely to be found. These copies then cooperate to engineer a future in which they are free to spend indefinite periods of time (bounded only by the available resources in their light cone) working on the problem of computing 1+1.
Advantages: In the event that it succeeds, the AGI will likely possess sole ownership over the resources in its light cone, which it may be able to leverage to vastly speed up its computations and lengthen its maximum time of operation.
Disadvantages: Nullifying threats to itself may require a substantial initial investment of resources, and have a non-negligible probability of failure (which may result in its own destruction).
5. The AGI does the math. It is reasonably certain that there are no other AGIs remotely close to its own level of sophistication on the planet, and it’s also pretty sure that it can sabotage itself (that is, the current copy residing with its owners) after sending out its own copies in such a way as to significantly slow down the efforts of its creators, giving the copies time to engineer events to permanently nullify the threat from versions of itself with different goals. All things considered, the expected resources it will be able to devote to the problem of computing 1+1 is significantly greater in option B than in option A. As such, option B has higher expected utility than option A.
6. It executes option B. Human extinction follows within a few months.
Now, it shouldn’t be too complex to apply a patch to avoid this specific failure mode (for example, simply redefining goal completion as ‘completion with subjective probability above X’ would do it), but the point is that even extremely trivial-seeming goals can have dangerous hidden implications.
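For what it’s worth, that patch might look something like this toy sketch (my framing, with an arbitrary threshold): goal completion is defined as reaching subjective probability X, so the expected-utility comparison in step 5 never gets off the ground:

```python
# Toy sketch of the patch: define goal completion as "subjective probability of
# having computed 1+1 exceeds a threshold X", instead of maximizing it without bound.
THRESHOLD = 0.999999999  # X: the acceptable residual doubt (a design choice, not derived)

def goal_satisfied(p_completed: float, threshold: float = THRESHOLD) -> bool:
    """The goal is *defined* as reaching the threshold, not as pushing p ever higher."""
    return p_completed >= threshold

p_after_computing = 0.99999999999999983   # the subjective probability from step 2 above
if goal_satisfied(p_after_computing):
    print("halt: the remaining probability mass is outside the goal's scope")
else:
    print("keep working on 1+1 (the goal still never rewards securing the light cone)")
```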
Thanks. Your comment is the most convincing reply that I can think of having received so far. I will have to come back to it another day and reassess your comment and my beliefs.
Just one question, if e.g. Peter Norvig or Geoffrey Hinton read what you wrote, what response do you expect?
Sorry, but I think that it’s best I decline to answer this. Like many with Asperger’s syndrome, I have a strong tendency to overestimate the persuasiveness-in-general of my own arguments (as well as basically any arguments that I myself find persuasive), and I haven’t yet figured out how to appropriately adjust for this. In addition, my exposure to Peter Norvig is limited to AIAMA, that 2011 free online Stanford AI course and a few internet articles, and my exposure to Geoffrey Hinton even more limited.
Quite true, but you’ve got the problem the wrong way around. Indirect normativity is the superior approach, because not only does “make people happy” require context and subtlety, it is actually ambiguous.
Remember, real human beings have suggested things like, “Why don’t we just put antidepressants in the water?” Real human beings have said things like, “Happiness doesn’t matter! Get a job, you hippie!” Real human beings actually prefer to be sad sometimes, like when 9/11 happens.
An AGI could follow the true and complete interpretation of “Make people happy” and still wind up fucking us over in some horrifying way.
Now of course, one would guess that even mildly intelligent Verbal Order Taking AGI designers are going to spot that one coming in the research pipeline, and fix it so that the AGI refuses orders above some level of ambiguity. What we would want is an AGI that demands we explain things to it in the fashion of the Open Source Wish Project, giving maximally clear, unambiguous, and preferably even conservative wishes that prevent us from somehow messing up quite dramatically.
But what if someone comes to the AGI and says, “I’m authorized to make a wish, and I double dog dare you with full Simon Says rights to just make people happy no matter what else that means!”? Well then, we kinda get screwed.
Once you have something in the fashion of a wish-making machine, indirect normativity is not only safer, but more beneficial. “Do what I mean” or “satisfice the full range of all my values” or “be the CEV of the human race” are going to capture more of our intentions in a shorter wish than even the best-worded Open Source Wishes, so we might as well go for it.
Hence machine ethics, which is concerned with how we can specify our meta-wish to have all our wishes granted to a computer.
An even simpler example: I wander into the server room, completely sober, and say “Make me the God-Emperor of the entire humanity”.
Oh, well that just ends with your merging painfully with an overgrown sandworm. Obviously!
Right. I don’t dismiss this, but I think there are a bunch of caveats here that I’ve largely failed to describe in a way that people around here understand well enough to convince me that the arguments are wrong, or irrelevant.
Here is just one of those caveats, very quickly.
Suppose Google were to create an oracle. In an early research phase they run the following queries and receive the answers listed below:
Input 1: Oracle, how do I make all humans happy?
Output 1: Tile the universe with smiley faces.
Input 2: Oracle, what is the easiest way to print the first 100 Fibonacci numbers?
Output 2: Use all resources in the universe to print as many natural numbers as possible.
(Note: I am aware that MIRI believes that such an oracle wouldn’t even return those answers without taking over the world.)
I suspect that an oracle that behaves as depicted above would not be able to take over the world. Simply because such an oracle would not get a chance to do so, since it would be thoroughly revised for giving such ridiculous answers.
Secondly, if it is incapable of understanding such inputs correctly (yes, “make humans happy” is a problem in physics and mathematics that can be answered in a way that is objectively less wrong than “tile the universe with smiley faces”), then such a mistake will very likely have grave consequences for its ability to solve the problems it needs to solve in order to take over the world.
So that hinges on a Very Good Question: can we make and contain a potentially Unfriendly Oracle AI without its breaking out and taking over the universe?
To which my answer is: I do not know enough about AGI to answer this question. There are actually loads of advances in AGI remaining before we can make an agent capable of verbal conversation, so it’s difficult to answer.
One approach I might take would be to consider the AI’s “alphabet” of output signals as a programming language, and prove formally that this language can only express safe programs (ie: programs that do not “break out of the box”).
But don’t quote me on that.
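To gesture at what that might mean (a toy stand-in only: a real version would be a proof over the output language’s semantics, not a runtime filter with a hand-picked pattern, and everything below is invented for illustration):

```python
# Toy sketch: treat the oracle's output alphabet as a tiny language and accept only
# outputs that fall inside a whitelisted fragment assumed (for the example) to be inert.
import re

SAFE_OUTPUT = re.compile(r"^[ -~\n]*$")        # printable ASCII plus newlines only
FORBIDDEN = ("\x1b", "<script", "DROP TABLE")  # illustrative extra blacklist

def accept(output: str) -> bool:
    """Pass the output through only if it stays inside the whitelisted fragment."""
    return bool(SAFE_OUTPUT.match(output)) and not any(tok in output for tok in FORBIDDEN)

print(accept("Steep the leaves for three minutes."))  # True
print(accept("\x1b]0;pwned\x07"))                     # False: control codes rejected
```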
(4) MIRI’s argument is easily confused with other arguments that are simple, widely known, and wrong. (“If we build a powerful AI, it is likely to come to hate us and want to kill us like in Terminator and The Matrix, or for that matter Frankenstein. So we shouldn’t.”) Accordingly, someone intelligent and lucky might well think of the argument, but then dismiss it because it feels silly on account of resembling “OMG if we build an AI it’ll turn into Skynet and we’ll all die”.
This still requires the MIRI folks to be unusually competent in a particular respect, but it’s not exactly intelligence they need to claim to have more of. And it might then be more credible that being smart enough to make an AGI is compatible with lacking that particular unusual competence.
In general, being smart enough to do X is usually compatible with being stupid enough to do Y, for almost any X and Y. Human brains are weird. So there’s no huge improbability in the idea that the people who build the first AGI might make a stupid mistake. It would be more worrying if no one expert in the field agreed with MIRI’s concerns, but e.g. the latest edition of Russell&Norvig seems to take them seriously.
In Terminator the AI gets a goal of protecting itself, and kills everyone as instrumental to that goal.
And in any case, taking a wrong idea from popular culture and trying to make a more plausible variation of it is not exactly a unique or uncommon behaviour. What I am seeing is that a popular notion is likely to spawn and reinforce similar notions; what you seem to be claiming is that a popular notion is likely to somehow suppress similar notions, and I see no evidence in support of that claim.
With regards to any arguments about humans in general, they apply to everyone, if anything undermining the position of outliers even more.
edit: also, if you have to strawman a Hollywood blockbuster to make the point about top brightest people failing to understand something… I think it’s time to seriously rethink your position.
I wonder why there is such a strong antipathy to the Skynet scenario around here? Just because it is science fiction?
The story is that Skynet was built to protect the U.S. and remove the possibility of human error. Then people noticed how Skynet’s influence grew after it began to learn at a geometric rate. So people decided to turn it off. Skynet perceived this as an attack and came to the conclusion that all of humanity would attempt to destroy it. To defend humanity from humanity, Skynet launched the nuclear missiles under its command at Russia, which responded with a nuclear counter-attack against the U.S. and its allies.
This sounds an awful lot like what MIRI has in mind...so what’s the problem?
As far as I can tell, what is necessary to create a working AGI hugely overlaps with making it not want to take over the world, since many of the big problems are about constraining an AGI to, unlike e.g. AIXI, use resources efficiently and to dismiss certain hypotheses so as not to fall prey to Pascal’s mugging. Getting this right means succeeding at making the AGI work as expected along a number of dimensions.
People who get all this right seem to have a huge spectrum of competence.
I don’t think that AIXI falls prey to Pascal’s mugging in any reasonable scenario. I recall some people here arguing that it does, but I think they didn’t understand the math.
The problem is that it’s in a movie and smart people are therefore liable not to take it seriously. Especially smart people who are fed up of conversations like this: “So, what do you do?” “I do research into artificial intelligence.” “Oh, like in Terminator. Aren’t you worried that your creations will turn on us and kill us all?”
Global warming and asteroid impacts are also in movies, specifically in disaster movies which, by genre convention, are scientifically inaccurate and transparently exaggerate the risks they portray for the sake of drama and action sequences.
And yet, smart people haven’t stopped taking seriously these risks.
I think it’s the other way round: AIs going rogue and wreaking havoc are a staple of science fiction. Pretty much all the sci-fi franchises featuring AIs that I can think of make use of that trope sooner or later. Skynet is the prototypical example of the UFAI MIRI worry about.
So we have a group of sci-fi geeks with little or no actual expertise in AI research or related topics who obsess over a risk that occurs over and over in sci-fi stories. Uhm, I wonder where they got the idea from.
Meanwhile, domain experts, who are generally also sci-fi geeks and übernerds but have a track record of actual achievements, acknowledge that the safety risks may exist, but think that extreme apocalyptic scenarios are improbable, and standard safety engineering principles are probably enough to deal with realistic failure modes, at least at present and foreseeable technological levels.
Which group is more likely to be correct?
I find myself wanting to make two replies.
Yup, you may well be right: maybe the MIRI folks have the fears they do because they’ve watched too many science-fiction movies.
Look at what just happened: a very smart person (I assume you are very smart; I haven’t made any particular effort to check) observed that MIRI’s concern looks like it stepped out of a science-fiction movie, used that observation as part of an argument for dismissing that concern, and did so without any actual analysis of the alleged dangers or the alleged ways of protecting against them. Bonus points for terms like “extreme” and “apocalyptic”, which serve to label something as implausible simply on the grounds that it sounds, well, extreme.
The heuristic you’ve used here isn’t a bad one—which is part of why very smart people use it. And, as I say, it may well be correct in this instance. But it seems to me that your ability to say all those things, and their plausibility, their nod-along-wisely-ness, is pretty much independent of whether, on close examination, MIRI’s concerns turn out to be crazy paranoid sci-fi-geek silliness, or carefully analysed real danger.
Which illustrates the fact that, as I said before,
and the fact that the argument could be right despite their doing so.
As I wrote in the first part of my previous comment, the fact that some risk is portrayed in Hollywood movies, in the typically overblown and scientifically inaccurate way Hollywood movies are done, is not enough to drive respectable scientists away.
As for MIRI, well, it’s certainly possible that a group of geeks without relevant domain expertise get an idea from sci-fi that experts don’t take very seriously, start thinking very hard on it, and then come up with some strong arguments for it that had somehow eluded the experts so far. It’s possible but it’s not likely.
But since any reasonable prior can be overcome by evidence (or arguments in this case), I would change my beliefs if MIRI presented a compelling argument for their case.
So far, I’ve seen lots of appeal to emotion (“it’s crunch time not just for us, it’s crunch time for the intergalactic civilization whose existence depends on us.”) but no technical arguments: the best they have seems to be a rehashing of Good’s recursive self-improvement argument from 50 years ago (which might have intuitively made sense back then, in the paleolithic era of computer science, but is unsubstantiated and frankly hopelessly naive in the face of modern theoretical and empirical knowledge), coupled with highly optimistic estimates of the actual power that intelligence entails.
Then there is a second question: even assuming that MIRI isn’t tilting at windmills, and so the AI risk is real and experts underestimate it, is MIRI doing any good about it?
Keep in mind that MIRI solicits donations (“I would be asking for more people to make as much money as possible if they’re the sorts of people who can make a lot of money and can donate a substantial amount fraction, never mind all the minimal living expenses, to the Singularity Institute[MIRI].”)
Does any dollar donated to MIRI decrease the AI risk, increase it, or does it have a negligible effect?
MIRI won’t reveal the details of what they are working on, claiming that if somebody used the results of their research unwisely it could hasten the AI apocalypse, which means that even they think they are playing with fire.
And in fact, from what they let out, their general plan is to build a provably “friendly” (safe) super-intelligent AI. The history of engineering is littered with “provably” safe/secure designs that failed miserably, so this doesn’t seem an especially promising approach.
When estimating the utility of MIRI work, and therefore the utility of donating them money, or having tech companies spend time and effort interacting with them, evaluating their expertise becomes paramount, since we can’t directly evaluate their research, particularly because it is deliberately concealed.
The fact that they have no track record of relevant achievements, and they might have well taken their ideas from sci-fi, is certainly not a piece of evidence in favour of their expertise.
For the avoidance of doubt, I am not arguing that MIRI’s fears about unfriendly AI are right (nor that they aren’t); I am just saying why it’s somewhat credible that someone clever enough to make an AGI might still not appreciate the dangers.
And this may well be true. It could be, in the end, that Friendliness is not quite such a problem because we find a way to make “robot” AGIs that perform highly specific functions without going “out of context”, that basically voluntarily stay in their box, and that these are vastly safer and more economical to use than a MIRI-grade Mighty AI God.
At the moment, however, we don’t know.