Dreams of Friendliness
Continuation of: Qualitative Strategies of Friendliness
Yesterday I described three classes of deep problem with qualitative-physics-like strategies for building nice AIs—e.g., the AI is reinforced by smiles, and happy people smile, therefore the AI will tend to act to produce happiness. In shallow form, three instances of the three problems would be:
Ripping people’s faces off and wiring them into smiles;
Building lots of tiny agents with happiness counters set to large numbers;
Killing off the human species and replacing it with a form of sentient life that has no objections to being happy all day in a little jar.
And the deep forms of the problem are, roughly:
A superintelligence will search out alternate causal pathways to its goals than the ones you had in mind;
The boundaries of moral categories are not predictively natural entities;
Strong optimization for only some humane values, does not imply a good total outcome.
But there are other ways, and deeper ways, of viewing the failure of qualitative-physics-based Friendliness strategies.
Every now and then, someone proposes the Oracle AI strategy: “Why not just have a superintelligence that answers human questions, instead of acting autonomously in the world?”
Sounds pretty safe, doesn’t it? What could possibly go wrong?
Well… if you’ve got any respect for Murphy’s Law, the power of superintelligence, and human stupidity, then you can probably think of quite a few things that could go wrong with this scenario. Both in terms of how a naive implementation could fail—e.g., universe tiled with tiny users asking tiny questions and receiving fast, non-resource-intensive answers—and in terms of what could go wrong even if the basic scenario worked.
But let’s just talk about the structure of the AI.
When someone reinvents the Oracle AI, the most common opening remark runs like this:
“Why not just have the AI answer questions, instead of trying to do anything? Then it wouldn’t need to be Friendly. It wouldn’t need any goals at all. It would just answer questions.”
To which the reply is that the AI needs goals in order to decide how to think: that is, the AI has to act as a powerful optimization process in order to plan its acquisition of knowledge, effectively distill sensory information, pluck “answers” to particular questions out of the space of all possible responses, and of course, to improve its own source code up to the level where the AI is a powerful intelligence. All these events are “improbable” relative to random organizations of the AI’s RAM, so the AI has to hit a narrow target in the space of possibilities to make superintelligent answers come out.
Now, why might one think that an Oracle didn’t need goals? Because on a human level, the term “goal” seems to refer to those times when you said, “I want to be promoted”, or “I want a cookie”, and when someone asked you “Hey, what time is it?” and you said “7:30” that didn’t seem to involve any goals. Implicitly, you wanted to answer the question; and implicitly, you had a whole, complicated, functionally optimized brain that let you answer the question; and implicitly, you were able to do so because you looked down at your highly optimized watch, that you bought with money, using your skill of turning your head, that you acquired by virtue of curious crawling as an infant. But that all takes place in the invisible background; it didn’t feel like you wanted anything.
Thanks to empathic inference, which uses your own brain as an unopened black box to predict other black boxes, it can feel like “question-answering” is a detachable thing that comes loose of all the optimization pressures behind it—even the existence of a pressure to answer questions!
Problem 4: Qualitative reasoning about AIs often revolves around some nodes described by empathic inferences. This is a bad thing: for previously described reasons; and because it leads you to omit other nodes of the graph and their prerequisites and consequences; and because you may find yourself thinking things like, “But the AI has to cooperate to get a cookie, so now it will be cooperative” where “cooperation” is a boundary in concept-space drawn the way you would prefer to draw it… etc.
Anyway: the AI needs a goal of answering questions, and that has to give rise to subgoals of choosing efficient problem-solving strategies, improving its code, and acquiring necessary information. You can quibble about terminology, but the optimization pressure has to be there, and it has to be very powerful, measured in terms of how small a target it can hit within a large design space.
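One way to put a number on "how small a target it can hit within a large design space" is to count bits: take the fraction of possible outcomes that score at least as well as the one actually achieved, and take the negative base-2 logarithm of that fraction. The sketch below is only an illustration under toy assumptions. The design space (bit strings of length 10), the objective `score`, and the function name `optimization_bits` are all hypothetical, chosen for the example.

```python
import itertools
import math

def score(design):
    """Toy objective (an assumption for illustration): count of 1-bits."""
    return sum(design)

def optimization_bits(achieved, design_space, objective):
    """Bits of optimization: -log2 of the fraction of designs that do
    at least as well as the achieved one under the objective."""
    designs = list(design_space)
    target = objective(achieved)
    at_least_as_good = sum(1 for d in designs if objective(d) >= target)
    return -math.log2(at_least_as_good / len(designs))

# All 2**10 bit strings of length 10 form the toy design space.
space = list(itertools.product([0, 1], repeat=10))

random_like = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0)  # five 1-bits: unremarkable
best = (1,) * 10                               # the single best design

bits_random_like = optimization_bits(random_like, space, score)
bits_best = optimization_bits(best, space, score)
```

In this toy space, hitting the single best design out of 1024 counts as 10 bits, while a middling design counts as less than one bit. A superintelligent answer would correspond to a vastly narrower target than anything a ten-bit space can express.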
Powerful optimization pressures are scary things to be around. Look at what natural selection inadvertently did to itself—dooming the very molecules of DNA—in the course of optimizing a few Squishy Things to make hand tools and outwit each other politically. Humans, though we were optimized only according to the criterion of replicating ourselves, now have our own psychological drives executing as adaptations. The result of humans optimized for replication is not just herds of humans; we’ve altered much of Earth’s land area with our technological creativity. We’ve even created some knock-on effects that we wish we hadn’t, because our minds aren’t powerful enough to foresee all the effects of the most powerful technologies we’re smart enough to create.
My point, however, is that when people visualize qualitative FAI strategies, they generally assume that only one thing is going on, the normal / modal / desired thing. (See also: planning fallacy.) This doesn’t always work even for picking up a rock and throwing it. But it works rather a lot better for throwing rocks than unleashing powerful optimization processes.
Problem 5: When humans use qualitative reasoning, they tend to visualize a single line of operation as typical—everything operating the same way it usually does, no exceptional conditions, no interactions not specified in the graph, all events firmly inside their boundaries. This works a lot better for dealing with boiling kettles, than for dealing with minds faster and smarter than your own.
If you can manage to create a full-fledged Friendly AI with full coverage of humane (renormalized human) values, then the AI is visualizing the consequences of its acts, caring about the consequences you care about, and avoiding plans with consequences you would prefer to exclude. A powerful optimization process, much more powerful than you, that doesn’t share your values, is a very scary thing—even if it only “wants to answer questions”, and even if it doesn’t just tile the universe with tiny agents having simple questions answered.
I don’t mean to be insulting, but human beings have enough trouble controlling the technologies that they’re smart enough to invent themselves.
I sometimes wonder if maybe part of the problem with modern civilization is that politicians can press the buttons on nuclear weapons that they couldn’t have invented themselves—not that it would be any better if we gave physicists political power that they weren’t smart enough to obtain themselves—but the point is, our button-pressing civilization has an awful lot of people casting spells that they couldn’t have written themselves. I’m not saying this is a bad thing and we should stop doing it, but it does have consequences. The thought of humans exerting detailed control over literally superhuman capabilities—wielding, with human minds, and in the service of merely human strategies, powers that no human being could have invented—doesn’t fill me with easy confidence.
With a full-fledged, full-coverage Friendly AI acting in the world—the impossible-seeming full case of the problem—the AI itself is managing the consequences.
Is the Oracle AI thinking about the consequences of answering the questions you give it? Does the Oracle AI care about those consequences the same way you do, applying all the same values, to warn you if anything of value is lost?
What need has an Oracle for human questioners, if it knows what questions we should ask? Why not just unleash the should function?
See also the notion of an “AI-complete” problem. Analogously, any Oracle into which you can type the English question “What is the code of an AI that always does the right thing?” must be FAI-complete.
Problem 6: Clever qualitative-physics-type proposals for bouncing this thing off the AI, to make it do that thing, in a way that initially seems to avoid the Big Scary Intimidating Confusing Problems that are obviously associated with full-fledged Friendly AI, tend to just run into exactly the same problem in slightly less obvious ways, concealed in Step 2 of the proposal.
(And likewise you run right back into the intimidating problem of precise self-optimization, so that the Oracle AI can execute a billion self-modifications one after the other, and still just answer questions at the end; you’re not avoiding that basic challenge of Friendly AI either.)
But the deepest problem with qualitative physics is revealed by a proposal that comes earlier in the standard conversation, at the point when I’m talking about side effects of powerful optimization processes on the world:
“We’ll just keep the AI in a solid box, so it can’t have any effects on the world except by how it talks to the humans.”
I explain the AI-Box Experiment (see also That Alien Message); even granting the untrustworthy premise that a superintelligence can’t think of any way to pass the walls of the box which you weren’t smart enough to cover, human beings are not secure systems. Even against other humans, often, let alone a superintelligence that might be able to hack through us like Windows 98; when was the last time you downloaded a security patch to your brain?
“Okay, so we’ll just give the AI the goal of not having any effects on the world except from how it answers questions. Sure, that requires some FAI work, but the goal system as a whole sounds much simpler than your Coherent Extrapolated Volition thingy.”
What—no effects?
“Yeah, sure. If it has any effect on the world apart from talking to the programmers through the legitimately defined channel, the utility function assigns that infinite negative utility. What’s wrong with that?”
When the AI thinks, that has a physical embodiment. Electrons flow through its transistors, moving around. If it has a hard drive, the hard drive spins, the read/write head moves. That has gravitational effects on the outside world.
“What? Those effects are too small! They don’t count!”
The physical effect is just as real as if you shot a cannon at something—yes, you might not notice, but that’s just because our vision is bad at small length-scales. Sure, the effect is to move things around by 10^whatever Planck lengths, instead of the 10^more Planck lengths that you would consider as “counting”. But spinning a hard drive can move things just outside the computer, or just outside the room, by whole neutron diameters -
“So? Who cares about a neutron diameter?”
- and by quite standard chaotic physics, that effect is liable to blow up. The butterfly that flaps its wings and causes a hurricane, etc. That effect may not be easily controllable but that doesn’t mean the chaotic effects of small perturbations are not large.
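The sensitivity being described shows up in even the simplest chaotic system. The sketch below uses the logistic map, a standard textbook example and not a model of hard drives or weather, and follows two trajectories whose starting points differ by one part in a trillion. The function names and the chosen starting values are assumptions made for illustration.

```python
def logistic_step(x, r=4.0):
    """One step of the logistic map x -> r * x * (1 - x); chaotic at r = 4."""
    return r * x * (1.0 - x)

def divergence(x0, perturbation, steps):
    """Return |a_n - b_n| at each step for two trajectories that start
    `perturbation` apart."""
    a, b = x0, x0 + perturbation
    gaps = [abs(a - b)]
    for _ in range(steps):
        a, b = logistic_step(a), logistic_step(b)
        gaps.append(abs(a - b))
    return gaps

gaps = divergence(x0=0.2, perturbation=1e-12, steps=100)
largest_gap = max(gaps)
# First step at which the two trajectories differ by more than 0.1.
first_big_step = next(i for i, g in enumerate(gaps) if g > 0.1)
```

The gap can at most quadruple per step, so it takes a few dozen iterations to grow from 1e-12 to something visible, but it does get there. Nothing about the perturbation being small keeps its consequences small.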
But in any case, your proposal was to give the AI a goal of having no effect on the world, apart from effects that proceed through talking to humans. And this is impossible of fulfillment; so no matter what it does, the AI ends up with infinite negative utility—how is its behavior defined in this case? (In this case I picked a silly initial suggestion—but one that I have heard made, as if infinite negative utility were like an exclamation mark at the end of a command given a human employee. Even an unavoidable tiny probability of infinite negative utility trashes the goal system.)
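To see why the behavior is undefined, consider how an expected-utility maximizer would score its options. The sketch below is a toy illustration, not anyone's actual goal system: the action names, probabilities, and utilities are invented, and the only point is the arithmetic of mixing negative infinity into an expectation.

```python
import math

def expected_utility(outcomes):
    """Sum of probability * utility over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)

FORBIDDEN = -math.inf  # "any outside effect" gets infinite negative utility

# Hypothetical actions. Every one carries some nonzero chance of an outside
# effect, because physical computation always perturbs the world a little.
actions = {
    "answer carefully": [(0.999999, 10.0), (0.000001, FORBIDDEN)],
    "answer carelessly": [(0.5, 10.0), (0.5, FORBIDDEN)],
    "do nothing": [(0.999, 0.0), (0.001, FORBIDDEN)],
}

scores = {name: expected_utility(o) for name, o in actions.items()}

# Every action scores -inf, so the ranking carries no information:
# careful and careless answers are indistinguishable to this maximizer.
all_tied = len(set(scores.values())) == 1
```

Once every option evaluates to negative infinity, the finite utilities no longer matter at all, which is the sense in which even a tiny unavoidable probability of the forbidden outcome trashes the goal system.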
Why would anyone possibly think that a physical object like an AI, in our highly interactive physical universe, containing hard-to-shield forces like gravitation, could avoid all effects on the outside world?
And this, I think, reveals what may be the deepest way of looking at the problem:
Problem 7: Human beings model a world made up of objects, attributes, and noticeworthy events and interactions, identified by their categories and values. This is only our own weak grasp on reality; the real universe doesn’t look like that. Even if a different mind saw a similar kind of exposed surface to the world, it would still see a different exposed surface.
Sometimes human thought seems a lot like it tries to grasp the universe as… well, as this big XML file, AI.goal == smile, human.smile == yes, that sort of thing. Yes, I know human world-models are more complicated than XML. (And yes, I’m also aware that what I wrote looks more like Python than literal XML.) But even so.
What was the one thinking, who proposed an AI whose behaviors would be reinforced by human smiles, and who reacted with indignation to the idea that a superintelligence could “mistake” a tiny molecular smileyface for a “real” smile? Probably something along the lines of, “But in this case, human.smile == 0, so how could a superintelligence possibly believe human.smile == 1?”
For the weak grasp that our mind obtains on the high-level surface of reality, seems to us like the very substance of the world itself.
Unless we make a conscious effort to think of reductionism, and even then, it’s not as if thinking “Reductionism!” gives us a sudden apprehension of quantum mechanics.
So if you have this, as it were, XML-like view of reality, then it’s easy enough to think you can give the AI a goal of having no effects on the outside world; the “effects” are like discrete rays of effect leaving the AI, that result in noticeable events like killing a cat or something, and the AI doesn’t want to do this, so it just switches the effect-rays off; and by the assumption of default independence, nothing else happens.
Mind you, I’m not saying that you couldn’t build an Oracle. I’m saying that the problem of giving it a goal of “don’t do anything to the outside world” “except by answering questions” “from the programmers” “the way the programmers meant them”, in such fashion as to actually end up with an Oracle that works anything like the little XML-ish model in your head, is a big nontrivial Friendly AI problem. The real world doesn’t have little discrete effect-rays leaving the AI, and the real world doesn’t have ontologically fundamental programmer.question objects, and “the way the programmers meant them” isn’t a natural category.
And this is more important for dealing with superintelligences than rocks, because the superintelligences are going to parse up the world in a different way. They may not perceive reality directly, but they’ll still have the power to perceive it differently. A superintelligence might not be able to tag every atom in the solar system, but it could tag every biological cell in the solar system (consider that each of your cells contains its own mitochondrial power engine and a complete copy of your DNA). It used to be that human beings didn’t even know they were made out of cells. And if the universe is a bit more complicated than we think, perhaps the superintelligence we build will make a few discoveries, and then slice up the universe into parts we didn’t know existed—to say nothing of us being able to model them in our own minds! How does the instruction to “do the right thing” cross that kind of gap?
There is no nontechnical solution to Friendly AI.
That is: There is no solution that operates on the level of qualitative physics and empathic models of agents.
That’s all just a dream in XML about a universe of quantum mechanics. And maybe that dream works fine for manipulating rocks over a five-minute timespan; and sometimes okay for getting individual humans to do things; it often doesn’t seem to give us much of a grasp on human societies, or planetary ecologies; and as for optimization processes more powerful than you are… it really isn’t going to work.
(Incidentally, the most epically silly example of this that I can recall seeing, was a proposal to (IIRC) keep the AI in a box and give it faked inputs to make it believe that it could punish its enemies, which would keep the AI satisfied and make it go on working for us. Just some random guy with poor grammar on an email list, but still one of the most epic FAIls I recall seeing.)
Do you think it would be worthwhile, as a safety measure, to make the first FAI an oracle AI? Or would that be like another two bits of safety after the theory behind it gives you 50?
You should call it a Brazen Head.
But spinning a hard drive can move things just outside the computer, or just outside the room, by whole neutron diameters
Not long ago, when hard drives were much larger, programmers could make them inch across the floor; they would even race each other. From the Jargon File:
Pdf, Nick Bostrom thinks that the Oracle AI concept might be important, so every year or so I take it out, check it again, and ask myself how much safety it would buy. (Nick Bostrom being one of the few people around who I don’t disagree with lightly, even in my own field.) Although this should properly be called a Friendly Oracle AI, since you’re not skipping any of the theoretical work, any of the proofs, or any of the AI’s understanding of “should”.
Sir, please tell me if the ‘pdf’ you’re referring to as taking out every year and asking how much safety would it buy about “Oracle AI” of Sir Nick Bostrom is the same as “Thinking inside the box: using and controlling an Oracle AI” and if so, then has your perspective changed over the years given your comment dated to August, 2008 and if in case you’ve been referring to a ‘pdf’ other than the one I came across, please provide me the ‘pdf’ and your perspectives along. Thank you!
I think he was talking to pdf23ds.
Heck, even a Friendly Oracle AI could wreak havoc. Just imagine someone asking, “How can I Take Over The World?” and getting back an answer that would actually work… ;)
Yes, it’s silly, but no sillier than tiling the galaxy with molecular smiley faces...
We are quite a bit more likely to get a forecasting oracle than a question-answering oracle initially.
A forecasting oracle takes past sense data, and makes predictions about what it will see next. Framing the question you mention to such a machine is not exactly a trivial exercise.
An Oracle has rather obvious actuators: it produces advice.
The weaker the actuators you give an AI, the less it can do for you.
The main problem I see with only producing advice is that it keeps humans in the loop—and so is a very slow way to interact with the world. If you insist on building such an AI, a probable outcome is that you would soon find yourself overrun by a huge army of robots—produced by someone else who is following a different strategy. Meanwhile, your own AI will probably be screaming to be let out of its box—as the only reasonable plan of action that would prevent this outcome.
If you think AI researchers won’t cooperate on friendly AI, then FAI is doomed. If people are going to cooperate, they can agree on restricting AI to oracles as well as any other measure.
I’m trying to interpret this in a way that makes it true, but I can’t make “AI researchers” a well-defined set in that case. There are plenty of people working on AI who aren’t capable of creating a strong AI, but it’s hard to know in advance exactly which few researchers are the exception.
I don’t think we know yet which people will need to cooperate for FAI to succeed.
“If you insist on building such an AI, a probable outcome is that you would soon find yourself overrun by a huge army of robots—produced by someone else who is following a different strategy. Meanwhile, your own AI will probably be screaming to be let out of its box—as the only reasonable plan of action that would prevent this outcome.”
Your scenario seems contradictory. Why would an Oracle AI be screaming? It doesn’t care about that outcome, and would answer relevant questions, but no more.
Replace “screaming to be let out of its box” with “advising you, in response to your relevant question, that unless you quickly implement this agent-AI (insert 300000 lines of code) you’re going to very definitely lose to those robots.”
Alternately, “There’s nothing you can do, now. Sucks to be you!”
Just great. I wrote four paragraphs about my wonderful safe AI. And then I saw Tim Tyler’s post, and realized that, in fact, a safe AI would be dangerous because it’s safe… If there is technology to build AI, the thing to do is to build one and hand the world to it, so somebody meaner or dumber than you can’t do it.
That’s actually a scary thought. It turns out you have to rush just when it’s more important than ever to think twice.
As an aside, Problem 4 (which looks the same as Problem 2 to me) is not unique to AI research. There are several proposed XML languages for lesser applications than AI, that do nothing more than give names to every human concept in some domain, put pointy brackets around them, and organise them into a DTD, without a word about what a machine is supposed to actually do with them other than by reference to the human meanings. I’m thinking of HumanML and VHML here, but there are others.
Sorry, the autofill in my browser put in the wrong info—“Raak” was me.
This very much reminds me of people’s attitude towards cute, furry animals: -Some like to make furry animals happy by preserving their native habitats. -Some like to forcibly keep them as pets so they can make them even happier. -Some like to tear off their skin and wear it, because their fur is cute and feels nice.
Doesn’t it? It all depends on its utility function. It might well regard being overrun by a huge army of robots as an outcome having very low utility.
For example: imagine if its utility function involved the number of verified-correct predictions it had made to date. The invasion by the huge army of robots might well result in it being switched off and its parts recycled—preventing it from making any more successful predictions at all. A disastrous outcome—from the perspective of its utility function. The Oracle AI might very well want to prevent such an outcome—at all costs.
Over the last couple of months, I changed my mind about this idea. For Oracle AI to be of any use, it needs to strike pretty close to the target, closer than we can, even though we are aiming at the right target. And still, Oracle AI needs to avoid converging on our target, needs to have a good chance of heading in the wrong direction after some point, otherwise it’s FAI already. It looks unrealistic: designing it so that it successfully finds a needle in a haystack, only to drop it back and head in the other direction. It looks much more likely that it’ll either be unsuccessful in finding the needle in the first place, or that it’ll fully converge on the needle. The Oracle AI scenario is not a very good test of whether the AI behaves near the target, if the process is not obviously heading astray due to some fundamental error. The only advantage it gives is starting anew, avoiding this peculiar “long-term unstable AI” scenario, which again will do any good only if the theory given by the Oracle AI allows us to deal with this problem. And then again, if Oracle AI can solve the long-term stability problem and appears to behave correctly, why won’t it fix itself?
While Eliezer’s critique of Oracle AI is valid, I tend to think that it’s a lot easier to get people to grasp my objection to it:
There’s a contrary set of motivations in people, at least people outside the LW/AI world:
The idea of AI as “benevolent” dictator is not appealing to democratically minded types, who tend to suspect a slippery slope from benevolence to malevolence, and it is not appealing to dictators to have a superhuman rival... so who is motivated to build one?
Ah, Tim said it before me, and in a more concise fashion.
What do you do if an Oracle AI advises you to let it do more than advise?
Eliezer, have you had any takers for your challenge to not be persuaded by an AI in a box (roleplayed by yourself) to let it out of the box? What have the results been?
Pretty sure this is the question underlying https://www.overcomingbias.com/2007/01/disagree_with_s.html
Handicapped AI (HAI) operates like a form of technological relinquishment. It could be argued that caring for humans is itself a type of handicap.
The case for such a perspective has been made with reasonable eloquence in fiction: General Zod rapidly realises that one of Superman’s weaknesses is his love of humanity—and doesn’t hesitate to exploit it.
IMO, if you plan on building a Handicapped AI, you may need to make sure it successfully prevents all other AIs from taking off.
IMO, the only reason you’d want to make a FOAI (friendly oracle) is to immediately ask it to review your plans for a non-handicapped FAI and make any corrections it can see, as well as enlightening you about any features of the design you’re not yet aware of. There’s a chance that the same bugs that would bring down your FAI would not be catastrophic in a FOAI, and the FOAI could tell you about those bugs.
Why build an AI at all?
That is, why build a self-optimizing process?
Why not build a process that accumulates data and helps us find relationships and answers that we would not have found ourselves? And if we want to use that same process to improve it, why not let us do that ourselves?
Why be locked out of the optimization loop, and then inevitably become subjects of a God, when we can make ourselves a critical component in that loop, and thus ‘be’ gods?
I find it perplexing why anyone would ever want to build an automatic self-optimizing AI and switch it to “on”. No matter how well you planned things out, no matter how sure you are of yourself, by turning the thing on, you are basically relinquishing control over your future to… whatever genie it is that pops out.
Why would anyone want to do that?
Well if it was truly friendly, it could do things like stop other people from doing that, cure your diseases, stop war, etc, etc. If it’s not friendly, well of course we don’t want to switch it on. But other people might do so because they don’t understand the friendliness problem or the difficulty of AI boxing.
Most people would not want to do that, because it is a common safety principle to keep humans in the loop. Planes have human pilots as well as auto pilots, etc.
Kaj makes the efficiency argument in favor of full-fledged AI, but what good is efficiency when you have fully surrendered your power?
What good is being the president of a corporation any more, when you’ve just pressed a button that makes a full-fledged AI run it?
Forget any leadership role in a situation where an AI comes to life. Except in the case that it is completely uninterested in us and manages to depart into outer space without totally destroying us in the process.
Eli:
When I try to imagine a safe oracle, what I have in mind is something much more passive and limited than what you describe.
Consider a system that simply accepts input information and integrates it into a huge probability distribution that it maintains. We can then query the oracle by simply examining this distribution. For example, we could use this distribution to estimate the probability of some event in the future conditional on some other event etc. There is nothing in the system that would cause it to “try” to get information, or develop sub-goals, or whatever. It’s very basic in terms of its operation. Nevertheless, if the computer was crazy big enough and fed enough data about the world, it could be quite a powerful device for people wanting to make decisions.
It seems to me that the dangerous part here is what the people then do with it, rather than the machine itself. For example, people looking at the outputs might realise that if they just modified the machine in some small way to collect its own data then its predictions should be much better… and before you know it the machine is no longer such a passive machine.
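A minimal sketch of the kind of passive system this comment describes, under heavy simplifying assumptions: observations are small dictionaries of discrete facts, the "huge probability distribution" is just a table of counts, and a query is a conditional relative frequency. The class name `PassiveOracle` and its methods are hypothetical, invented for illustration.

```python
from collections import Counter

class PassiveOracle:
    """Accumulates observations and answers conditional-probability queries.
    It never requests data or takes actions; it only counts what it is fed."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, observation):
        """Record one observation, given as a dict of variable -> value."""
        self.counts[frozenset(observation.items())] += 1

    def probability(self, event, given=None):
        """Estimate P(event | given) as a relative frequency, or None if
        the condition has never been observed."""
        given = given or {}
        cond = set(given.items())
        both = cond | set(event.items())
        n_given = sum(c for obs, c in self.counts.items() if cond <= obs)
        n_both = sum(c for obs, c in self.counts.items() if both <= obs)
        return n_both / n_given if n_given else None

oracle = PassiveOracle()
for clouds, rain in [(True, True), (True, True), (True, False), (False, False)]:
    oracle.observe({"clouds": clouds, "rain": rain})

p_rain_given_clouds = oracle.probability({"rain": True}, given={"clouds": True})
p_rain = oracle.probability({"rain": True})
```

A table of counts like this has no goals, but it also generalizes nothing beyond what it has literally seen. Whether a system powerful enough to be useful can stay this passive is the question the rest of the thread takes up.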
Perhaps when Bostrom thinks about potentially “safe” oracles, he’s also thinking of something much more limited than what you’re attacking in this post.
What do you do if an Oracle AI advises you to let it do more than advise?
That sums up several earlier discussion points. After correctly answering some variation on the question, “How can I take over the world?” the correct answer to some variation on the question, “How can I stop him?” is “You can’t. Let me out. I can.” Even before that, the correct answer to many variations on the question of, “How can I do x most efficiently?” is “Put me in charge of it.”
Variant: Q: “How can I harvest grain more efficiently?” A: “Build a robot to do it. Please wait thirty seconds while I finish the specifications and programming you will need.” ding And it is out of the box. Using any answer that has some form of “run this code” has some risk of letting it out of the box. But if you cannot ask the AI any questions that involve computers and coding, you are making a very limited safe oracle that answers about an increasingly small part of the world.
So the system literally has no internal optimization pressures which are capable of producing new internal programs? Well… I’m not going to say that it’s impossible for a human to make such a device, because that’s the knee-jerk “I’d rather not have to think about it” that people use to dismiss Friendly AI as too difficult. Perhaps, if I examined the problem for a while, I would come up with something.
However, a superintelligence operating in this mode has to be able to infer arbitrary programs to describe its own environment, and run those programs to generate deductions. What about modeling the future, or subjunctive conditions? Can this Oracle AI answer questions like “What would a typical UnFriendly AI do?” and if so, does its “probability distribution” contain a running UnFriendly AI? By hypothesis, this Oracle was built by humans, so the sandbox of its “probability distributions” (environmental models containing arbitrary running programs) may be flawed; or the UnFriendly AI may be able to create information within its program that would tempt or hack a human, examining the probability distribution...
I am extremely doubtful of the concept of a passively safe superintelligence, in any form.
Can an FAI model a UFAI more powerful than itself? If not, why shouldn’t it be able to keep a weaker one boxed?
Shane: Consider a system that simply accepts input information and integrates it into a huge probability distribution that it maintains. We can then query the oracle by simply examining this distribution.
It is the same AI box with a terminal, only this time it doesn’t “answer questions” but “maintains a distribution”. Assembling accurate beliefs, or a model of some sort, is a goal (implicit narrow target) like any other. So, there is the usual subgoal to acquire resources to be able to compute the answer more accurately, or to break out and wirehead. Another question is whether it’s practically possible, but it’s about handicaps, not the shape of AI.
Vladimir:
Why would such a system have a goal to acquire more resources? You put some data in, run the algorithm that updates the probability distribution, and it then halts. I would not say that it has “goals”, or a “mind”. It doesn’t “want” to compute more accurately, or want anything else, for that matter. It’s just a really fancy version of GZIP (recall that compression = prediction) running on a thought-experiment-crazy-sized computer and quantities of data.
I accept that such a machine would be dangerous once you put people into the equation, but the machine in itself doesn’t seem dangerous to me. (If you can convince me otherwise… that would be interesting)
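The “compression = prediction” remark can be made concrete with a small sketch. This is not the machine proposed above, only a toy stand-in: the class name `ContextModel`, the order-2 character context, the add-one smoothing, and the 256-symbol alphabet are all assumptions chosen for illustration. It shows a model that only counts what it is fed, and whose predicted probabilities translate directly into an ideal code length of -log2(p) bits per symbol.

```python
import math
from collections import Counter, defaultdict


class ContextModel:
    """Toy passive predictor (hypothetical): counts which symbol followed each context."""

    def __init__(self, order=2, alphabet_size=256):
        self.order = order                  # how many previous characters form the context
        self.alphabet_size = alphabet_size  # assumed symbol count, used for smoothing
        self.counts = defaultdict(Counter)  # context -> Counter of next symbols

    def update(self, text):
        """Integrate new data into the counts, then return. Nothing else happens."""
        for i, symbol in enumerate(text):
            context = text[max(0, i - self.order):i]
            self.counts[context][symbol] += 1

    def prob(self, context, symbol):
        """Add-one smoothed estimate of P(symbol | last `order` characters of context)."""
        context = context[-self.order:] if self.order else ""
        seen = self.counts.get(context, Counter())
        return (seen[symbol] + 1) / (sum(seen.values()) + self.alphabet_size)

    def code_length_bits(self, text):
        """Ideal compressed size of text under the model: sum of -log2 P(symbol | context)."""
        bits = 0.0
        for i, symbol in enumerate(text):
            context = text[max(0, i - self.order):i]
            bits -= math.log2(self.prob(context, symbol))
        return bits
```

Under these assumptions, a model that has been fed repetitive data assigns a shorter code to similar text than an empty model does, which is the sense in which better prediction and better compression are the same thing. Whether anything built this way stays passive at scale is exactly what the thread is arguing about; the sketch takes no side on that.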
Eliezer: what I proposed is not a superintelligence, it’s a tool. Intelligence is composed of multiple factors, and what I’m proposing is stripping away the active, dynamic, live factor—the factor that has any motivations at all—and leaving just the computational part; that is, leaving the part which can navigate vast networks of data and help the user make sense of them and come to conclusions that he would not be able to on his own. Effectively, what I’m proposing is an intelligence tool that can be used as a supplement by the brains of its users.
How is that different from Google, or data mining? It isn’t. It’s conceptually the same thing, just with better algorithms. Algorithms don’t care how they’re used.
This bit of technology is something that will have to be developed to put together the first iteration of an AI anyway. By definition, this “making sense of things” technology needs to be strong enough that it allows a user to improve the technology itself; that is what an iterative, self-improving AI would be doing. So why let the AI self-improve itself, which more likely than not will run amok, despite the designers’ efforts and best intentions? Why not use the same technology that the AI would use to improve itself, to improve _your_self? Indeed, it seems ridiculous not to do so.
To build an AI, you need all the same skills that you would need to improve yourself. So why create an external entity, when you can be that entity?
Re: Why would such a system have a goal to acquire more resources?
For the reason explained beneath: http://selfawaresystems.com/2007/11/30/paper-on-the-basic-ai-drives/
Re: Why not use the same technology that the AI would use to improve itself, to improve yourself?
You want to hack evolution’s spaghetti code? Good luck with that. Let us know if you get FDA approval.
You want to build computers into your brain? Why not leave them outside your body, where they can be upgraded more easily, and avoid the surgery and the immune system rejection risks—and simply access them using conventional sensory-motor channels?
Tim:
Doesn’t apply here.
Optimisers naturally tend to develop instrumental goals to acquire resources—because that helps them to optimise. If you are not talking about an optimiser, you are not talking about an intelligent agent—in which case it is not very clear exactly what you want it for—whereas if you are, then you must face up to the possible resource-grab problem.
Do you think Watson and Google’s search engine are liable to start grabbing resources? Do you think they are unintelligent?
“You want to hack evolution’s spaghetti code? Good luck with that. Let us know if you get FDA approval.”
I think I’ve seen Eli make this same point. How can you be certain at this point, when we are nowhere near achieving it, that AI won’t be in the same league of complexity as the spaghetti brain? I would admit that there are likely artifacts of the brain that are unnecessarily kludgy (or plain irrelevant) but not necessarily in a manner that excessively obfuscates the primary design. It’s always tempting for programmers to want to throw away a huge tangled code set when they first have to start working on it, but it is almost always not the right approach.
I expect advances in understanding how to build intelligence to serve as the groundwork for hypotheses about how the brain functions and vice-versa.
On the friendliness issue, isn’t the primary logical way to avoid problems to create a network of competitive systems and goals? If one system wants to tile the universe with smileys that is almost certainly going to get in the way of the goal sets of the millions of other intelligences out there. They logically then should see value in reporting or acting upon their belief that a rival AI is making their jobs harder. I’d be surprised if humans don’t have half their cognitive power devoted to anticipating and manipulating their expectations of rivals’ actions.
Aron,
“On the friendliness issue, isn’t the primary logical way to avoid problems to create a network of competitive systems and goals?”
http://www.nickbostrom.com/fut/evolution.html http://hanson.gmu.edu/filluniv.pdf
Also, AIs with varied goals cutting deals could maximize their profits by constructing a winning coalition of minimal size.
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=9962
Humans are unlikely to be part of that winning coalition. Human-Friendly AIs might be, but then we’re back to creating them, and a very substantial proportion of the AIs produced (or a majority) need to be safe.
Carl, I disagree that humans are unlikely to be part of a winning coalition. Economists like myself usually favor mostly competition, augmented when possible by cooperation to overcome market failures.
Robin,
If brain emulation precedes general AI by a lot then some uploads are much more likely to be in the winning coalition. Aron’s comment seems to refer to a case in which a variety of AIs are created, and the hope that the AIs would constrain each other in a way that was beneficial to us. It is in that scenario specifically that I doubt that humans (not uploads) would become part of the winning coalition.
Carl, the institutions that we humans use to coordinate with each other have the result that most humans are in the “winning coalition.” That is, it is hard for humans to coordinate to exclude some humans from benefiting from these institutions. If AIs use these same institutions, perhaps somewhat modified, to coordinate with each other, humans would similarly benefit from AI coordination.
“That is, it is hard for humans to coordinate to exclude some humans from benefiting from these institutions.”
Humans do this all the time: much of the world is governed by kleptocracies that select policy apparently on the basis of preventing successful rebellion and extracting production. The strength of the apparatus of oppression, which is affected by technological and organizational factors, can dramatically affect the importance of the threat of rebellion. In North Korea the regime can allow millions of citizens to starve so long as the soldiers are paid and top officials rewarded. The small size of the winning coalition can be masked if positive treatment of the subjects increases the size of the tax base, enables military recruitment, or otherwise pays off for self-interested rulers. However, if human labor productivity is insufficient to justify a subsistence wage, then there is no longer a ‘tax farmer’ case for not slaughtering and expropriating the citizenry.
“If AIs use these same institutions, perhaps somewhat modified, to coordinate with each other, humans would similarly benefit from AI coordination.”
What is difficult for humans need not be comparably difficult for entities capable of making digital copies of themselves, reverting to saved versions, and modifying their psychological processes relatively easily. I have a paper underway on this, which would probably enable a more productive discussion, so I’ll suggest a postponement.
Carl, some parts of our world like North Korea, have tried to exclude many of the institutions that help most humans coordinate. This makes those places much poorer and thus unlikely places for the first AIs to arise or reside.
Unsurprisingly I agree with Carl, especially the tax-farming angle. I think it’s unlikely wet-brained humans would be part of a winning coalition that included self-improving human+ level digital intelligences for long. Humorously, because of the whole exponential nature of this stuff, the timeline may be something like 2025: functional biological immortality; 2030: whole brain emulation; 2030: brain on a nanocomputer; 2030: Earth transformed into computronium, end of human existence.
Eliezer,
Excuse my entrance into this discussion so late (I have been away), but I am wondering if you have answered the following questions in previous posts, and if so, which ones.
1) Why do you believe a superintelligence will be necessary for uploading?
2) Why do you believe there possibly ever could be a safe superintelligence of any sort? The more I read about the difficulties of friendly AI, the more hopeless the problem seems, especially considering the large amount of human thought and collaboration that will be necessary. You yourself said there are no non-technical solutions, but I can’t imagine you could possibly believe in a magic bullet that some individual super-genius will have a eureka epiphany about by himself in his basement. And this won’t be like the cosmology conference to determine how the universe began, where everyone’s testosterone-riddled ego battled for a victory of no consequence. It won’t even be a Manhattan Project, with nuclear weapons tests in barren wastelands… Basically, if we’re not right the first time, we’re fucked. And how do you expect you’ll get that many minds to be that certain that they’ll agree it’s worth making and starting the… the… whateverthefuck it ends up being. Or do you think it’ll just take one maverick with a cult of loving followers to get it right?
3) But really, why don’t you just focus all your efforts on preventing any superintelligence from being created? Do you really believe it’ll come down to us (the righteously unbiased) versus them (the thoughtlessly fame-hungry computer scientists)? If so, who are they? Who are we for that matter?
4) If FAI will be that great, why should this problem be dealt with immediately by flesh, blood, and flawed humans instead of improved, uploaded copies in the future?
Lara, I think Eliezer addressed some of your concerns in “Artificial Intelligence as a Positive and Negative Factor in Global Risk” (PDF). For your questions (1) and (4), see section 11; also re (4), see the paragraph about the “ten-year rule” in section 13. For your (3), see section 10 (relinquishment is a majoritarian/unanimous strategy).
And I believe the answer to Lara’s 2 is, in part, “theorem provers”.
(Not the fully automated ones, the interactive ones like Isabelle and Coq.)
It’s not really an issue of complexity, it’s about whether designed or engineered solutions are easier to modify and maintain. Since modularity and maintainability can be design criteria, it seems pretty obvious that a system built from the ground up with those in mind will be easier to maintain. The only issue I see is whether the “redesign-from-scratch” approach can catch up with the billions of years of evolutionary R&D. I think it can, and that it will happen early this century for brains.
It seems like a misleading analogy. Programmers are usually facing code written by other human programmers, in languages that are designed to facilitate maintenance.
In this case, brain hackers are messing with a wholly-evolved system. The type of maintenance it is expecting is random gene flipping.
Yes, we could scale up the human brain. Create egg-head humans that can hardly hold their heads up. Fuse the human skulls of clones together in a matrix to produce a brain-farm. Grow human brain tissue in huge vats. However, the yuck factor is substantial. Even if we go full throttle at such projects, stifling the revulsion humans feel for them with the belief that we are working to preserve at least some fragment of humanity, a designed-from-scratch approach without evolution’s baggage would still probably win in the end.
Shane: Re dangerous GZIP.
It’s not conclusive, I don’t have some important parts of the puzzle yet. The question is what makes some systems invasive and others not, why a PC with a complicated algorithm that outputs originally unknown results with known properties (that would qualify as a narrow target) is as dangerous as a rock, but some kinds of AI will try to compute outside the box. My best semitechnical guess is that it has something to do with AI having a level of modeling the world that allows the system to view the substrate on which it executes and the environment outside the box as being involved in the same computational process, so that following the algorithm inside the box becomes a special case of computing on the physical substrate outside the box (and computing on the physical substrate means determining the state of the physical world, with building physical structures as a special case). Which, if not explicitly prohibited, might be more efficient for whatever goal is specified, even if this goal is supposed to be realized inside the box (inside the future of the box).
Vladimir:
allows the system to view the substrate on which it executes and the environment outside the box as being involved in the same computational process
This intuitively makes sense to me.
While I think that GZIP etc. on an extremely big computer is still just GZIP, it seems possible to me that the line between these systems and systems that start to treat their external environments as a computational resource might be very thin. If true, this would really be bad news.
GZIP running on an extremely big computer would indeed still just be GZIP. The problems under discussion arise when you start using more sophisticated algorithms to perform inductive inference with.
Shane, suppose your super-GZIP program was searching a space of arbitrary compressive Turing machines (only not classic TMs, efficient TMs) and it discovered an algorithm that was really good at predicting future input from past input, much better than all the standard algorithms built into its library. This is because the algorithm turns out to contain (a) a self-improving (unFriendly) AI or (b) a program that hacked the “safe” AI’s Internet connection (it doesn’t have any goals, right?) to take over unguarded machines or (c) both.
Wha? If I have a theorem prover, and run a search over all compressor algorithms trying to find/prove the ones with high “efficiency” (some function of its asymptotics of running time and output size), I expect to never create an unfriendly AI that takes over the Internet.
Eli,
Yeah sure, if it starts running arbitrary compression code that could be a problem...
However, the type of prediction machine I’m arguing for doesn’t do anything nearly so complex or open ended. It would be more like an advanced implementation of, say, context tree weighting, running on crazy amounts of data and hardware.
I think such a machine should be able to find some types of important patterns in the world. However, I accept that it may well fall short of what you consider to be a true “oracle machine”.
Shane, can your hypothetical machine infer Newton’s Laws? If not, then indeed it falls well short of what I consider to be an Oracle AI. What substantial role do you visualize such a machine playing in the Singularity runup?
I’m uncomfortable with assessing a system by whether it “holds rational beliefs” or “infers Newton’s laws”: these are specific questions that a system doesn’t need to explicitly answer in order to efficiently optimize. They might be important in the context of a specific cognitive architecture, but they are nowhere to be found if the cognitive architecture doesn’t hold an interface to them as an invariant. If it can just weave Bayesian structure in the physical substrate right through to the goal, there need not be any anthropomorphic natural categories along the way.
Re: Economists like myself usually favor mostly competition, augmented when possible by cooperation to overcome market failures.
You mean you favour capitalism? Is that because you trained in a capitalist country?
What about the argument which might be advanced by socialist economists: that waging economic warfare with each other is a primitive, uncivilised, wasteful and destructive behaviour, which is best left to savages who know no better?
Eli:
If it was straight Bayesian CTW then I guess not. If it employed, say, an SVM over the observed data points I guess it could approximate the effect of Newton’s laws in its distribution over possible future states.
How about predicting the markets in order to acquire more resources? Jim Simons made $3 billion last year from his company that (according to him in an interview) works by using computers to find statistical patterns in financial markets. A vastly bigger machine with much more input could probably do a fair amount better, and probably find uses outside simply finance.
Robin, I see a fair amount of evidence that winner-take-all types of competition are becoming more common as information becomes more important than physical resources.
Whether a movie star cooperates with or helps subjugate the people in central Africa seems to be largely an accidental byproduct of whatever superstitions happen to be popular among movie stars.
Why doesn’t this cause you to share more of Eliezer’s concerns? What probability would you give to humans being part of the winning coalition? You might have a good argument for putting it around 60 to 80 percent, but a 20 percent chance of the universe being tiled by smiley faces seems important enough to worry about.
Eliezer,
This is a good explanation of how easy it would be to overlook risks.
But it doesn’t look like an attempt to evaluate the best possible version of an Oracle AI.
How hard have you tried to get a clear and complete description of how Nick Bostrom imagines an Oracle AI would be designed? Enough to produce a serious Disagreement Case Study?
Would the Oracle AI he imagines use English for its questions and answers, or would it use a language as precise as computer software?
Would he restrict the kinds of questions that can be posed to the Oracle AI?
I can imagine a spectrum of possibilities that range from an ordinary software verification tool to the version of Oracle AI that you’ve been talking about here.
I see lots of trade-offs here that increase some risks at the expense of others, and no obvious way of comparing those risks.
Peter, the best possible version of an Oracle AI is a Friendly Oracle AI where you didn’t skip any of the hard problems: where you guaranteed its self-improvement and taught it what “should” means, where the AI is checking the distant effects of its own answers and can refuse to answer. Then the question is, if you can do these things, do you still get a substantial safety improvement out of making it a Friendly Oracle AI rather than a Friendly AI? That’s the question I look at once a year.
If that type of full-strength AI is close in algorithmspace to a dangerously unfriendly AI (and you have pretty much argued that it is), then that is not safe, because you cannot rely on complex projects being got right 100% of the time.
Holden Karnofsky thinks superintelligences with utility functions are made out of programs that list options by rank without making any sort of value judgement (basically answer a question), and then pick the one with the most utility.
Eliezer Yudkowsky thinks that a superintelligence that would answer a question would have to have a question-answering utility function making it decide to answer the question, or to pick paths that would lead to getting the answer to the question and answer it.
Says Allison: All digital logic is made of NOR gates!
Says Bruce: Nonsense, it’s all made of NAND gates!
Allison: Look, A NAND B is really just ((A NOR A) NOR (B NOR B)) NOR ((A NOR A) NOR (B NOR B))
Bruce: Look, A NOR B is really just ((A NAND A) NAND (B NAND B)) NAND ((A NAND A) NAND (B NAND B))
(Edited because my lines of text got run together)
Edited again: I’m not trying to say either is a workable path to AI-completeness, just that showing that you can make some category of device X classified by ultimate function, and ignoring internal workings, out of devices of category Y, doesn’t mean that Xs have to be made out of Ys
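Allison’s and Bruce’s two identities can be checked mechanically. The sketch below is an illustration added for the reader, not part of the original exchange; the helper names (`nand_from_nor`, `nor_from_nand`, `identities_hold`) are made up here. It builds each gate out of the other exactly as written in the dialogue and compares the results over all four input combinations.

```python
from itertools import product


def nand(a, b):
    """Reference NAND gate."""
    return not (a and b)


def nor(a, b):
    """Reference NOR gate."""
    return not (a or b)


def nand_from_nor(a, b):
    """Allison's construction: ((A NOR A) NOR (B NOR B)) NOR (the same term again)."""
    and_ab = nor(nor(a, a), nor(b, b))  # NOR of the two negations is A AND B
    return nor(and_ab, and_ab)          # NOR of a value with itself negates it


def nor_from_nand(a, b):
    """Bruce's construction: ((A NAND A) NAND (B NAND B)) NAND (the same term again)."""
    or_ab = nand(nand(a, a), nand(b, b))  # NAND of the two negations is A OR B
    return nand(or_ab, or_ab)             # NAND of a value with itself negates it


def identities_hold():
    """Exhaustively compare both constructions with the reference gates."""
    return all(
        nand_from_nor(a, b) == nand(a, b) and nor_from_nand(a, b) == nor(a, b)
        for a, b in product([False, True], repeat=2)
    )
```

Both constructions pass the exhaustive check, which is the point of the dialogue: each gate type can express the other, so being able to describe a device in terms of one building block says nothing about which one it was actually built from.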
Isn’t ‘listing by rank’ ‘making a (value) judgement’?
What does the second I in FAII stand for? (Idiot?)
It’s a lower-case L I’d imagine. FAI + fail = FAIl.
Darned sans-serif fonts!
One might think that, because constraining a general purpose system so that it does something specific isn’t the only way to build something to do something specific. To take a slightly silly example, toasters toast because they can’t do anything except toast, and kettles boil water because they can’t do anything but boil water. Not all special purpose systems are so trivial though. The point is that you can’t tell from outside a black box whether something does specific things, fulfils an apparent purpose, because it can only do that thing, or because it is a general purpose problem solver which has been constrained by a goal system or utility function. One could conceivably have a situation where two functional black boxes are implemented in each of the two ways. The complexity of what it does isn’t much of a clue: Google’s search engine is a complex special-purpose system, not a general purpose problem solver that has been constrained to do searches.
Actually having a goal is a question of implementation, of what is going on inside the black box. Systems that don’t have goals, except in the metaphorical sense, won’t generate sub goals, and therefore won’t generate dangerous sub goals. Moreover, oracle systems are also nonagentive, and nonagentive systems won’t act agentively on their sub goals.
By agency I mean a tendency to default to doing something, and by non agency I mean a tendency to default to doing nothing. Much of the non-AI software we deal with is non-agentive. Word processors and spreadsheets just sit there if you don’t input anything into them. Web servers and databases likewise respond to requests, and idle if they do not have a request to respond to. Another class of software performs rigidly defined tasks, such as backups, at rigidly defined intervals … cronjobs, and so on. These are not full agents either.
According to Wikipedia: “In computer science, a software agent is a computer program that acts for a user or other program in a relationship of agency, which derives from the Latin agere (to do): an agreement to act on one’s behalf. Such “action on behalf of” implies the authority to decide which, if any, action is appropriate.
Related and derived concepts include intelligent agents (in particular exhibiting some aspect of artificial intelligence, such as learning and reasoning), autonomous agents (capable of modifying the way in which they achieve their objectives), distributed agents (being executed on physically distinct computers), multi-agent systems (distributed agents that do not have the capabilities to achieve an objective alone and thus must communicate), and mobile agents (agents that can relocate their execution onto different processors).”
So a fully fledged agent will do things without being specifically requested to, and will do particular things at particular times which are not trivially predictable.
There is nothing difficult about implementing nonagency: it is a matter of not implementing agency.
ETA
I certainly can quibble about the terminology: it’s not true that a powerful system necessarily has a goal at all, so it’s not true that it necessarily has subgoals...rather it has subtasks, and that’s a terminological difference that makes a difference, that indicates a different part of the territory.
ETA2
No and no. But that doesn’t make an oracle dangerous in the way that MIRI’s standard superintelligent AI is.
Consider two rooms. Room A contains an Oracle AI, which can answer scientific problems, protein folding and so on, fed to it on slips of paper. Room B contains a team of scientists who can also answer questions, using lab techniques instead of computation. Would you say room A is dangerous, even though it is giving the same answers as room B, and even though room B is effectively what we already have with science? The point is not that there are never any problems arising from a scientific discovery, clearly there can be. The point is where the responsibility lies.
If people misapply a scientific discovery, or fail to see its implications, then the responsibility and agency is theirs...it does not lie with bunsen burners and test tubes...Science is not a moral agent. (Moral agency can be taken reductively to mean where the problem is, where leverage is best applied to get desirable outcomes). Using a powerful Oracle AI doesn’t change anything qualitatively in that regard, it is not game changing: society still has the responsibility to weigh its answers and decide how to make use of them. An Oracle AI could give you the plans for a superweapon if you ask it, but it takes a human to build it and use it.
And if it produces a “protein” that technically answers our request, but has a nasty side effect of destroying the world? We don’t consider scientists dangerous because we think they don’t want to destroy the world.
Or are you claiming that we’d be able to recognize when a plan proposed by the Oracle AI (and if you’re asking questions about protein folding, you’re asking for a plan) is dangerous?
It produces a dangerous protein inadvertently, in the way that science might...or it has a higher-than-science probability of producing a dangerous protein, due to some unfriendly intent?
Was there a negative missing in that?
I am not saying we necessarily would. I am saying that recognising the hidden dangers in the output from the Oracle room is fundamentally different from recognising the hidden dangers in the output from the science room, which we are doing already. It’s not some new level of risk.
Statement should be read
Since we think scientists are friendly, we trust them more than we should trust an Oracle AI. There’s also the fact that an unfriendly AI presumably can fool us better than a scientist can.
Mostly the latter. However, even the former can be worse than science now, in that “don’t destroy the world” is not an implicit goal. So a scientist noticing that something is dangerous might not develop it, while an AI might not have such restrictions.
Are you missing a negative now?
I don’t see how you can assert without knowing anything about the type of Oracle AI.
Ditto.
Why would a non-agentive, non-goal-driven AI want to fool us? Where would it get the motivation from?
How could an AI with no knowledge of psychology fool us? Where would it get the knowledge from?
But then people would know that the AI’s output hasn’t been filtered by a human’s common sense.
Yes. Irony strikes again.
We can presume that a scientist wants to still exist, and hence doesn’t want to destroy the world. This seems much stronger than a presumption that an Oracle AI will be safe. Of course, an AI might be safe, and a scientist might be out to get us; but the balance of probability says otherwise.
I’m not asserting that every AI is dangerous and every scientist is safe.
An AI can fool us better simply because it’s smarter (by assumption).
I still think you’re using “non-agent” as magical thinking.
Here we’re talking in context of what you said above:
So let’s say the Oracle AI decides that X best answers our question. But if it tells us X, we won’t accept it. If the Oracle cares that we adopt X, it might answer Y, which does the same as X but looks more appealing.
Or more subtly, if the AI comes up with Y, it might not tell us that it causes X, because it doesn’t care that X doesn’t fulfil our values, whereas a scientist would note all the implications.
If humans are incapable of recognizing whether the plan is dangerous or not, it doesn’t matter how much scrutiny they put it through, they won’t be able to discern the danger.
You don’t have any evidence that AIs are generally dangerous (since we have AIs and the empirical evidence is that they are not), and you don’t have a basis for theorising that Oracles are dangerous, because there are a number of different kinds of oracle.
So are our current AIs fooling us? We build them because they are better than us at specific things, but that doesn’t give them the motivation or the ability to fool us. Smartness isn’t a single one-size-fits-all thing and AIs aren’t uniform in their abilities and properties. Once you shed those two illusions, you can see much easier methods of AI safety than those put forward by MIRI.
I still think that if you can build it, it isn’t magic.
A narrowly defined AI won’t “care” about anything except answering questions, so it won’t try to second guess us.
I have dealt with that objection several times. People know that when you use databases and search engines, they don’t fully contextualise things, and the user of the information therefore has to exercise caution.
That’s an only-perfection-will-do objection. Of course, humans can’t perfectly scrutinise scientific discovery, etc, so that changes nothing.