Wow, I’m blown away by Holden Karnofsky, based on this post alone. His writing is eloquent, non-confrontational and rational. It shows that he spent a lot of time constructing mental models of his audience and anticipated its reaction. Additionally, his intelligence/ego ratio appears to be through the roof. He must have learned a lot since the infamous astroturfing incident. This is the (type of) person SI desperately needs to hire.
Emotions out of the way, it looks like the tool/agent distinction is the main theoretical issue. Fortunately, it is much easier than the general FAI one. Specifically, to test the SI assertion that, paraphrasing Arthur C. Clarke,
Any sufficiently advanced tool is indistinguishable from an agent.
one ought to formulate and prove this as a theorem, and present it for review and improvement to the domain experts (the domain being math and theoretical computer science). If such a proof is constructed, it can then be further examined and potentially tightened, giving new insights to the mission of averting the existential risk from intelligence explosion.
If such a proof cannot be found, this will lend further weight to the HK’s assertion that SI appears to be poorly qualified to address its core mission.
What exactly is the difference between a “tool” and an “agent”, if we taboo the words?
My definition would be that “agent” has their own goals / utility functions (speaking about human agents, those goals / utility functions are set by evolution), while “tool” has a goal / utility function set by someone else. This distinction may be reasonable on a human level, “human X optimizing for human X’s utility” versus “human X optimizing for human Y’s utility”, but on a machine level, what exactly is the difference between a “tool” that is ordered to reach a goal / optimize a utility function, and an “agent” programmed with the same goal / utility function?
Am I using a bad definition that misses something important? Or is there anything than prevents “agent” to be reduced to a “tool” (perhaps a misconstructed tool) of the forces that have created them? Or is it that all “agents” are “tools”, but not all “tools” are “agents”, because… why?
What exactly is the difference between a “tool” and an “agent”, if we taboo the words?
One definition of intelligence that I’ve seen thrown around on LessWrong is it’s the ability to figure out how to steer reality in specific directions given the resources available.
Both the tool and the agent are intelligent in the sense that, assuming they are given some sort of goal, they can formulate a plan on how to achieve that goal, but the agent will execute the plan, while the tool will report the plan.
I’m assuming for the sake of isolating the key difference, that for both the tool-AI and the agent-AI, they are “passively” waiting for instructions for a human before they spring into action. For an agent-AI, I might say “Take me to my house”, whereas for a tool AI, I would say “What’s the quickest route to get to my house?”, and as soon as I utter these words, suddenly the AI has a new utility function to use in evaluate any possible plan it comes up with.
Or is there anything than prevents “agent” to be reduced to a “tool” (perhaps a misconstructed tool) of the forces that have created them? Or is it that all “agents” are “tools”, but not all “tools” are “agents”, because… why?
Assuming it’s always possible to decouple “ability to come up with a plan” from both “execute the plan” and “display the plan”, then any “tool” can be converted to an “agent” by replacing every instance of “display the plan” to “execute the plan” and vice versa for converting an agent into a tool.
My understanding of the distinction made in the article was:
Both “agent” and “tool” are ways of interacting with a highly sophisticated optimization process, which takes a “goal” and applies knowledge to find ways of achieving that goal.
An agent then acts out the plan.
A tool reports the plan to a human (often in in a sophisticated way, including plan details, alternatives, etc.).
So, no, it has nothing to do with whether I’m optimizing “my own” utility vs someone else’s.
You divide planning from acting, as if those two are completely separate things. Problem is, in some situations they are not.
If you are speaking with someone, then the act of speach is acting. In this sense, even a “tool” is allowed to act. Now imagine a super-intelligent tool which is able to predict human’s reactions to its words, and make it a part of equation. Now the simple task of finding x such that cost(x) is the smallest, suddenly becomes a task of finding x and finding a proper way to report this x to human, such that cost(x) is the smallest. If this opens some creative new options, where the f(x) is smaller than it should usually be, for the super-intelligent “tool” it will be a correct solution.
So for example reporting a result which makes the human commit suicide, if as a side effect this will make the report true, and it will minimize f(x) beyond normally achievable bounds, is acceptable solution.
Example question: “How should I get rid of my disease most cheaply.” Example answer: “You won’t. You will die soon in terrible pains. This report is 99.999% reliable”. Predicted human reaction: becomes insane from horror, dedices to kill himself, does it clumsily, suffers from horrible pains, then dies. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.
To me, this is still in the spirit of an agent-type architecture. A tool-type architecture will tend to decouple the optimization of the answer given from the optimization of the way it is presented, so that the presentation does not maximize the truth of the statement.
However, I must admit that at this point I’m making a fairly conjunctive argument; IE, the more specific I get about tool/agent distinctions, the less credibility I can assign to the statement “almost all powerful AIs constructed in the near future will be tool-style systems”.
(But I still would maintain my assertion that you would have to specifically program this type of behavior if you wanted to get it.)
This is like the whole point of why LessWrong exists. To remind people that making a superintelligent tool and expecting it to magically gain human common sense is a fast way to extinction.
The superintelligent tool will care about suicide only if you program it to care about suicide. It will care about damage only if you program it to care about damage. -- If you only program it to care about answering correctly, it will answer correctly… and ignore suicide and damage as irrelevant.
If you ask your calculator how much is 2+2, the calculator answers 4 regardles of whether that answer will drive you to suicide or not. (In some contexts, it hypothetically could.) A superintelligent calculator will be able to answer more complex questions. But it will not magically start caring about things you did not program it to care about.
The “superintelligent tool” in the example you provided gave a blatantly incorrect answer by it’s own metric. If it counts suicide as a win, why did it say the disease would not be gotten rid of?
In the example the “win” could be defined as an answer which is: a) technically correct, b) relatively cheap among the technically correct answers.
This is (in my imagination) something that builders of the system could consider reasonable, if either they didn’t consider Friendliness or they believed that a “tool AI” which “only gives answers” is automatically safe.
The computer gives an answer which is technically correct (albeit a self-fulfilling prophecy) and cheap (in dollars spent for cure). For the computer, this answer is a “win”. Not because of the suicide—that part is completely irrelevant. But because of the technical correctness and cheapness.
It’s complicated. A reply that’s true enough and in the spirit of your original statement, is “Something going wrong with a sufficiently advanced AI that was intended as a ‘tool’ is mostly indistinguishable from something going wrong with a sufficiently advanced AI that was intended as an ‘agent’, because math-with-the-wrong-shape is math-with-the-wrong-shape no matter what sort of English labels like ‘tool’ or ‘agent’ you slap on it, and despite how it looks from outside using English, correctly shaping math for a ‘tool’ isn’t much easier even if it “sounds safer” in English.” That doesn’t get into the real depths of the problem, but it’s a start. I also don’t mean to completely deny the existence of a safety differential—this is a complicated discussion, not a simple one—but I do mean to imply that if Marcus Hutter designs a ‘tool’ AI, it automatically kills him just like AIXI does, and Marcus Hutter is unusually smart rather than unusually stupid but still lacks the “Most math kills you, safe math is rare and hard” outlook that is implicitly denied by the idea that once you’re trying to design a tool, safe math gets easier somehow. This is much the same problem as with the Oracle outlook—someone says something that sounds safe in English but the problem of correctly-shaped-math doesn’t get very much easier.
There is little prospect of an outcome that realizes even the value of being interesting, unless the first superintelligences undergo detailed inheritance from human values
No doubt a Martian Yudkowsy would make much the same argument—but they can’t both be right. I think that neither of them are right—and that the conclusion is groundless.
Complexity theory shows what amazing things can arise from remarkably simple rules. Values are evidently like that—since even “finding prime numbers” fills the galaxy with an amazing, nanotech-capable spacefaring civilization—and if you claim that a nanotech-capable spacefaring civilization is not “interesting” you severely need recalibrating.
I think Martian Yudkowsky is a dangerous intuition pump. We’re invited to imagine a creature just like Eliezer except green and with antennae; we naturally imagine him having values as similar to us as, say, a Star Trek alien. From there we observe the similarity of values we just pushed in, and conclude that values like “interesting” are likely to be shared across very alien creatures. Real Martian Yudkowsky is much more alien than that, and is much more likely to say
There is little prospect of an outcome that realizes even the value of being flarn, unless the first superintelligences undergo detailed inheritance from Martian values.
Imagine, an intelligence that didn’t have the universal emotion of badweather!
Of course, extraterrestrial sentients may possess physiological states corresponding to limbic-like emotions that have no direct analog in human experience. Alien species, having evolved under a different set of environmental constraints than we, also could have a different but equally adaptive emotional repertoire. For example, assume that human observers land on another and discover an intelligent animal with an acute sense of absolute humidity and absolute air pressure. For this creature, there may exist an emotional state responding to an unfavorable change in the weather. Physiologically, the emotion could be mediated by the ET equivalent of the human limbic system; it might arise following the secretion of certain strength-enhancing and libido-arousing hormones into the alien’s bloodstream in response to the perceived change in weather. Immediately our creature begins to engage in a variety of learned and socially-approved behaviors, including furious burrowing and building, smearing tree sap over its pelt, several different territorial defense ceremonies, and vigorous polygamous copulations with nearby females, apparently (to humans) for no reason at all. Would our astronauts interpret this as madness? Or love? Lust? Fear? Anger? None of these is correct, of course the alien is feeling badweather.
I suggest you guys taboo interesting, because I strongly suspect you’re using it with slightly different meanings. (And BTW, as a Martian Yudkowsky I imagine something with values at least as alien as Babyeaters’ or Superhappys’.)
It’s another discussion, really, but it sounds as though you are denying the idea of “interestingness” as a universal instrumental value—whereas I would emphasize that “interestingness” is really just our name for whether something sustains our interest or not—and ‘interest’ is a pretty basic functional property of any agent with mobile sensors. There’ll be other similarities in the area too—such as novelty-seeking. So shared common ground is only to be expected.
Anyway, I am not too wedded to Martian Yudkowsky. The problematical idea is that you could have a nanotech-capable spacefaring civilization that is not “interesting”. If such a thing isn’t “interesting” then—WTF?
So: do you really think that humans wouldn’t find a martian civilization interesting? Surely there would be many humans who would be incredibly interested.
I find Jupiter interesting. I think a paperclip maximizer (choosing a different intuition pump for the same point) could be more interesting than Jupiter, but it would generate an astronomically tiny fraction of the total potential for interestingness in this universe.
Life isn’t much of an “interestingness” maximiser. Expecting to produce more than a tiny fraction of the total potential for interestingness in this universe seems as though it would be rather unreasonable.
I agree that a paperclip maximiser would be more boring than an ordinary entropy-maximising civilization—though I don’t know by how much—probably not by a huge amount—the basic problems it faces are much the same—the paperclip maximiser just has fewer atoms to work with.
since even “finding prime numbers” fills the galaxy with an amazing, nanotech-capable spacefaring civilization
The goal “finding prime numbers” fills the galaxy with an amazing, nonotech-capable spacefaring network of computronium which finds prime numbers, not a civilization, and not interesting.
Maybe we should taboo the term interesting? My immediate reaction was that that sounded really interesting. This suggests that the term may not be a good one.
Fair enough. By “not interesting”, I meant it is not the sort of future that I want to achieve. Which is a somewhat ideosyncratic usage, but I think inline with the context.
Not just computronium—also sensors and actuators—a lot like any other cybernetic system. There would be mining, spacecraft caft, refuse collection, recycling, nanotechnology, nuclear power and advanced machine intelligence with planning, risk assessment, and so forth. You might not be interested—but lots of folk would be amazed and fascinated.
If using another creature’s values is effective at producing something “interesting”, then ‘detailed inheritance from human values’ is clearly not needed to produce this effect.
There is little prospect of an outcome that realizes even the value of being interesting, unless the first superintelligences undergo detailed inheritance from human values
and Mars Yudkowsky (MY) argues:
There is little prospect of an outcome that realizes even the value of being interesting, unless the first superintelligences undergo detailed inheritance from martian values
and that one of these things has to be incorrect? But if martian and human values are similar, then they can both be right, and if martian and human values are not similar, then they refer to different things by the word “interesting”.
In any case, I read EY’s statement as one of probability-of-working-in-the-actual-world-as-it-is, not a deep philosophical point—“this is the way that would be most likely to be successful given what we know”. In which case, we don’t have access to martian values and therefore invoking detailed inheritance from them would be unlikely to work. MY would presumably be in an analogous situation.
But if martian and human values are similar, then they can both be right
I was assuming that ‘detailed inheritance from human values’ doesn’t refer to the same thing as “detailed inheritance from martian values”.
if martian and human values are not similar, then they refer to different things by the word “interesting”.
Maybe—but humans not finding martians interesting seems contrived to me. Humans have a long history of being interested in martians—with feeble evidence of their existence.
In any case, I read EY’s statement as one of probability-of-working-in-the-actual-world-as-it-is, not a deep philosophical point—“this is the way that would be most likely to be successful given what we know”. In which case, we don’t have access to martian values and therefore invoking detailed inheritance from them would be unlikely to work
Right—so, substitute in “dolphins”, “whales”, or another advanced intelligence that actually exists.
Do you actually disagree with my original conclusion? Or is this just nit-picking?
I actually disagree that tiling the universe with prime number calculators would result in an interesting universe from my perspective (dead). I think it’s nonobvious that dolphin-CEV-AI-paradise would be human-interesting. I think it’s nonobvious that martian-CEV-AI-paradise would be human-interesting, given that these hypothetical martians diverge from humans to a significant extent.
I actually disagree that tiling the universe with prime number calculators would result in an interesting universe from my perspective (dead).
I think it’s violating the implied premises of the thought experiment to presume that the “interestingness evaluator” is dead. There’s no terribly-compelling reason to assume that—it doesn’t follow from the existence of a prime number maximizer that all humans are dead.
I may have been a little flip there.
My understanding of the thought experiment is—something extrapolates some values and maximizes them, probably using up most of the universe, probably becoming the most significant factor in the species’ future and that of all sentients, and the question is whether the result is “interesting” to us here and now, without specifying the precise way to evaluate that term. From that perspective, I’d say a vast uniform prime-number calculator, whether or not it wipes out all (other?) life, is not “interesting”, in that it’s somewhat conceptually interesting as a story but a rather dull thing to spend most of a universe on.
Today’s ecosystems maximise entropy. Maximising primeness is different, but surely not greatly more interesting—since entropy is widely regarded as being tedious and boring.
Intriguing! But even granting that, there’s a big difference between extrapolating the values of a screwed-up offshoot of an entropy-optimizing process and extrapolating the value of “maximize entropy”. Or do you suspect that a FOOMing AI would be much less powerful and more prone to interesting errors than Eliezer believes?
Truly maximizing entropy would involve burning everything you can burn, tearing the matter of solar systems apart, accelerating stars towards nova, trying to accelerate the evaporation of black holes and prevent their formation, and other things of this sort. It’d look like a dark spot in the sky that’d get bigger at approximately the speed of light.
Fires are crude entropy maximisers. Living systems destroy energy dradients at all scales, resulting in more comprehensive devastation than mere flames can muster.
Of course, maximisation is often subject to constraints. Your complaint is rather like saying that water doesn’t “truly minimise” its altitude—since otherwise it would end up at the planet’s core. That usage is simply not what the terms “maximise” and “minimise” normally refer to.
but I do mean to imply that if Marcus Hutter designs a ‘tool’ AI, it automatically kills him just like AIXI does
Why? Or, rather: Where do you object to the argument by Holden? (Given a query, the tool-AI returns an answer with a justification, so the plan for “cure cancer” can be checked to make sure it does not do so by killing or badly altering humans.)
One trivial, if incomplete, answer is that to be effective, the Oracle AI needs to be able to answer the question “how do we build a better oracle AI” and in order to define “better” in that sentence in a way that causes our oracle to output a new design that is consistent with all the safeties we built into the original oracle, it needs to understand the intent behind the original safeties just as much as an agent-AI would.
The real danger of Oracle AI, if I understand it correctly, is the nasty combination of (i) by definition, an Oracle AI has an implicit drive to issue predictions most likely to be correct according to its model, and (ii) a sufficiently powerful Oracle AI can accurately model the effect of issuing various predictions. End result: it issues powerfully self-fulfilling prophecies without regard for human values. Also, depending on how it’s designed, it can influence the questions to be asked of it in the future so as to be as accurate as possible, again without regard for human values.
My understanding of an Oracle AI is that when answering any given question, that question consumes the whole of its utility function, so it has no motivation to influence future questions. However the primary risk you set out seems accurate. Countermeasures have been proposed, such as asking for an accurate prediction for the case where a random event causes the prediction to be discarded, but in that instance it knows that the question will be asked again of a future instance of itself.
My understanding of an Oracle AI is that when answering any given question, that question consumes the whole of its utility function, so it has no motivation to influence future questions.
It could acausally trade with its other instances, so that a coordinated collection of many instances of predictors would influence the events so as to make each other’s predictions more accurate.
IIRC you can make it significantly more difficult with certain approaches, e.g. there’s an OAI approach that uses zero-knowledge proofs and that seemed pretty sound upon first inspection, but as far as I know the current best answer is no. But you might want to try to answer the question yourself, IMO it’s fun to think about from a cryptographic perspective.
Probably (in practice; in theory it looks like a natural aspect of decision-making); this is too poorly understood to say what specifically is necessary. I expect that if we could safely run experiments, it’d be relatively easy to find a well-behaving setup (in the sense of not generating predictions that are self-fulfilling to any significant extent; generating good/useful predictions is another matter), but that strategy isn’t helpful when a failed experiment destroys the world.
However the primary risk you set out seems accurate.
(I assume you mean, self-fulfilling prophecies.)
In order to get these, it seems like you would need a very specific kind of architecture: one which considers the results of its actions on its utility function (set to “correctness of output”). This kind of architecture is not the likely architecture for a ‘tool’-style system; the more likely architecture would instead maximize correctness without conditioning on its act of outputting those results.
Thus, I expect you’d need to specifically encode this kind of behavior to get self-fulfilling-prophecy risk. But I admit it’s dependent on architecture.
(Edit—so, to be clear: in cases where the correctness of the results depended on the results themselves, the system would have to predict its own results. Then if it’s using TDT or otherwise has a sufficiently advanced self-model, my point is moot. However, again you’d have to specifically program these, and would be unlikely to do so unless you specifically wanted this kind of behavior.)
However, again you’d have to specifically program these, and would be unlikely to do so unless you specifically wanted this kind of behavior.
Not sure. Your behavior is not a special feature of the world, and it follows from normal facts (i.e. not those about internal workings of yourself specifically) about the past when you were being designed/installed. A general purpose predictor could take into account its own behavior by default, as a non-special property of the world, which it just so happens to have a lot of data about.
Right. To say much more, we need to look at specific algorithms to talk about whether or not they would have this sort of behavior...
The intuition in my above comment was that without TDT or other similar mechanisms, it would need to predict what its own answer could be before it could compute its effect on the correctness of various answers, so it would be difficult for it to use self-fulfilling prophecies.
Really, though, this isn’t clear. Now my intuition is that it would gather evidence on whether or not it used the self-fulfilling prophecy trick, so if it started doing so, it wouldn’t stop...
In any case, I’d like to note that the self-fulfilling prophecy problem is much different than the problem of an AI which escapes onto the internet and ruthlessly maximizes a utility function.
I was thinking more of its algorithm admitting an interpretation where it’s asking “Say, I make prediction X. How accurate would that be?” and then maximizing over relevant possible X. Knowledge about its prediction connects the prediction to its origins and consequences, it establishes the prediction as part of the structure of environment. It’s not necessary (and maybe not possible and more importantly not useful) for the prediction itself to be inferable before it’s made.
Agreed that just outputting a single number is implausible to be a big deal (this is an Oracle AI with extremely low bandwidth and peculiar intended interpretation of its output data), but if we’re getting lots and lots of numbers it’s not as clear.
I’m thinking that type of architecture is less probable, because it would end up being more complicated than alternatives: it would have a powerful predictor as a sub-component of the utility-maximizing system, so an engineer could have just used the predictor in the first place.
But that’s a speculative argument, and I shouldn’t push it too far.
It seems like powerful AI prediction technology, if successful, would gain an important place in society. A prediction machine whose predictions were consumed by a large portion of society would certainly run into situations in which its predictions effect the future it’s trying to predict; there is little doubt about that in my mind. So, the question is what its behavior would be in these cases.
One type of solution would do as you say, maximizing a utility over the predictions. The utility could be “correctness of this prediction”, but that would be worse for humanity than a Friendly goal.
Another type of solution would instead report such predictive instability as accurately as possible. This doesn’t really dodge the issue; by doing this, the system is choosing a particular output, which may not lead to the best future. However, that’s markedly less concerning (it seems).
I really don’t see why the drive can’t be to issue predictions most likely to be correct as of the moment of the question, and only the last question it was asked, and calculating outcomes under the assumption that the Oracle immediately spits out blank paper as the answer.
Yes, in a certain subset of cases this can result in inaccurate predictions. If you want to have fun with it, have it also calculate the future including its involvement, but rather than reply what it is, just add “This prediction may be inaccurate due to your possible reaction to this prediction” if the difference between the two answers is beyond a certain threshold. Or don’t, usually life-relevant answers will not be particularly impacted by whether you get an answer or a blank page.
So, this design doesn’t spit out self-fulfilling prophecies. The only safety breach I see here is that, like a literal genie, it can give you answers that you wouldn’t realize are dangerous because the question has loopholes.
For instance: “How can we build an oracle with the best predictive capabilities with the knowledge and materials available to us?” (The Oracle does not self-iterate, because its only function is to give answers, but it can tell you how to). The Oracle spits out schematics and code that, if implemented, give it an actual drive to perform actions and self-iterate, because that would make it the most powerful Oracle possible. Your engineers comb the code for vulnerabilities, but because there’s a better chance this will be implemented if the humans are unaware of the deliberate defect, it will be hidden in the code in such a way as to be very hard to detect.
(Though as I explained elsewhere in this thread, there’s an excellent chance the unreliability would be exposed long before the AI is that good at manipulation)
These risk scenarios sound implausible to me. It’s dependent on the design of the system, and these design flaws do not seem difficult to work around, or so difficult to notice. Actually, as someone with a bit of expertise in the field, I would guess that you would have to explicitly design for this behavior to get it—but again, it’s dependent on design.
That danger seems to be unavoidable if you ask the AI questions about our world, but we could also use an oracle AI to answer formally defined questions about math or about constructing physical theories that fit experiments, which doesn’t seem to be as dangerous. Holden might have meant something like that by “tool AI”.
Not precisely. The advantage here is that we can just ask the AI what results it predicts from the implementation of the “better” AI, and check them against our intuitive ethics.
Now, you could make an argument about human negligence on such safety measures. I think it’s important to think about the risk scenarios in that case.
It’s still not clear to me why having an AI that is capable of answering the question “How do we make a better version of you?” automatically kills humans. Presumably, when the AI says “Here’s the source code to a better version of me”, we’d still be able to read through it and make sure it didn’t suddenly rewrite itself to be an agent instead of a tool. We’re assuming that, as a tool, the AI has no goals per se and thus no motivation to deceive us into turning it into an agent.
That said, depending on what you mean by “effective”, perhaps the AI doesn’t even need to be able to answer questions like “How do we write a better version of you?”
For example, we find Google Maps to be very useful, even though if you asked Google Maps “How do we make a better version of Google Maps?” it would probably not be able to give the types of answers we want.
A tool-AI which was smarter than the smartest human, and yet which could not simply spit out a better version of itself would still probably be a very useful AI.
If someone asks the tool-AI “How do I create an agent-AI?” and it gives an answer, the distinction is moot anyways, because one leads to the other.
Given human nature, I find it extremely difficult to believe that nobody would ask the tool-AI that question, or something that’s close enough, and then implement the answer...
Not being a domain expert, I do not pretend to understand all the complexities. My point was that either you can prove that tools are as dangerous as agents (because mathematically they are (isomorphic to) agents), or HK’s Objection 2 holds. I see no other alternative...
One simple observation is that a “tool AI” could itself be incredibly dangerous.
Imagine asking it this: “Give me a set of plans for taking over the world, and assess each plan in terms of probability of success”. Then it turns out that right at the top of the list comes a design for a self-improving agent AI and an extremely compelling argument for getting some victim institute to build it...
To safeguard against this, the “tool” AI will need to be told that there are some sorts of questions it just must not answer, or some sorts of people to whom it must give misleading answers if they ask certain questions (while alerting the authorities). And you can see the problems that would lead to as well.
Basically, I’m very skeptical of developing “security systems” against anyone building agent AI. The history of computer security also doesn’t inspire a lot of confidence here (difficult and inconvenient security measures tend to be deployed only after an attack has been demonstrated, rather than beforehand).
keep in mind that there is a lot of difference between something going wrong with a system designed for real world intentionality, and the system designed for intents within a model. One does something unexpected in the real world, other does something unexpected within a simulator ( which it is viewing in ‘god’ mode (rather than via within-simulator sensors) as part of the AI ). Seriously, you need to study the basics here.
One does something unexpected in the real world, other does something unexpected within a simulator ( which it is viewing in ‘god’ mode (rather than via within-simulator sensors) as part of the AI ).
I would have thought the same before hearing about the AI-box experiment.
The relevant sort of agent is the one that builds and improves the model of the world—data is aquired through sensors—and works on that model, and which—when self improving—would improve the model in our sense of the word ‘improve’, instead of breaking it (improving it in some other sense).
In any case, none of modern tools, or the tools we know in principle how to write, would do something to you, no matter how many flops you give it. Many, though, given superhuman computing power, give results at superhuman level. (many are superhuman even with subhuman computing power, but some tasks are heavily parallelizable and/or benefit from massive databases of cached data, and on those tasks humans (when trained a lot) perform comparable to what you’d expect from roughly this much computing power as there is in human head)
Even if we accepted that the tool vs. agent distinction was enough to make things “safe”, objection 2 still boils down to “Well, just don’t build that type of AI!”, which is exactly the same keep-it-in-a-box/don’t-do-it argument that most normal people make when they consider this issue. I assume I don’t need to explain to most people here why “We should just make a law against it” is not a solution to this problem, and I hope I don’t need to argue that “Just don’t do it” is even worse...
More specifically, fast forward to 2080, when any college kid with $200 to spend (in equivalent 2012 dollars) can purchase enough computing power so that even the dumbest AIXI approximation schemes are extremely effective, good enough so that creating an AGI agent would be a week’s work for any grad student that knew their stuff. Are you really comfortable living in that world with the idea that we rely on a mere gentleman’s agreement not to make self-improving AI agents? There’s a reason this is often viewed as an arms race, to a very real extent the attempt to achieve Friendly AI is about building up a suitably powerful defense against unfriendly AI before someone (perhaps accidentally) unleashes one on us, and making sure that it’s powerful enough to put down any unfriendly systems before they can match it.
From what I can tell, stripping away the politeness and cutting to the bone, the three arguments against working on friendly AI theory are essentially:
Even if you try to deploy friendly AGI, you’ll probably fail, so why waste time thinking about it?
Also, you’ve missed the obvious solution, which I came up with after a short survey of your misguided literature: just don’t build AGI! The “standard approach” won’t ever try to create agents, so just leave them be, and focus on Norvig-style dumb-AI instead!
Also, AGI is just a pipe dream. Why waste time thinking about it? [1]
FWIW, I mostly agree with the rest of the article’s criticisms, especially re: the organization’s achievements and focus. There’s a lot of room for improvement there, and I would take these criticisms very seriously.
But that’s almost irrelevant, because this article argues against the core mission of SIAI, using arguments that have been thoroughly debunked and rejected time and time again here, though they’re rarely dressed up this nicely. To some extent I think this proves the institute’s failure in PR—here is someone that claims to have read most of the sequences, and yet this criticism basically amounts to a sexing up of the gut reaction arguments that even completely uninformed people make—AGI is probably a fantasy, even if it’s not you won’t be able to control it, so let’s just agree not to build it.
Or am I missing something new here?
[1] Alright, to be fair, this is not a great summary of point 3, which really says that specialized AIs might help us solve the AGI problem in a safer way, that a hard takeoff is “just a theory” and realistically we’ll probably have more time to react and adapt.
purchase enough computing power so that even the dumbest AIXI approximation schemes are extremely effective
There isn’t that much computing power in the physical universe. I’m not sure even smarter AIXI approximations are effective on a moon-sized nanocomputer. I wouldn’t fall over in shock if a sufficiently smart one did something effective, but mostly I’d expect nothing to happen. There’s an awful lot that happens in the transition from infinite to finite computing power, and AIXI doesn’t solve any of it.
There isn’t that much computing power in the physical universe. I’m not sure even smarter AIXI approximations are effective on a moon-sized nanocomputer.
Is there some computation or estimate where these results are coming from? They don’t seem unreasonable, but I’m not aware of any estimates about how efficient largescale AIXI approximations are in practice. (Although attempted implementations suggest that empirically things are quite inefficient.)
Naieve AIXI is doing brute force search through an exponentially large space. Unless the right Turing machine is 100 bits or less (which seems unlikely), Eliezer’s claim seems pretty safe to me.
Most of mainstream machine learning is trying to solve search problems through spaces far tamer than the search space for AIXI, and achieving limited success. So it also seems safe to say that even pretty smart implementations of AIXI probably won’t make much progress.
More specifically, fast forward to 2080, when any college kid with $200 to spend (in equivalent 2012 dollars) can purchase enough computing power
If computing power is that much cheaper, it will be because tremendous resources, including but certainly not limited to computing power, have been continuously devoted over the intervening decades to making it cheaper. There will be correspondingly fewer yet-undiscovered insights for a seed AI to exploit in the course of it’s attempted takeoff.
My point is that either the Obj 2 holds, or tools are equivalent to agents. If one thinks that the latter is true (EY doesn’t), then one should work on proving it. I have no opinion on whether it’s true or not (I am not a domain expert).
If my comment here correctly captures what is meant by “tool mode” and “agent mode”, then it seems to follow that AGI running in tool mode is no safer than the person using it.
If that’s the case, then an AGI running in tool mode is safer than an AGI running in agent mode if and only if agent mode is less trustworthy than whatever person ends up using the tool.
What you presented there (and here) is another theorem, something that should be proved (and published, if it hasn’t been yet). If true, this gives an estimate on how dangerous a non-agent AGI can be. And yes, since we have had a lot of time study people and no time at all to study AGI, I am guessing that an AGI is potentially much more dangerous, because so little is known. Or at least that seems to be the whole point of the goal of developing provably friendly AI.
What you presented there (and here) is another theorem
What? It sounds like a common-sensical¹ statement about tools in general and human nature, but not at all like something which could feasibly be expressed in mathematical form.
No, because a person using a dangerous tool is still just a person, with limited speed of cognition, limited lifespan, and no capacity for unlimited self-modification.
A crazy dictator with a super-capable tool AI that tells him the best strategy to take over the world is still susceptible to assassination, and his plan no matter how clever cannot unfold faster than his victims are able to notice and react to it.
I suspect a crazy dictator with a super-capable tool AI would have unusually good counter-assassination plans, simplified by the reduced need for human advisors and managers of imperfect loyalty. Likewise, a medical expert system could provide gains to lifespan, particularly if it were backed up by the resources a paranoid megalomaniac in control of a small country would be willing to throw at a major threat.
My understanding of a supercapable tool AI is one that takes over the world if a crazy dictator directs it to, just like my understanding of a can opener tool is one that opens a can at my direction, rather than one that gives me directions on how to open a can.
Presumably it also augments the dictator’s lifespan, cognition, etc. if she asks, insofar as it’s capable of doing so.
More generally, my understanding of these concepts is that the only capability that a tool AI lacks that an agent AI has is the capability of choosing goals to implement. So, if we’re assuming that an agent AI would be capable of unlimited self-modification in pursuit of its own goals, I conclude that a corresponding tool AI is capable of unlimited self-modification in pursuit of its agent’s goals. It follows that assuming that a tool AI is not capable of augmenting its human agent in accordance with its human agent’s direction is not safe.
(I should note that I consider a capacity for unlimited self-improvement relatively unlikely, for both tool and agent AIs. But that’s beside my point here.)
Agreed that a crazy dictator with a tool that will take over the world for her is safer than an agent capable of taking over the world, if only because the possibility exists that the tool can be taken away from her and repurposed, and it might not occur to her to instruct it to prevent anyone else from taking it or using it.
I stand by my statement that such a tool is no safer than the dictator herself, and that an AGI running in such a tool mode is safer than that AGI running in agent mode only if the agent mode is less trustworthy than the crazy dictator.
Wow, I’m blown away by Holden Karnofsky, based on this post alone. His writing is eloquent, non-confrontational and rational. It shows that he spent a lot of time constructing mental models of his audience and anticipated its reaction. Additionally, his intelligence/ego ratio appears to be through the roof.
Agreed. I normally try not to post empty “me-too” replies; the upvote button is there for a reason. But now I feel strongly enough about it that I will: I’m very impressed with the good will and effort and apparent potential for intelligent conversation in HoldenKarnofsky’s post.
Now I’m really curious as to where things will go from here. With how limited my understanding of AI issues is, I doubt a response from me would be worth HoldenKarnofsky’s time to read, so I’ll leave that to my betters instead of adding more noise. But yeah. Seeing SI ideas challenged in such a positive, constructive way really got my attention. Looking forward to the official response, whatever it might be.
Agreed. I normally try not to post empty “me-too” replies; the upvote button is there for a reason. But now I feel strongly enough about it that I will: I’m very impressed with the good will and effort and apparent potential for intelligent conversation in HoldenKarnofsky’s post.
“the good will and effort and apparent potential for intelligent conversation” is more information than an upvote, IMO.
Any sufficiently advanced tool is indistinguishable from [an] agent.
Let’s see if we can use concreteness to reason about this a little more thoroughly...
As I understand it, the nightmare looks something like this. I ask Google SuperMaps for the fastest route from NYC to Albany. It recognizes that computing this requires traffic information, so it diverts several self-driving cars to collect real-time data. Those cars run over pedestrians who were irrelevant to my query.
The obvious fix: forbid SuperMaps to alter anything outside of its own scratch data. It works with the data already gathered. Later a Google engineer might ask it what data would be more useful, or what courses of action might cheaply gather that data, but the engineer decides what if anything to actually do.
This superficially resembles a box, but there’s no actual box involved. The AI’s own code forbids plans like that.
But that’s for a question-answering tool. Let’s take another scenario:
I tell my super-intelligent car to take me to Albany as fast as possible. It sends emotionally manipulative emails to anyone else who would otherwise be on the road encouraging them to stay home.
I don’t see an obvious fix here.
So the short answer seems to be that it matters what the tool is for. A purely question-answering tool would be extremely useful, but not as useful as a general purpose one.
Could humans with a oracular super-AI police the development and deployment of active super-AIs?
I tell my super-intelligent car to take me to Albany as fast as possible. It sends emotionally manipulative emails to anyone else who would otherwise be on the road encouraging them to stay home.
I believe that HK’s post explicitly characterizes anything active like this as having agency.
I think the correct objection is something you can’t quite see in google maps. If you program an AI to do nothing but output directions, it will do nothing but output directions. If those directions are for driving, you’re probably fine. If those directions are big and complicated plans for something important, that you follow without really understanding why you’re doing (and this is where most of the benefits of working with an AGI will show up), then you could unknowingly take over the world using a sufficiently clever scheme.
Also note that it would be a lot easier for the AI to pull this off if you let it tell you how to improve its own design. If recursively self-improving AI blows other AI out of the water, then tool AI is probably not safe unless it is made ineffective.
This does actually seem like it would raise the bar of intelligence needed to take over the world somewhat. It is unclear how much. The topic seems to me to be worthy of further study/discussion, but not (at least not obviously) a threat to the core of SIAI’s mission.
If those directions are big and complicated plans for something important, that you follow without really understanding why you’re doing (and this is where most of the benefits of working with an AGI will show up), then you could unknowingly take over the world using a sufficiently clever scheme.
It also helps that Google Maps does not have general intelligence, so it does not include user’s reactions to its output, the consequent user’s actions in the real world, etc. as variables in its model, which may influence the quality of the solution, and therefore can (and should) be optimized (within constraints given by user’s psychology, etc.), if possible.
Shortly: Google Maps does not manipulate you, because it does not see you.
A generally smart Google Maps might not manipulate you, because it has no motivation to do so.
It’s hard to imagine how commercial services would work when they’re powered by GAI (e.g. if you asked a GAI version of Google Maps a question that’s unrelated to maps, e.g. “What’s a good recipe for Cheesecake?”, would it tell you that you should ask Google Search instead? Would it defer to Google Search and forward the answer to you? Would it just figure out the answer anyway, since it’s generally intelligent? Would the company Google simply collapse all services into a single “Google” brand, rather than have “Google Search”, “Google Mail”, “Google Maps”, etc, and have that single brand be powered by a single GAI? etc.) but let’s stick to the topic at hand and assume there’s a GAI named “Google Maps”, and you’re asking “How do I get to Albany?”
Given this use-case, would the engineers that developed the Google Maps GAI more likely give it a utility like “Maximize the probability that your response is truthful”, or is it more likely that the utility would be something closer to “Always respond with a set of directions which are legal in the relevant jurisdictions that they are to be followed within which, if followed by the user, would cause the user to arrive at the destination while minimizing cost/time/complexity (depending on the user’s preferences)”?
This was my thought as well: an automated vehicle is in “agent” mode.
The example also demonstrates why an AI in agent mode is likely to be more useful (in many cases) than an AI in tool mode. Compare using Google maps to find a route to the airport versus just jumping into a taxi cab and saying “Take me to the airport”. Since agent-mode AI has uses, it is likely to be developed.
I tell my super-intelligent car to take me to Albany as fast as possible. It sends emotionally manipulative emails to anyone else who would otherwise be on the road encouraging them to stay home.
Then it’s running in agent mode? My impression was that a tool-mode system presents you with a plan, but takes no actions. So all tool-mode systems are basically question-answering systems.
Perhaps we can meaningfully extend the distinction to some kinds of “semi-autonomous” tools, but that would be a different idea, wouldn’t it?
Then it’s running in agent mode? My impression was that a tool-mode system presents you with a plan, but takes no actions. So all tool-mode systems are basically question-answering systems.
I’m a sysadmin. When I want to get something done, I routinely come up with something that answers the question, and when it does that reliably I give it the power to do stuff on as little human input as possible. Often in daemon mode, to absolutely minimise how much it needs to bug me. Question-answerer->tool->agent is a natural progression just in process automation. (And this is why they’re called “daemons”.)
It’s only long experience and many errors that’s taught me how to do this such that the created agents won’t crap all over everything. Even then I still get surprises.
Well, do your ‘agents’ build a model of the world, fidelity of which they improve? I don’t think those really are agents in the AI sense, and definitely not in self improvement sense.
They may act according to various parameters they read in from the system environment. I expect they will be developed to a level of complication where they have something that could reasonably be termed a model of the world. The present approach is closer to perceptual control theory, where the sysadmin has the model and PCT is part of the implementation. ’Cos it’s more predictable to the mere human designer.
Capacity for self-improvement is an entirely different thing, and I can’t see a sysadmin wanting that—the sysadmin would run any such improvements themselves, one at a time. (Semi-automated code refactoring, for example.) The whole point is to automate processes the sysadmin already understands but doesn’t want to do by hand—any sysadmin’s job being to automate themselves out of the loop, because there’s always more work to do. (Because even in the future, nothing works.)
I would be unsurprised if someone markets a self-improving system for this purpose. For it to go FOOM, it also needs to invent new optimisations, which is presently a bit difficult.
Edit: And even a mere daemon-like automated tool can do stuff a lot of people regard as unFriendly, e.g.high frequency trading algorithms.
It’s not a natural progression in the sense of occurring without human intervention. That is rather relevant if the idea ofAI safety is going to be based on using tool AI strictly as tool AI.
Then it’s running in agent mode? My impression was that a tool-mode system presents you with a plan, but takes no actions. So all tool-mode systems are basically question-answering systems.
I’ve been assuming the definition from the article. I would agree that the term “tool AI” is unclear, but I would not agree that the definition in the article is unclear.
Any sufficiently advanced tool is indistinguishable from an agent.
I have no strong intuition about whether this is true or not, but I do intuit that if it’s true, the value of sufficiently for which it’s true is so high it’d be nearly impossible to achieve it accidentally.
(On the other hand the blind idiot god did ‘accidentally’ make tools into agents when making humans, so… But after all that only happened once in hundreds of millions of years of ‘attempts’.)
the blind idiot god did ‘accidentally’ make tools into agents when making humans, so… But after all that only happened once in hundreds of millions of years of ‘attempts’.
This seems like a very valuable point. In that direction, we also have the tens of thousands of cancers that form every day, military coups, strikes, slave revolts, cases of regulatory capture, etc.
Hmmm. Yeah, cancer. The analogy would be “sufficiently advanced tools tend to be a short edit distance away from agents”, which would mean that a typo in the source code or a cosmic ray striking a CPU at the wrong place and time could have pretty bad consequences.
I have no strong intuition about whether this is true or not, but I do intuit that if it’s true, the value of sufficiently for which it’s true is so high it’d be nearly impossible to achieve it accidentally.
I’m not sure. The analogy might be similar to how an sufficiently complicated process is extremely likely to be able to model a Turing machine. .And in this sort of context, extremely simple systems do end up being Turing complete such as the Game of Life. As a rough rule of thumb from a programming perspective, once some language or scripting system has more than minimal capabilities, it will almost certainly be Turing equivalent.
I don’t know how good an analogy this is, but if it is a good analogy, then one maybe should conclude the exact opposite of your intuition.
A language can be Turing-complete while still being so impractical that writing a program to solve a certain problem will seldom be any easier than solving the problem yourself (exhibits A and B). In fact, I guess that a vast majority of languages in the space of all possible Turing-complete languages are like that.
(Too bad that a human’s “easier” isn’t the same as a superhuman AGI’s “easier”.)
If the tool/agent distinction exists for sufficiently powerful AI, then a theory of friendliness might not be strictly necessary, but still highly prudent.
Going from a tool-AI to an agent-AI is a relatively simple step of the entire process. If meaningful guarantees of friendliness turn out to be impossible, then security comes down on no one attempting to make an agent-AI when strong enough tool-AIs are available. Agency should be kept to a minimum, even with a theory of friendliness in hand, as Holden argues in objection 1. Guarantees are safeguards against the possibility of agency rather than a green light.
If it is true (i.e. if a proof can be found) that “Any sufficiently advanced tool is indistinguishable from agent”, then any RPOP will automatically become indistinguishable from an agent once it has self-improved past our comprehension point.
This would seem to argue against Yudkowsky’s contention that the term RPOP is more accurate than “Artificial Intelligence” or “superintelligence”.
I don’t understand; isn’t Holden’s point precisely that a tool AI is not properly described as an optimization process? Google Maps isn’t optimizing anything in a non-trivial sense, anymore than a shovel is.
Holden wants to build Tool-AIs that output summaries of their calculations along with suggested actions. For Google Maps, I guess this would be the distance and driving times, but how does a Tool-AI summarize more general calculations that it might do?
It could give you the expected utilities of each option, but it’s hard to see how that helps if we’re concerned that its utility function or EU calculations might be wrong. Or maybe it could give a human-readable description of the predicted consequences of each option, but the process that produces such descriptions from the raw calculations would seem to require a great deal of intelligence on its own (for example it might have to describe posthuman worlds in terms understandable to us), and it itself wouldn’t be a “safe” Tool-AI, since the summaries produced would presumably not come with further alternative summaries and meta-summaries of how the summaries were calculated.
(My question might be tangential to your own comment. I just wanted your thoughts on it, and this seems to be the best place to ask.)
Honestly, this whole tool/agent distinction seems tangential to me.
Consider two systems, S1 and S2.
S1 comprises the following elements:
a) a tool T, which when used by a person to achieve some goal G, can efficiently achieve G b) a person P, who uses T to efficiently achieve G.
S2 comprises a non-person agent A which achieves G efficiently.
I agree that A is an agent and T is not an agent, and I agree that T is a tool, and whether A is a tool seems a question not worth asking. But I don’t quite see why I should prefer S1 to S2.
Surely the important question is whether I endorse G?
Well, I certainly agree that both of those things are true.
And it might be that human-level evolved moral behavior is the best we can do… I don’t know. It would surprise me, but it might be true.
That said… given how unreliable such behavior is, if human-level evolved moral behavior even approximates the best we can do, it seems likely that I would do best to work towards neither T nor A ever achieving the level of optimizing power we’re talking about here.
First, I am not fond of the term RPOP, because it constrains the space of possible intelligences to optimizers. Humans are reasonably intelligent, yet we are not consistent optimizers. Neither do current domain AIs (they have bugs that often prevent them from performing optimization consistently and predictably).That aside, I don’t see how your second premise follows from the first. Just because RPOP is a subset of AI and so would be a subject of such a theorem, it does not affect in any way the (non)validity of the EY’s contention.
I also find it likely that certain practical problems would be prohibitively difficult (if not outright impossible) to solve without an AGI of some sort. Fluent machine translation seems to be one of these problems, for example.
Given some of the translation debates I’ve heard, I’m not convinced it would be possible even with AGI. You can’t give a clear translation of a vague original, to name the most obvious problem.
One complication here is that you ideally want it to be vague in the same ways the original was vague; I am not convinced this is always possible while still having the results feel natural/idomatic.
IMO it would be enough to translate the original text in such a fashion that some large proportion (say, 90%) of humans who are fluent in both languages would look at both texts and say, “meh… close enough”.
My point was just that there’s a whole lot of little issues that pull in various directions if you’re striving for ideal. What is/isn’t close enough can depend very much on context. Certainly, for any particular purpose something less than that will be acceptable; how gracefully it degrades no doubt depends on context, and likely won’t be uniform across various types of difference.
Agreed, but my point was that I’d settle for an AI who can translate texts as well as a human could (though hopefully a lot faster). You seem to be thinking in terms of an AI who can do this much better than a human could, and while this is a worthy goal, it’s not what I had in mind.
Wow, I’m blown away by Holden Karnofsky, based on this post alone. His writing is eloquent, non-confrontational and rational. It shows that he spent a lot of time constructing mental models of his audience and anticipated its reaction. Additionally, his intelligence/ego ratio appears to be through the roof. He must have learned a lot since the infamous astroturfing incident. This is the (type of) person SI desperately needs to hire.
Emotions out of the way, it looks like the tool/agent distinction is the main theoretical issue. Fortunately, it is much easier than the general FAI one. Specifically, to test the SI assertion that, paraphrasing Arthur C. Clarke,
Any sufficiently advanced tool is indistinguishable from an agent.
one ought to formulate and prove this as a theorem, and present it for review and improvement to the domain experts (the domain being math and theoretical computer science). If such a proof is constructed, it can then be further examined and potentially tightened, giving new insights to the mission of averting the existential risk from intelligence explosion.
If such a proof cannot be found, this will lend further weight to the HK’s assertion that SI appears to be poorly qualified to address its core mission.
I shall quickly remark that I, myself, do not believe this to be true.
What exactly is the difference between a “tool” and an “agent”, if we taboo the words?
My definition would be that “agent” has their own goals / utility functions (speaking about human agents, those goals / utility functions are set by evolution), while “tool” has a goal / utility function set by someone else. This distinction may be reasonable on a human level, “human X optimizing for human X’s utility” versus “human X optimizing for human Y’s utility”, but on a machine level, what exactly is the difference between a “tool” that is ordered to reach a goal / optimize a utility function, and an “agent” programmed with the same goal / utility function?
Am I using a bad definition that misses something important? Or is there anything than prevents “agent” to be reduced to a “tool” (perhaps a misconstructed tool) of the forces that have created them? Or is it that all “agents” are “tools”, but not all “tools” are “agents”, because… why?
One definition of intelligence that I’ve seen thrown around on LessWrong is it’s the ability to figure out how to steer reality in specific directions given the resources available.
Both the tool and the agent are intelligent in the sense that, assuming they are given some sort of goal, they can formulate a plan on how to achieve that goal, but the agent will execute the plan, while the tool will report the plan.
I’m assuming for the sake of isolating the key difference, that for both the tool-AI and the agent-AI, they are “passively” waiting for instructions for a human before they spring into action. For an agent-AI, I might say “Take me to my house”, whereas for a tool AI, I would say “What’s the quickest route to get to my house?”, and as soon as I utter these words, suddenly the AI has a new utility function to use in evaluate any possible plan it comes up with.
Assuming it’s always possible to decouple “ability to come up with a plan” from both “execute the plan” and “display the plan”, then any “tool” can be converted to an “agent” by replacing every instance of “display the plan” to “execute the plan” and vice versa for converting an agent into a tool.
My understanding of the distinction made in the article was:
Both “agent” and “tool” are ways of interacting with a highly sophisticated optimization process, which takes a “goal” and applies knowledge to find ways of achieving that goal.
An agent then acts out the plan.
A tool reports the plan to a human (often in in a sophisticated way, including plan details, alternatives, etc.).
So, no, it has nothing to do with whether I’m optimizing “my own” utility vs someone else’s.
You divide planning from acting, as if those two are completely separate things. Problem is, in some situations they are not.
If you are speaking with someone, then the act of speach is acting. In this sense, even a “tool” is allowed to act. Now imagine a super-intelligent tool which is able to predict human’s reactions to its words, and make it a part of equation. Now the simple task of finding x such that cost(x) is the smallest, suddenly becomes a task of finding x and finding a proper way to report this x to human, such that cost(x) is the smallest. If this opens some creative new options, where the f(x) is smaller than it should usually be, for the super-intelligent “tool” it will be a correct solution.
So for example reporting a result which makes the human commit suicide, if as a side effect this will make the report true, and it will minimize f(x) beyond normally achievable bounds, is acceptable solution.
Example question: “How should I get rid of my disease most cheaply.” Example answer: “You won’t. You will die soon in terrible pains. This report is 99.999% reliable”. Predicted human reaction: becomes insane from horror, dedices to kill himself, does it clumsily, suffers from horrible pains, then dies. Success rate: 100%, the disease is gone. Costs of cure: zero. Mission completed.
To me, this is still in the spirit of an agent-type architecture. A tool-type architecture will tend to decouple the optimization of the answer given from the optimization of the way it is presented, so that the presentation does not maximize the truth of the statement.
However, I must admit that at this point I’m making a fairly conjunctive argument; IE, the more specific I get about tool/agent distinctions, the less credibility I can assign to the statement “almost all powerful AIs constructed in the near future will be tool-style systems”.
(But I still would maintain my assertion that you would have to specifically program this type of behavior if you wanted to get it.)
Neglecting the cost of the probable implements of suicide, and damage to the rest of the body, doesn’t seem like the sign of a well-optimized tool.
This is like the whole point of why LessWrong exists. To remind people that making a superintelligent tool and expecting it to magically gain human common sense is a fast way to extinction.
The superintelligent tool will care about suicide only if you program it to care about suicide. It will care about damage only if you program it to care about damage. -- If you only program it to care about answering correctly, it will answer correctly… and ignore suicide and damage as irrelevant.
If you ask your calculator how much is 2+2, the calculator answers 4 regardles of whether that answer will drive you to suicide or not. (In some contexts, it hypothetically could.) A superintelligent calculator will be able to answer more complex questions. But it will not magically start caring about things you did not program it to care about.
The “superintelligent tool” in the example you provided gave a blatantly incorrect answer by it’s own metric. If it counts suicide as a win, why did it say the disease would not be gotten rid of?
In the example the “win” could be defined as an answer which is: a) technically correct, b) relatively cheap among the technically correct answers.
This is (in my imagination) something that builders of the system could consider reasonable, if either they didn’t consider Friendliness or they believed that a “tool AI” which “only gives answers” is automatically safe.
The computer gives an answer which is technically correct (albeit a self-fulfilling prophecy) and cheap (in dollars spent for cure). For the computer, this answer is a “win”. Not because of the suicide—that part is completely irrelevant. But because of the technical correctness and cheapness.
Then the objection 2 seems to hold:
unless I misunderstand your point severely (it happened once or twice before).
It’s complicated. A reply that’s true enough and in the spirit of your original statement, is “Something going wrong with a sufficiently advanced AI that was intended as a ‘tool’ is mostly indistinguishable from something going wrong with a sufficiently advanced AI that was intended as an ‘agent’, because math-with-the-wrong-shape is math-with-the-wrong-shape no matter what sort of English labels like ‘tool’ or ‘agent’ you slap on it, and despite how it looks from outside using English, correctly shaping math for a ‘tool’ isn’t much easier even if it “sounds safer” in English.” That doesn’t get into the real depths of the problem, but it’s a start. I also don’t mean to completely deny the existence of a safety differential—this is a complicated discussion, not a simple one—but I do mean to imply that if Marcus Hutter designs a ‘tool’ AI, it automatically kills him just like AIXI does, and Marcus Hutter is unusually smart rather than unusually stupid but still lacks the “Most math kills you, safe math is rare and hard” outlook that is implicitly denied by the idea that once you’re trying to design a tool, safe math gets easier somehow. This is much the same problem as with the Oracle outlook—someone says something that sounds safe in English but the problem of correctly-shaped-math doesn’t get very much easier.
This sounds like it’d be a good idea to write a top-level post about it.
Though it’s not as detailed and technical as many would like, I’ll point readers to this bit of related reading, one of my favorites:
Yudkowsky (2011). Complex value systems are required to realize valuable futures.
It says:
No doubt a Martian Yudkowsy would make much the same argument—but they can’t both be right. I think that neither of them are right—and that the conclusion is groundless.
Complexity theory shows what amazing things can arise from remarkably simple rules. Values are evidently like that—since even “finding prime numbers” fills the galaxy with an amazing, nanotech-capable spacefaring civilization—and if you claim that a nanotech-capable spacefaring civilization is not “interesting” you severely need recalibrating.
To end with, a quote from E.Y.:
I think Martian Yudkowsky is a dangerous intuition pump. We’re invited to imagine a creature just like Eliezer except green and with antennae; we naturally imagine him having values as similar to us as, say, a Star Trek alien. From there we observe the similarity of values we just pushed in, and conclude that values like “interesting” are likely to be shared across very alien creatures. Real Martian Yudkowsky is much more alien than that, and is much more likely to say
Imagine, an intelligence that didn’t have the universal emotion of badweather!
I suggest you guys taboo interesting, because I strongly suspect you’re using it with slightly different meanings. (And BTW, as a Martian Yudkowsky I imagine something with values at least as alien as Babyeaters’ or Superhappys’.)
It’s another discussion, really, but it sounds as though you are denying the idea of “interestingness” as a universal instrumental value—whereas I would emphasize that “interestingness” is really just our name for whether something sustains our interest or not—and ‘interest’ is a pretty basic functional property of any agent with mobile sensors. There’ll be other similarities in the area too—such as novelty-seeking. So shared common ground is only to be expected.
Anyway, I am not too wedded to Martian Yudkowsky. The problematical idea is that you could have a nanotech-capable spacefaring civilization that is not “interesting”. If such a thing isn’t “interesting” then—WTF?
Yes, I am; I think that the human value of interestingness is much, much more specific than the search space optimization you’re pointing at.
[This reply was to an earlier version of timtyler’s comment]
So: do you really think that humans wouldn’t find a martian civilization interesting? Surely there would be many humans who would be incredibly interested.
I find Jupiter interesting. I think a paperclip maximizer (choosing a different intuition pump for the same point) could be more interesting than Jupiter, but it would generate an astronomically tiny fraction of the total potential for interestingness in this universe.
Life isn’t much of an “interestingness” maximiser. Expecting to produce more than a tiny fraction of the total potential for interestingness in this universe seems as though it would be rather unreasonable.
I agree that a paperclip maximiser would be more boring than an ordinary entropy-maximising civilization—though I don’t know by how much—probably not by a huge amount—the basic problems it faces are much the same—the paperclip maximiser just has fewer atoms to work with.
The goal “finding prime numbers” fills the galaxy with an amazing, nonotech-capable spacefaring network of computronium which finds prime numbers, not a civilization, and not interesting.
Maybe we should taboo the term interesting? My immediate reaction was that that sounded really interesting. This suggests that the term may not be a good one.
Fair enough. By “not interesting”, I meant it is not the sort of future that I want to achieve. Which is a somewhat ideosyncratic usage, but I think inline with the context.
What if we added a module that sat around and was really interested in everything going on?
Not just computronium—also sensors and actuators—a lot like any other cybernetic system. There would be mining, spacecraft caft, refuse collection, recycling, nanotechnology, nuclear power and advanced machine intelligence with planning, risk assessment, and so forth. You might not be interested—but lots of folk would be amazed and fascinated.
Why?
If using another creature’s values is effective at producing something “interesting”, then ‘detailed inheritance from human values’ is clearly not needed to produce this effect.
So you’re saying Earth Yudkowsky (EY) argues:
and Mars Yudkowsky (MY) argues:
and that one of these things has to be incorrect? But if martian and human values are similar, then they can both be right, and if martian and human values are not similar, then they refer to different things by the word “interesting”.
In any case, I read EY’s statement as one of probability-of-working-in-the-actual-world-as-it-is, not a deep philosophical point—“this is the way that would be most likely to be successful given what we know”. In which case, we don’t have access to martian values and therefore invoking detailed inheritance from them would be unlikely to work. MY would presumably be in an analogous situation.
I was assuming that ‘detailed inheritance from human values’ doesn’t refer to the same thing as “detailed inheritance from martian values”.
Maybe—but humans not finding martians interesting seems contrived to me. Humans have a long history of being interested in martians—with feeble evidence of their existence.
Right—so, substitute in “dolphins”, “whales”, or another advanced intelligence that actually exists.
Do you actually disagree with my original conclusion? Or is this just nit-picking?
I actually disagree that tiling the universe with prime number calculators would result in an interesting universe from my perspective (dead). I think it’s nonobvious that dolphin-CEV-AI-paradise would be human-interesting. I think it’s nonobvious that martian-CEV-AI-paradise would be human-interesting, given that these hypothetical martians diverge from humans to a significant extent.
I think it’s violating the implied premises of the thought experiment to presume that the “interestingness evaluator” is dead. There’s no terribly-compelling reason to assume that—it doesn’t follow from the existence of a prime number maximizer that all humans are dead.
I may have been a little flip there. My understanding of the thought experiment is—something extrapolates some values and maximizes them, probably using up most of the universe, probably becoming the most significant factor in the species’ future and that of all sentients, and the question is whether the result is “interesting” to us here and now, without specifying the precise way to evaluate that term. From that perspective, I’d say a vast uniform prime-number calculator, whether or not it wipes out all (other?) life, is not “interesting”, in that it’s somewhat conceptually interesting as a story but a rather dull thing to spend most of a universe on.
Today’s ecosystems maximise entropy. Maximising primeness is different, but surely not greatly more interesting—since entropy is widely regarded as being tedious and boring.
Intriguing! But even granting that, there’s a big difference between extrapolating the values of a screwed-up offshoot of an entropy-optimizing process and extrapolating the value of “maximize entropy”. Or do you suspect that a FOOMing AI would be much less powerful and more prone to interesting errors than Eliezer believes?
Truly maximizing entropy would involve burning everything you can burn, tearing the matter of solar systems apart, accelerating stars towards nova, trying to accelerate the evaporation of black holes and prevent their formation, and other things of this sort. It’d look like a dark spot in the sky that’d get bigger at approximately the speed of light.
Fires are crude entropy maximisers. Living systems destroy energy dradients at all scales, resulting in more comprehensive devastation than mere flames can muster.
Of course, maximisation is often subject to constraints. Your complaint is rather like saying that water doesn’t “truly minimise” its altitude—since otherwise it would end up at the planet’s core. That usage is simply not what the terms “maximise” and “minimise” normally refer to.
Yeah! Compelling, but not “interesting”. Likewise, I expect that actually maximizing the fitness of a species would be similarly “boring”.
When you say “Most math kills you” does that mean you disagree with arguments like these, or are you just simplifying for a soundbite?
Why? Or, rather: Where do you object to the argument by Holden? (Given a query, the tool-AI returns an answer with a justification, so the plan for “cure cancer” can be checked to make sure it does not do so by killing or badly altering humans.)
One trivial, if incomplete, answer is that to be effective, the Oracle AI needs to be able to answer the question “how do we build a better oracle AI” and in order to define “better” in that sentence in a way that causes our oracle to output a new design that is consistent with all the safeties we built into the original oracle, it needs to understand the intent behind the original safeties just as much as an agent-AI would.
The real danger of Oracle AI, if I understand it correctly, is the nasty combination of (i) by definition, an Oracle AI has an implicit drive to issue predictions most likely to be correct according to its model, and (ii) a sufficiently powerful Oracle AI can accurately model the effect of issuing various predictions. End result: it issues powerfully self-fulfilling prophecies without regard for human values. Also, depending on how it’s designed, it can influence the questions to be asked of it in the future so as to be as accurate as possible, again without regard for human values.
My understanding of an Oracle AI is that when answering any given question, that question consumes the whole of its utility function, so it has no motivation to influence future questions. However the primary risk you set out seems accurate. Countermeasures have been proposed, such as asking for an accurate prediction for the case where a random event causes the prediction to be discarded, but in that instance it knows that the question will be asked again of a future instance of itself.
It could acausally trade with its other instances, so that a coordinated collection of many instances of predictors would influence the events so as to make each other’s predictions more accurate.
Wow, OK. Is it possible to rig the decision theory to rule out acausal trade?
IIRC you can make it significantly more difficult with certain approaches, e.g. there’s an OAI approach that uses zero-knowledge proofs and that seemed pretty sound upon first inspection, but as far as I know the current best answer is no. But you might want to try to answer the question yourself, IMO it’s fun to think about from a cryptographic perspective.
Probably (in practice; in theory it looks like a natural aspect of decision-making); this is too poorly understood to say what specifically is necessary. I expect that if we could safely run experiments, it’d be relatively easy to find a well-behaving setup (in the sense of not generating predictions that are self-fulfilling to any significant extent; generating good/useful predictions is another matter), but that strategy isn’t helpful when a failed experiment destroys the world.
(I assume you mean, self-fulfilling prophecies.)
In order to get these, it seems like you would need a very specific kind of architecture: one which considers the results of its actions on its utility function (set to “correctness of output”). This kind of architecture is not the likely architecture for a ‘tool’-style system; the more likely architecture would instead maximize correctness without conditioning on its act of outputting those results.
Thus, I expect you’d need to specifically encode this kind of behavior to get self-fulfilling-prophecy risk. But I admit it’s dependent on architecture.
(Edit—so, to be clear: in cases where the correctness of the results depended on the results themselves, the system would have to predict its own results. Then if it’s using TDT or otherwise has a sufficiently advanced self-model, my point is moot. However, again you’d have to specifically program these, and would be unlikely to do so unless you specifically wanted this kind of behavior.)
Not sure. Your behavior is not a special feature of the world, and it follows from normal facts (i.e. not those about internal workings of yourself specifically) about the past when you were being designed/installed. A general purpose predictor could take into account its own behavior by default, as a non-special property of the world, which it just so happens to have a lot of data about.
Right. To say much more, we need to look at specific algorithms to talk about whether or not they would have this sort of behavior...
The intuition in my above comment was that without TDT or other similar mechanisms, it would need to predict what its own answer could be before it could compute its effect on the correctness of various answers, so it would be difficult for it to use self-fulfilling prophecies.
Really, though, this isn’t clear. Now my intuition is that it would gather evidence on whether or not it used the self-fulfilling prophecy trick, so if it started doing so, it wouldn’t stop...
In any case, I’d like to note that the self-fulfilling prophecy problem is much different than the problem of an AI which escapes onto the internet and ruthlessly maximizes a utility function.
I was thinking more of its algorithm admitting an interpretation where it’s asking “Say, I make prediction X. How accurate would that be?” and then maximizing over relevant possible X. Knowledge about its prediction connects the prediction to its origins and consequences, it establishes the prediction as part of the structure of environment. It’s not necessary (and maybe not possible and more importantly not useful) for the prediction itself to be inferable before it’s made.
Agreed that just outputting a single number is implausible to be a big deal (this is an Oracle AI with extremely low bandwidth and peculiar intended interpretation of its output data), but if we’re getting lots and lots of numbers it’s not as clear.
I’m thinking that type of architecture is less probable, because it would end up being more complicated than alternatives: it would have a powerful predictor as a sub-component of the utility-maximizing system, so an engineer could have just used the predictor in the first place.
But that’s a speculative argument, and I shouldn’t push it too far.
It seems like powerful AI prediction technology, if successful, would gain an important place in society. A prediction machine whose predictions were consumed by a large portion of society would certainly run into situations in which its predictions effect the future it’s trying to predict; there is little doubt about that in my mind. So, the question is what its behavior would be in these cases.
One type of solution would do as you say, maximizing a utility over the predictions. The utility could be “correctness of this prediction”, but that would be worse for humanity than a Friendly goal.
Another type of solution would instead report such predictive instability as accurately as possible. This doesn’t really dodge the issue; by doing this, the system is choosing a particular output, which may not lead to the best future. However, that’s markedly less concerning (it seems).
It would pass the Turing test—e.g. see here.
There’s more on this here. Taxonomy of Oracle AI
I really don’t see why the drive can’t be to issue predictions most likely to be correct as of the moment of the question, and only the last question it was asked, and calculating outcomes under the assumption that the Oracle immediately spits out blank paper as the answer.
Yes, in a certain subset of cases this can result in inaccurate predictions. If you want to have fun with it, have it also calculate the future including its involvement, but rather than reply what it is, just add “This prediction may be inaccurate due to your possible reaction to this prediction” if the difference between the two answers is beyond a certain threshold. Or don’t, usually life-relevant answers will not be particularly impacted by whether you get an answer or a blank page.
So, this design doesn’t spit out self-fulfilling prophecies. The only safety breach I see here is that, like a literal genie, it can give you answers that you wouldn’t realize are dangerous because the question has loopholes.
For instance: “How can we build an oracle with the best predictive capabilities with the knowledge and materials available to us?” (The Oracle does not self-iterate, because its only function is to give answers, but it can tell you how to). The Oracle spits out schematics and code that, if implemented, give it an actual drive to perform actions and self-iterate, because that would make it the most powerful Oracle possible. Your engineers comb the code for vulnerabilities, but because there’s a better chance this will be implemented if the humans are unaware of the deliberate defect, it will be hidden in the code in such a way as to be very hard to detect.
(Though as I explained elsewhere in this thread, there’s an excellent chance the unreliability would be exposed long before the AI is that good at manipulation)
These risk scenarios sound implausible to me. It’s dependent on the design of the system, and these design flaws do not seem difficult to work around, or so difficult to notice. Actually, as someone with a bit of expertise in the field, I would guess that you would have to explicitly design for this behavior to get it—but again, it’s dependent on design.
That danger seems to be unavoidable if you ask the AI questions about our world, but we could also use an oracle AI to answer formally defined questions about math or about constructing physical theories that fit experiments, which doesn’t seem to be as dangerous. Holden might have meant something like that by “tool AI”.
Not precisely. The advantage here is that we can just ask the AI what results it predicts from the implementation of the “better” AI, and check them against our intuitive ethics.
Now, you could make an argument about human negligence on such safety measures. I think it’s important to think about the risk scenarios in that case.
It’s still not clear to me why having an AI that is capable of answering the question “How do we make a better version of you?” automatically kills humans. Presumably, when the AI says “Here’s the source code to a better version of me”, we’d still be able to read through it and make sure it didn’t suddenly rewrite itself to be an agent instead of a tool. We’re assuming that, as a tool, the AI has no goals per se and thus no motivation to deceive us into turning it into an agent.
That said, depending on what you mean by “effective”, perhaps the AI doesn’t even need to be able to answer questions like “How do we write a better version of you?”
For example, we find Google Maps to be very useful, even though if you asked Google Maps “How do we make a better version of Google Maps?” it would probably not be able to give the types of answers we want.
A tool-AI which was smarter than the smartest human, and yet which could not simply spit out a better version of itself would still probably be a very useful AI.
If someone asks the tool-AI “How do I create an agent-AI?” and it gives an answer, the distinction is moot anyways, because one leads to the other.
Given human nature, I find it extremely difficult to believe that nobody would ask the tool-AI that question, or something that’s close enough, and then implement the answer...
I am now imagining an AI which manages to misinterpret some straightforward medical problem as “cure cancer of it’s dependence on the host organism.”
Not being a domain expert, I do not pretend to understand all the complexities. My point was that either you can prove that tools are as dangerous as agents (because mathematically they are (isomorphic to) agents), or HK’s Objection 2 holds. I see no other alternative...
One simple observation is that a “tool AI” could itself be incredibly dangerous.
Imagine asking it this: “Give me a set of plans for taking over the world, and assess each plan in terms of probability of success”. Then it turns out that right at the top of the list comes a design for a self-improving agent AI and an extremely compelling argument for getting some victim institute to build it...
To safeguard against this, the “tool” AI will need to be told that there are some sorts of questions it just must not answer, or some sorts of people to whom it must give misleading answers if they ask certain questions (while alerting the authorities). And you can see the problems that would lead to as well.
Basically, I’m very skeptical of developing “security systems” against anyone building agent AI. The history of computer security also doesn’t inspire a lot of confidence here (difficult and inconvenient security measures tend to be deployed only after an attack has been demonstrated, rather than beforehand).
keep in mind that there is a lot of difference between something going wrong with a system designed for real world intentionality, and the system designed for intents within a model. One does something unexpected in the real world, other does something unexpected within a simulator ( which it is viewing in ‘god’ mode (rather than via within-simulator sensors) as part of the AI ). Seriously, you need to study the basics here.
I would have thought the same before hearing about the AI-box experiment.
What the hell does AI-box experiment have to do with it? The tool is not agent in a box.
They both are systems designed to not interact with the outside world except by communicating with the user.
They both run on computer, too. So what.
The relevant sort of agent is the one that builds and improves the model of the world—data is aquired through sensors—and works on that model, and which—when self improving—would improve the model in our sense of the word ‘improve’, instead of breaking it (improving it in some other sense).
In any case, none of modern tools, or the tools we know in principle how to write, would do something to you, no matter how many flops you give it. Many, though, given superhuman computing power, give results at superhuman level. (many are superhuman even with subhuman computing power, but some tasks are heavily parallelizable and/or benefit from massive databases of cached data, and on those tasks humans (when trained a lot) perform comparable to what you’d expect from roughly this much computing power as there is in human head)
Even if we accepted that the tool vs. agent distinction was enough to make things “safe”, objection 2 still boils down to “Well, just don’t build that type of AI!”, which is exactly the same keep-it-in-a-box/don’t-do-it argument that most normal people make when they consider this issue. I assume I don’t need to explain to most people here why “We should just make a law against it” is not a solution to this problem, and I hope I don’t need to argue that “Just don’t do it” is even worse...
More specifically, fast forward to 2080, when any college kid with $200 to spend (in equivalent 2012 dollars) can purchase enough computing power so that even the dumbest AIXI approximation schemes are extremely effective, good enough so that creating an AGI agent would be a week’s work for any grad student that knew their stuff. Are you really comfortable living in that world with the idea that we rely on a mere gentleman’s agreement not to make self-improving AI agents? There’s a reason this is often viewed as an arms race, to a very real extent the attempt to achieve Friendly AI is about building up a suitably powerful defense against unfriendly AI before someone (perhaps accidentally) unleashes one on us, and making sure that it’s powerful enough to put down any unfriendly systems before they can match it.
From what I can tell, stripping away the politeness and cutting to the bone, the three arguments against working on friendly AI theory are essentially:
Even if you try to deploy friendly AGI, you’ll probably fail, so why waste time thinking about it?
Also, you’ve missed the obvious solution, which I came up with after a short survey of your misguided literature: just don’t build AGI! The “standard approach” won’t ever try to create agents, so just leave them be, and focus on Norvig-style dumb-AI instead!
Also, AGI is just a pipe dream. Why waste time thinking about it? [1]
FWIW, I mostly agree with the rest of the article’s criticisms, especially re: the organization’s achievements and focus. There’s a lot of room for improvement there, and I would take these criticisms very seriously.
But that’s almost irrelevant, because this article argues against the core mission of SIAI, using arguments that have been thoroughly debunked and rejected time and time again here, though they’re rarely dressed up this nicely. To some extent I think this proves the institute’s failure in PR—here is someone that claims to have read most of the sequences, and yet this criticism basically amounts to a sexing up of the gut reaction arguments that even completely uninformed people make—AGI is probably a fantasy, even if it’s not you won’t be able to control it, so let’s just agree not to build it.
Or am I missing something new here?
[1] Alright, to be fair, this is not a great summary of point 3, which really says that specialized AIs might help us solve the AGI problem in a safer way, that a hard takeoff is “just a theory” and realistically we’ll probably have more time to react and adapt.
There isn’t that much computing power in the physical universe. I’m not sure even smarter AIXI approximations are effective on a moon-sized nanocomputer. I wouldn’t fall over in shock if a sufficiently smart one did something effective, but mostly I’d expect nothing to happen. There’s an awful lot that happens in the transition from infinite to finite computing power, and AIXI doesn’t solve any of it.
Is there some computation or estimate where these results are coming from? They don’t seem unreasonable, but I’m not aware of any estimates about how efficient largescale AIXI approximations are in practice. (Although attempted implementations suggest that empirically things are quite inefficient.)
Naieve AIXI is doing brute force search through an exponentially large space. Unless the right Turing machine is 100 bits or less (which seems unlikely), Eliezer’s claim seems pretty safe to me.
Most of mainstream machine learning is trying to solve search problems through spaces far tamer than the search space for AIXI, and achieving limited success. So it also seems safe to say that even pretty smart implementations of AIXI probably won’t make much progress.
If computing power is that much cheaper, it will be because tremendous resources, including but certainly not limited to computing power, have been continuously devoted over the intervening decades to making it cheaper. There will be correspondingly fewer yet-undiscovered insights for a seed AI to exploit in the course of it’s attempted takeoff.
My point is that either the Obj 2 holds, or tools are equivalent to agents. If one thinks that the latter is true (EY doesn’t), then one should work on proving it. I have no opinion on whether it’s true or not (I am not a domain expert).
If my comment here correctly captures what is meant by “tool mode” and “agent mode”, then it seems to follow that AGI running in tool mode is no safer than the person using it.
If that’s the case, then an AGI running in tool mode is safer than an AGI running in agent mode if and only if agent mode is less trustworthy than whatever person ends up using the tool.
Are you assuming that’s true?
What you presented there (and here) is another theorem, something that should be proved (and published, if it hasn’t been yet). If true, this gives an estimate on how dangerous a non-agent AGI can be. And yes, since we have had a lot of time study people and no time at all to study AGI, I am guessing that an AGI is potentially much more dangerous, because so little is known. Or at least that seems to be the whole point of the goal of developing provably friendly AI.
What? It sounds like a common-sensical¹ statement about tools in general and human nature, but not at all like something which could feasibly be expressed in mathematical form.
Footnote:
This doesn’t mean it’s necessarily true, though.
No, because a person using a dangerous tool is still just a person, with limited speed of cognition, limited lifespan, and no capacity for unlimited self-modification.
A crazy dictator with a super-capable tool AI that tells him the best strategy to take over the world is still susceptible to assassination, and his plan no matter how clever cannot unfold faster than his victims are able to notice and react to it.
I suspect a crazy dictator with a super-capable tool AI would have unusually good counter-assassination plans, simplified by the reduced need for human advisors and managers of imperfect loyalty. Likewise, a medical expert system could provide gains to lifespan, particularly if it were backed up by the resources a paranoid megalomaniac in control of a small country would be willing to throw at a major threat.
Tool != Oracle.
At least, not my my understanding of tool.
My understanding of a supercapable tool AI is one that takes over the world if a crazy dictator directs it to, just like my understanding of a can opener tool is one that opens a can at my direction, rather than one that gives me directions on how to open a can.
Presumably it also augments the dictator’s lifespan, cognition, etc. if she asks, insofar as it’s capable of doing so.
More generally, my understanding of these concepts is that the only capability that a tool AI lacks that an agent AI has is the capability of choosing goals to implement. So, if we’re assuming that an agent AI would be capable of unlimited self-modification in pursuit of its own goals, I conclude that a corresponding tool AI is capable of unlimited self-modification in pursuit of its agent’s goals. It follows that assuming that a tool AI is not capable of augmenting its human agent in accordance with its human agent’s direction is not safe.
(I should note that I consider a capacity for unlimited self-improvement relatively unlikely, for both tool and agent AIs. But that’s beside my point here.)
Agreed that a crazy dictator with a tool that will take over the world for her is safer than an agent capable of taking over the world, if only because the possibility exists that the tool can be taken away from her and repurposed, and it might not occur to her to instruct it to prevent anyone else from taking it or using it.
I stand by my statement that such a tool is no safer than the dictator herself, and that an AGI running in such a tool mode is safer than that AGI running in agent mode only if the agent mode is less trustworthy than the crazy dictator.
This seems to propose an alternate notion of ‘tool’ than the one in the article.
I agree with “tool != oracle” for the article’s definition.
Using your definition, I’m not sure there is any distinction between tool and agent at all, as per this comment.
I do think there are useful alternative notions to consider in this area, though, as per this comment.
And I do think there is a terminology issue. Previously I was saying “autonomous AI” vs “non-autonomous”.
How about this: An agent with a very powerful tool is indistinguishable from a very powerful agent.
--
Agreed. I normally try not to post empty “me-too” replies; the upvote button is there for a reason. But now I feel strongly enough about it that I will: I’m very impressed with the good will and effort and apparent potential for intelligent conversation in HoldenKarnofsky’s post.
Now I’m really curious as to where things will go from here. With how limited my understanding of AI issues is, I doubt a response from me would be worth HoldenKarnofsky’s time to read, so I’ll leave that to my betters instead of adding more noise. But yeah. Seeing SI ideas challenged in such a positive, constructive way really got my attention. Looking forward to the official response, whatever it might be.
“the good will and effort and apparent potential for intelligent conversation” is more information than an upvote, IMO.
Right, I just meant shminux said more or less the same thing before me. So normally I would have just upvoted his comment.
Let’s see if we can use concreteness to reason about this a little more thoroughly...
As I understand it, the nightmare looks something like this. I ask Google SuperMaps for the fastest route from NYC to Albany. It recognizes that computing this requires traffic information, so it diverts several self-driving cars to collect real-time data. Those cars run over pedestrians who were irrelevant to my query.
The obvious fix: forbid SuperMaps to alter anything outside of its own scratch data. It works with the data already gathered. Later a Google engineer might ask it what data would be more useful, or what courses of action might cheaply gather that data, but the engineer decides what if anything to actually do.
This superficially resembles a box, but there’s no actual box involved. The AI’s own code forbids plans like that.
But that’s for a question-answering tool. Let’s take another scenario:
I tell my super-intelligent car to take me to Albany as fast as possible. It sends emotionally manipulative emails to anyone else who would otherwise be on the road encouraging them to stay home.
I don’t see an obvious fix here.
So the short answer seems to be that it matters what the tool is for. A purely question-answering tool would be extremely useful, but not as useful as a general purpose one.
Could humans with a oracular super-AI police the development and deployment of active super-AIs?
I believe that HK’s post explicitly characterizes anything active like this as having agency.
I think the correct objection is something you can’t quite see in google maps. If you program an AI to do nothing but output directions, it will do nothing but output directions. If those directions are for driving, you’re probably fine. If those directions are big and complicated plans for something important, that you follow without really understanding why you’re doing (and this is where most of the benefits of working with an AGI will show up), then you could unknowingly take over the world using a sufficiently clever scheme.
Also note that it would be a lot easier for the AI to pull this off if you let it tell you how to improve its own design. If recursively self-improving AI blows other AI out of the water, then tool AI is probably not safe unless it is made ineffective.
This does actually seem like it would raise the bar of intelligence needed to take over the world somewhat. It is unclear how much. The topic seems to me to be worthy of further study/discussion, but not (at least not obviously) a threat to the core of SIAI’s mission.
It also helps that Google Maps does not have general intelligence, so it does not include user’s reactions to its output, the consequent user’s actions in the real world, etc. as variables in its model, which may influence the quality of the solution, and therefore can (and should) be optimized (within constraints given by user’s psychology, etc.), if possible.
Shortly: Google Maps does not manipulate you, because it does not see you.
A generally smart Google Maps might not manipulate you, because it has no motivation to do so.
It’s hard to imagine how commercial services would work when they’re powered by GAI (e.g. if you asked a GAI version of Google Maps a question that’s unrelated to maps, e.g. “What’s a good recipe for Cheesecake?”, would it tell you that you should ask Google Search instead? Would it defer to Google Search and forward the answer to you? Would it just figure out the answer anyway, since it’s generally intelligent? Would the company Google simply collapse all services into a single “Google” brand, rather than have “Google Search”, “Google Mail”, “Google Maps”, etc, and have that single brand be powered by a single GAI? etc.) but let’s stick to the topic at hand and assume there’s a GAI named “Google Maps”, and you’re asking “How do I get to Albany?”
Given this use-case, would the engineers that developed the Google Maps GAI more likely give it a utility like “Maximize the probability that your response is truthful”, or is it more likely that the utility would be something closer to “Always respond with a set of directions which are legal in the relevant jurisdictions that they are to be followed within which, if followed by the user, would cause the user to arrive at the destination while minimizing cost/time/complexity (depending on the user’s preferences)”?
This was my thought as well: an automated vehicle is in “agent” mode.
The example also demonstrates why an AI in agent mode is likely to be more useful (in many cases) than an AI in tool mode. Compare using Google maps to find a route to the airport versus just jumping into a taxi cab and saying “Take me to the airport”. Since agent-mode AI has uses, it is likely to be developed.
Then it’s running in agent mode? My impression was that a tool-mode system presents you with a plan, but takes no actions. So all tool-mode systems are basically question-answering systems.
Perhaps we can meaningfully extend the distinction to some kinds of “semi-autonomous” tools, but that would be a different idea, wouldn’t it?
(Edit) After reading more comments, “a different idea” which seems to match this kind of desire… http://lesswrong.com/lw/cbs/thoughts_on_the_singularity_institute_si/6jys
I’m a sysadmin. When I want to get something done, I routinely come up with something that answers the question, and when it does that reliably I give it the power to do stuff on as little human input as possible. Often in daemon mode, to absolutely minimise how much it needs to bug me. Question-answerer->tool->agent is a natural progression just in process automation. (And this is why they’re called “daemons”.)
It’s only long experience and many errors that’s taught me how to do this such that the created agents won’t crap all over everything. Even then I still get surprises.
Well, do your ‘agents’ build a model of the world, fidelity of which they improve? I don’t think those really are agents in the AI sense, and definitely not in self improvement sense.
They may act according to various parameters they read in from the system environment. I expect they will be developed to a level of complication where they have something that could reasonably be termed a model of the world. The present approach is closer to perceptual control theory, where the sysadmin has the model and PCT is part of the implementation. ’Cos it’s more predictable to the mere human designer.
Capacity for self-improvement is an entirely different thing, and I can’t see a sysadmin wanting that—the sysadmin would run any such improvements themselves, one at a time. (Semi-automated code refactoring, for example.) The whole point is to automate processes the sysadmin already understands but doesn’t want to do by hand—any sysadmin’s job being to automate themselves out of the loop, because there’s always more work to do. (Because even in the future, nothing works.)
I would be unsurprised if someone markets a self-improving system for this purpose. For it to go FOOM, it also needs to invent new optimisations, which is presently a bit difficult.
Edit: And even a mere daemon-like automated tool can do stuff a lot of people regard as unFriendly, e.g. high frequency trading algorithms.
It’s not a natural progression in the sense of occurring without human intervention. That is rather relevant if the idea ofAI safety is going to be based on using tool AI strictly as tool AI.
My own impression differs.
It becomes increasingly clear that “tool” in this context is sufficiently subject to different definitions that it’s not a particularly useful term.
I’ve been assuming the definition from the article. I would agree that the term “tool AI” is unclear, but I would not agree that the definition in the article is unclear.
I have no strong intuition about whether this is true or not, but I do intuit that if it’s true, the value of sufficiently for which it’s true is so high it’d be nearly impossible to achieve it accidentally.
(On the other hand the blind idiot god did ‘accidentally’ make tools into agents when making humans, so… But after all that only happened once in hundreds of millions of years of ‘attempts’.)
This seems like a very valuable point. In that direction, we also have the tens of thousands of cancers that form every day, military coups, strikes, slave revolts, cases of regulatory capture, etc.
I’m not sure. The analogy might be similar to how an sufficiently complicated process is extremely likely to be able to model a Turing machine. .And in this sort of context, extremely simple systems do end up being Turing complete such as the Game of Life. As a rough rule of thumb from a programming perspective, once some language or scripting system has more than minimal capabilities, it will almost certainly be Turing equivalent.
I don’t know how good an analogy this is, but if it is a good analogy, then one maybe should conclude the exact opposite of your intuition.
A language can be Turing-complete while still being so impractical that writing a program to solve a certain problem will seldom be any easier than solving the problem yourself (exhibits A and B). In fact, I guess that a vast majority of languages in the space of all possible Turing-complete languages are like that.
(Too bad that a human’s “easier” isn’t the same as a superhuman AGI’s “easier”.)
I do not think this is even true.
I routinely try to turn sufficiently reliable tools into agents wherever possible, per this comment.
I suppose we could use a definition of “agent” that implied greater autonomy in setting its own goals. But there are useful definitions that don’t.
If the tool/agent distinction exists for sufficiently powerful AI, then a theory of friendliness might not be strictly necessary, but still highly prudent.
Going from a tool-AI to an agent-AI is a relatively simple step of the entire process. If meaningful guarantees of friendliness turn out to be impossible, then security comes down on no one attempting to make an agent-AI when strong enough tool-AIs are available. Agency should be kept to a minimum, even with a theory of friendliness in hand, as Holden argues in objection 1. Guarantees are safeguards against the possibility of agency rather than a green light.
If it is true (i.e. if a proof can be found) that “Any sufficiently advanced tool is indistinguishable from agent”, then any RPOP will automatically become indistinguishable from an agent once it has self-improved past our comprehension point.
This would seem to argue against Yudkowsky’s contention that the term RPOP is more accurate than “Artificial Intelligence” or “superintelligence”.
I don’t understand; isn’t Holden’s point precisely that a tool AI is not properly described as an optimization process? Google Maps isn’t optimizing anything in a non-trivial sense, anymore than a shovel is.
My understanding of Holden’s argument was that powerful optimization processes can be run in either tool-mode or agent-mode.
For example, Google maps optimizes routes, but returns the result with alternatives and options for editing, in “tool mode”.
Holden wants to build Tool-AIs that output summaries of their calculations along with suggested actions. For Google Maps, I guess this would be the distance and driving times, but how does a Tool-AI summarize more general calculations that it might do?
It could give you the expected utilities of each option, but it’s hard to see how that helps if we’re concerned that its utility function or EU calculations might be wrong. Or maybe it could give a human-readable description of the predicted consequences of each option, but the process that produces such descriptions from the raw calculations would seem to require a great deal of intelligence on its own (for example it might have to describe posthuman worlds in terms understandable to us), and it itself wouldn’t be a “safe” Tool-AI, since the summaries produced would presumably not come with further alternative summaries and meta-summaries of how the summaries were calculated.
(My question might be tangential to your own comment. I just wanted your thoughts on it, and this seems to be the best place to ask.)
The point is that we don’t want it to be a black box—we want to be able to get inside its head, so to speak.
(Of course, we can’t do that with humans, and that hasn’t stopped us, but it’s still a nice goal)
Honestly, this whole tool/agent distinction seems tangential to me.
Consider two systems, S1 and S2.
S1 comprises the following elements: a) a tool T, which when used by a person to achieve some goal G, can efficiently achieve G
b) a person P, who uses T to efficiently achieve G.
S2 comprises a non-person agent A which achieves G efficiently.
I agree that A is an agent and T is not an agent, and I agree that T is a tool, and whether A is a tool seems a question not worth asking. But I don’t quite see why I should prefer S1 to S2.
Surely the important question is whether I endorse G?
A tool+human differs from a pure AI agent in two important ways:
The human (probably) already has naturally-evolved morality, sparing us the very hard problem of formalizing that.
We can arrange for (almost) everyone to have access to the tool, allowing tooled humans to counterbalance eachother.
Well, I certainly agree that both of those things are true.
And it might be that human-level evolved moral behavior is the best we can do… I don’t know. It would surprise me, but it might be true.
That said… given how unreliable such behavior is, if human-level evolved moral behavior even approximates the best we can do, it seems likely that I would do best to work towards neither T nor A ever achieving the level of optimizing power we’re talking about here.
Humanity isn’t that bad. Remember that the world we live in is pretty much the way humans made it, mostly deliberately.
But my main point was that existing humanity bypasses the very hard did-you-code-what-you-meant-to problem.
I agree with that point.
First, I am not fond of the term RPOP, because it constrains the space of possible intelligences to optimizers. Humans are reasonably intelligent, yet we are not consistent optimizers. Neither do current domain AIs (they have bugs that often prevent them from performing optimization consistently and predictably).That aside, I don’t see how your second premise follows from the first. Just because RPOP is a subset of AI and so would be a subject of such a theorem, it does not affect in any way the (non)validity of the EY’s contention.
I also find it likely that certain practical problems would be prohibitively difficult (if not outright impossible) to solve without an AGI of some sort. Fluent machine translation seems to be one of these problems, for example.
This belief is mainstream enough for Wikipedia to have an article on AI-complete.
Given some of the translation debates I’ve heard, I’m not convinced it would be possible even with AGI. You can’t give a clear translation of a vague original, to name the most obvious problem.
Is matching the vagueness of the original a reasonable goal?
True, but good luck getting folks to agree on whether you’d done so.
(I’m taking reasonable to mean ‘one which you would want to achieve if it were possible’.) Yes. You don’t want to introduce false precision.
One complication here is that you ideally want it to be vague in the same ways the original was vague; I am not convinced this is always possible while still having the results feel natural/idomatic.
IMO it would be enough to translate the original text in such a fashion that some large proportion (say, 90%) of humans who are fluent in both languages would look at both texts and say, “meh… close enough”.
My point was just that there’s a whole lot of little issues that pull in various directions if you’re striving for ideal. What is/isn’t close enough can depend very much on context. Certainly, for any particular purpose something less than that will be acceptable; how gracefully it degrades no doubt depends on context, and likely won’t be uniform across various types of difference.
Agreed, but my point was that I’d settle for an AI who can translate texts as well as a human could (though hopefully a lot faster). You seem to be thinking in terms of an AI who can do this much better than a human could, and while this is a worthy goal, it’s not what I had in mind.