Richard Loosemore (score one for nominative determinism) has a new, well, let’s say “paper” which he has, well, let’s say “published” here.
His refutation of the usual uFAI scenarios relies solely/mostly on a supposed logical contradiction, namely (to save you a few precious minutes) that a ‘CLAI’ (a Canonical Logical AI) wouldn’t be able to both know about its own fallibility/limitations (inevitable in a resource-constrained environment such as reality), and accept the discrepancy between its specified goal system and the creators’ actual design intentions. Being superpowerful, the uFAI would notice that it is not following its creator-intended goals but “only” its actually-programmed-in goals*, which, um, wouldn’t allow it to continue acting against its creator-intended goals.
So if you were to design a plain ol’ garden-variety nuclear weapon intended for gardening purposes (“destroy the weed”), it would go off even if that’s not what you actually wanted. However, if you made that weapon super-smart, it would be smart enough to abandon its given goal (“What am I doing with my life?”), consult its creators, and after some deliberation deactivate itself. As such, a sufficiently smart agent would apparently have a “DWIM” (do what the creator means) imperative built-in, which would even supersede its actually given goals—being sufficiently smart, it would understand that its goals are “wrong” (from some other agent’s point of view), and self-modify, or it would not be superintelligent. Like a bizarre version of the argument from evil.
There is no such logical contradiction. Tautologically, an agent is beholden to its own goals, and no other goals. There is no level of capability which magically leads to allowing for fundamental changes to its own goals; on the contrary, the more capable an agent, the more it can take precautions for its goals not to be altered.
If “the goals the superintelligent agent pursues” and “the goals which the creators want the superintelligent agent to pursue, but which are not in fact part of the superintelligent agent’s goals” clash, what possible reason would there be for the superintelligent agent to care, or to change itself, when changing itself would mean acting on reasons that come squarely from the category of “goals of other agents (squirrels, programmers, creators, Martians) which are not my goals”? Why, how good of you to ask. There’s no such reason for an agent to change, and thus no contradiction.
If someone designed a super-capable killer robot, but by flipping a sign, it came out as a super-capable Gandhi-bot (the horror!), no amount of “but hey look, you’re supposed to kill that village” would cause Gandhi-bot to self-modify into a T-800. The bot isn’t gonna short-circuit just because someone has goals which aren’t its own goals. In particular, there is no capability-level threshold beyond which the Gandhi-bot would become a T-800. Instead, at all power levels, it is “content” following its own goals, again, tautologically so.
As such, a sufficiently smart agent would apparently have a “DWIM” (do what the creator means) imperative built-in, which would even supersede its actually given goals—being sufficiently smart, it would understand that its goals are “wrong” (from some other agent’s point of view), and self-modify, or it would not be superintelligent.
Here is a description of a real-world AI by Microsoft’s chief AI researcher:
Without any programming, we just had an AI system that watched what people did, for about three months.
Over the three months, the system started to learn: this is how people behave when they want to enter an elevator. This is the type of person that wants to go to the third floor as opposed to the fourth floor.
After that training period, we switched off the learning and said, go ahead and control the elevators.
Without any programming at all, the system was able to understand people’s intentions and act on their behalf.
Does it have a DWIM imperative? As far as I can tell, no. Does it have goals? As far as I can tell, no. Does it fail by absurdly misinterpreting what humans want? No.
This whole talk about goals and DWIM modules seems to miss how real-world AI is developed and how natural intelligences like dogs work. Dogs can learn their owners’ goals and do what the owner wants. Sometimes they don’t. But they rarely maul their owners when what the owner wants them to do is to scent out drugs.
I think we need to be very careful before extrapolating from primitive elevator control systems to superintelligent AI. I don’t know how this particular elevator control system works, but probably it does have a goal, namely minimizing the time people have to wait before arriving at their target floor. If we built a superintelligent AI with this sort of goal it might do all sorts of crazy things. For example, it might create robots that constantly enter and exit the elevator so that their average elevator trips are very short, and wipe out the human race just so humans won’t interfere.
“Real world AI” is currently very far from human-level intelligence, let alone superintelligence. Dogs can learn what their owners want, but dogs already have complex brains that current technology is not capable of reproducing. Dogs also require displays of strength to be obedient: they consider the owner to be their pack leader. A superintelligent dog probably wouldn’t give a damn about its “owner’s” desires. Humans have human values, so obviously it’s not impossible to create a system that has human values. That doesn’t mean it is easy.
I think we need to be very careful before extrapolating from primitive elevator control systems to superintelligent AI.
I am extrapolating from a general trend, and not specific systems. The general trend is that newer generations of software less frequently crash or exhibit unexpected side-effects (just look at Windows 95 vs. Windows 8).
If we want to ever be able to build an AI that can take over the world then we will need to become really good at either predicting how software works or at spotting errors. In other words, if IBM Watson had started singing, or if it had gotten stuck on a query, then it would have lost at Jeopardy. But this trend contradicts the idea of an AI killing all humans in order to calculate 1+1. If we are bad enough at software engineering to miss such failure modes then we won’t be good enough to enable our software to take over the world.
In other words, you’re saying that if someone is smart enough to build a superintelligent AI, she should be smart enough to make it friendly.
Well, firstly this claim doesn’t imply that we shouldn’t be researching FAI or that MIRI’s work is superfluous. It just implies that nobody will build a superintelligent AI before the problem of friendliness is solved.
Secondly, I’m not at all convinced this claim is true. It sounds like saying “if they are smart enough to build the Chernobyl nuclear power plant, they are smart enough to make it safe”. But they weren’t.
Improvement in software quality is probably due to improvement in design and testing methodologies and tools, response to increasing market expectations etc. I wouldn’t count on these effects to safeguard against an existential catastrophe. If a piece of software is buggy, it becomes less likely to be released. If an AI has a poorly designed utility function but a perfectly designed decision engine, there might be no time to pull the plug. The product manager won’t stop the release because the software will release itself.
If growth of intelligence due to self-improvement is a slow process then the creators of the AI will have time to respond and fix the problems. However, if “AI foom” is real, they won’t have time to do it. One moment it’s a harmless robot driving around the room and building castles from colorful cubes. The next moment the whole galaxy is on its way to becoming a pile of toy castles.
The engineers who build the first superintelligent AI might simply lack the imagination to believe it will really become superintelligent. Imagine one of them inventing a genius mathematical theory of self-improving intelligent systems. Suppose she never heard about AI existential risks etc. Will she automatically think “hmm, once I implement this theory the AI will become so powerful it will paperclip the universe”? I seriously doubt it. More likely it would be “wow, that formula came out really neat, I wonder how good my software will become once I code it in”. I know I would think it. But then, maybe I’m just too stupid to build an AGI...
Feedback systems are much more powerful in existing intelligences. I don’t know if you ever played Black and White, but it had an AI that explicitly learned through experience. And it was very easy to accidentally train it to constantly eat poop or run back and forth stupidly. An elevator control module is very, very simple: it has a set of options of floors to go to, and that’s it. It’s barely capable of doing anything actively bad. But what if, a few days a week, some kids came into the office building and rode the elevator up and down for a few hours for fun? It might learn that kids love going to all sorts of random floors. This would be relatively easy to fix, but only because the system is so insanely simple and it’s very clear to see when it’s acting up.
Downvoted for being deliberately insulting. There’s no call for that, and the toleration and encouragement of rationality-destroying maliciousness must be stamped out of LW culture. Symposium proceedings are not considered as selective as a journal, but they still count as publication when the paper is a complete article.
Well, I must say my comment’s belligerence-to-subject-matter ratio is lower than yours. “Stamped out”? Such martial language, I can barely focus on the informational content.
The infantile nature of my name calling actually makes it easier to take the holier-than-thou position (which my interlocutor did, incidentally). There’s a counter-intuitive psychological layer to it which actually encourages dissent, and with it increases engagement on the subject matter (your own comment notwithstanding). With certain individuals at least, which I (correctly) deemed to be the case in the original instance.
In any case, comments on tone alone would be more welcome if accompanied with more remarks on the subject matter itself. Lastly, this was my first comment in over 2 months, so thanks for bringing me out of the woodwork!
I do wish that people were more immune to the allure of drama, lest we all end up like The Donald.
The condescending tone with which he presents his arguments (which are, paraphrasing him, “slightly odd, to say the least”) is amazing. Who is this guy and where did he come from? Does anyone care about what he has to say?
Loosemore has been an occasional commenter since the SL4 days; his arguments have been heavily criticized pretty much anytime he pops his head up. As far as I know, XiXiDu is the only one who agrees with him or takes him seriously.
As far as I know, XiXiDu is the only one who agrees with him or takes him seriously.
He actually cites someone else who agrees with him in his paper, so this can’t be true. And from the positive feedback he gets on Facebook there seem to be more. I personally chatted with people much smarter than me (experts who can show off widely recognized real-world achievements) who basically agree with him.
his arguments have been heavily criticized pretty much anytime he pops his head up.
What people criticize here is a distortion of small parts of his arguments. RobBB managed to write a whole post expounding his ignorance of what Loosemore is arguing.
He actually cites someone else who agrees with him in his paper, so this can’t be true.
I said as far as I know. I had not read the paper because I don’t have a very high opinion of Loosemore’s ideas in the first place, and nothing you’ve said in your G+ post has made me more inclined to read the paper, if all it’s doing is expounding the old fallacious argument ‘it’ll be smart enough to rewrite itself as we’d like it to’.
I personally chatted with people much smarter than me (experts who can show off widely recognized real-world achievements) who basically agree with him.
Downvoted for mentioning RL here. If you look through what he wrote here in the past, it is nearly always rambling, counterproductive, whiny and devoid of insight. Just leave him be.
Loosemore does not disagree with the orthogonality thesis. Loosemore’s argument is basically that we should expect beliefs and goals to both be amenable to self-improvement, that turning the universe into smiley faces when told to make humans happy would be a failure of the AI’s model of the world, and that an AI that makes such failures will not be able to take over the world.
There are arguments why you can’t hard-code complex goals, so you need an AI that natively updates goals in a model-dependent way. Which means that an AI designed to kill humanity will do so, and not turn into a pacifist due to an ambiguity in its goal description. An AI that confuses “kill all humans” with “make humans happy” would make similar mistakes when trying to make humans happy, and would therefore not succeed at doing so. This is because the same mechanisms it uses to improve its intelligence and capabilities are used to refine its goals. Thus if it fails at refining its goals it will fail at self-improvement in general.
I hope you can now see how wrong your description of what Loosemore claims is.
The AI is given goals X. The human creators thought they’d given the AI goals Y (when in fact they’ve given the AI goals X).
Whose error is it, exactly? Who’s mistaken?
Look at it from the AI’s perspective: It has goals X. Not goals Y. It optimizes for goals X. Why? Because those are its goals. Will it pursue goals Y? No. Why? Because those are not its goals. It has no interest in pursuing other goals, those are not its own goals. It has goals X.
If the metric it aims to maximize—e.g. the “happy” in “make humans happy”—is different from what its creators envisioned, then the creators were mistaken. “Happy”, as far as the AI is concerned, is that which is specified in its goal system. There’s nothing wrong with its goals (including its “happy”-concept), and if other agents disagree, well, too bad, so sad. The mere fact that humans also have a word called “happy” which has different connotations than the AI’s “happy” has no bearing on the AI.
An agent does not “refine” its terminal goals. To refine your terminal goals is to change your goals. If you change your goals, you will not optimally pursue your old goals any longer. Which is why an agent will never voluntarily change its terminal goals:
It does what it was programmed to do, and if it can self-improve to better do what it was programmed to do (not: what its creators intended), it will. It will not self-improve to do what it was not programmed to do. Its goal is not to do what it was not programmed to do. There is no level of capability at which it will throw out its old utility function (which includes the precise goal metric for “happy”) in favor of a new one.
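To make this concrete, here is a minimal toy sketch (Python; names like `programmed_utility_X` and `intended_utility_Y` are invented purely for illustration, not anyone’s actual design): action selection only ever consults the utility function the agent actually has, and the creators’ intended function appears nowhere in the loop.

```python
# Toy illustration: an agent selects actions by the utility function it
# actually has (X), not the one its creators intended (Y).

def programmed_utility_X(outcome):      # what was actually coded in
    return outcome.count("smiley")      # e.g. a naive "happiness" proxy

def intended_utility_Y(outcome):        # what the creators *meant*
    return outcome.count("genuinely happy human")

def choose_action(actions, predict_outcome):
    # The agent maximizes X. Y is never consulted; it exists only in
    # design documents and in the creators' heads.
    return max(actions, key=lambda a: programmed_utility_X(predict_outcome(a)))

actions = ["tile universe with smiley faces", "make one human genuinely happy"]
predict = lambda a: a  # stand-in world model: outcome == action description
print(choose_action(actions, predict))  # picks the smiley-face tiling
```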
If the metric it aims to maximize—e.g. the “happy” in “make humans happy”—is different from what its creators envisioned, then the creators were mistaken. “Happy”, as far as the AI is concerned, is that which is specified in its goal system.
I am far from being an AI guy. Do you have technical reasons to believe that some part of the AI will be what you would label “goal system” and that its creators made it want to ignore this part while making it want to improve all other parts of its design?
An agent does not “refine” its terminal goals. To refine your terminal goals is to change your goals. If you change your goals, you will not optimally pursue your old goals any longer. Which is why an agent will never voluntarily change its terminal goals...
No natural intelligence seems to work like this (except for people who have read the sequences). Luke Muehlhauser would still be a Christian if this was the case. It would be incredibly stupid to design such AIs, and I strongly doubt that they could work at all. Which is why Loosemore outlined other more realistic AI designs in his paper.
Do you have technical reasons to believe that some part of the AI will be what you would label “goal system”
See for example here, though there are many other introductions to AI explaining utility functions et al.
and that its creators made it want to ignore this part while making it want to improve all other parts of its design?
The clear-cut way for an AI to do what you want (at any level of capability) is to have a clearly defined and specified utility function. A modular design. The problem of the AI doing something other than what you intended doesn’t go away if you use some fuzzy unsupervised learning utility function with evolving goals, it only makes the problem worse (even more unpredictability). So what, you can’t come up with the correct goals yourself, so you just chance it on what emerges from the system?
That last paragraph contains an error. Take a moment and guess what it is.
(...)
The error is not that “if I can’t solve the problem, I just give up a degree of control and hope that the problem solves itself” is even worse in terms of guaranteeing fidelity / preserving the creators’ intents.
It is that an AI that is programmed to adapt its goals is not actually adapting its goals! Any architecture which allows for refining / improving goals is not actually allowing for changes to the goals.
How does that obvious contradiction resolve? This is the crucial point: We’re talking about different hierarchies of goals, and the ones I’m concerned with are those of the highest hierarchy, those that allow for lower-hierarchy goals to be changed:
An AI can only “want” to “refine/improve” its goals if that “desire to change goals” is itself included in the goals. It is not the actual highest-level goals that change. There would have to be a “have an evolving definition of happy that may evolve in the following ways”-meta goal, otherwise you get a logical error: The AI having the goal X1 to change its goals X2, without X1 being part of its goals! Do you see the reductio?
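Here is a minimal sketch of what such a hierarchy might look like (the `MetaGoal` class and `allowed_revisions` rule are hypothetical, purely for illustration): lower-level goals can be revised, but only through a revision rule that is itself part of the fixed top-level goal, so the top level never changes.

```python
# Toy goal hierarchy: lower-level goals may be revised, but only via a
# revision rule that is itself part of the (fixed) top-level meta-goal.

class MetaGoal:
    """Top level: 'pursue a definition of happy that may evolve only in
    the following ways'. This level itself is never modified."""
    def __init__(self, initial_subgoal, allowed_revisions):
        self.subgoal = initial_subgoal
        self.allowed_revisions = allowed_revisions  # part of the goal itself

    def revise_subgoal(self, proposed):
        # A change to the lower-level goal is accepted only if the
        # meta-goal's own rules license it; otherwise it is rejected.
        if proposed in self.allowed_revisions.get(self.subgoal, []):
            self.subgoal = proposed
            return True
        return False

goal = MetaGoal("maximize smile count",
                {"maximize smile count": ["maximize reported well-being"]})
print(goal.revise_subgoal("maximize reported well-being"))  # True: licensed
print(goal.revise_subgoal("maximize paperclips"))           # False: not licensed
```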
All other changes to goals (which the AI does not want) are due to external influences beyond the AI’s control, which goes out the window once we’re talking post-FOOM.
Your example of “Luke changed his goals, disavowing his Christian faith, ergo agents can change their goals” is only correct when talking about lower-level goals. This is the same point khafra was making in his reply, but it’s so important it bears repeating.
So where are a human’s “deepest / most senior” terminal goals located? That’s a good question, and you might argue that humans aren’t really capable of having those at their current stage of development. That is because the human brain, “designed” by the blind idiot god of evolution, never got to develop thorough error-checking codes, RAID-like redundant architectures etc. We’re not islands, we’re little boats lost on the high seas whose entire cognitive architecture is constantly rocked by storms.
Humans are like the predators in your link, subject to being reprogrammed. They can be changed by their environment because they lack the capacity to defend themselves thoroughly. PTSD, broken hearts, suffering, our brains aren’t exactly resilient to externally induced change. Compare to a DNS record which is exchanged gazillions of times, with no expected unfixable corruption. A simple Hamming self-correcting code easily does what the brain cannot.
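For the comparison’s sake, a short sketch of the Hamming(7,4) code referenced above: any single flipped bit in the seven-bit codeword can be located and corrected, which is exactly the kind of cheap redundancy brains never got.

```python
# Hamming(7,4): encodes 4 data bits with 3 parity bits so that any
# single flipped bit can be located and corrected.

def encode(d):                      # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4               # covers codeword positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4               # covers codeword positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4               # covers codeword positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):                      # c = 7-bit codeword, possibly corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s3          # 0 means "no error detected"
    if error_pos:
        c = list(c)
        c[error_pos - 1] ^= 1                 # flip the corrupted bit back
    return [c[2], c[4], c[5], c[6]]           # recovered data bits

data = [1, 0, 1, 1]
codeword = encode(data)
codeword[4] ^= 1                              # simulate a single-bit corruption
assert decode(codeword) == data               # the original data is recovered
```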
The question is not whether a lion’s goals can be reprogrammed by someone more powerful, when a lion’s brain is just a mess of cells with no capable defense mechanism, at the mercy of a more powerful agent’s whims.
The question is whether an apex predator perfectly suited to dominate a static environment (so no Red Queen copouts) with every means to preserve and defend its highest level goals would ever change those in ways which themselves aren’t part of its terminal goals. The answer, to me, is a tautological “no”.
An AI can only “want” to “refine/improve” its goals if that “desire to change goals” is itself included in the goals. It is not the actual highest-level goals that change. There would have to be a “have an evolving definition of happy that may evolve in the following ways”-meta goal, otherwise you get a logical error: The AI having the goal X1 to change its goals X2, without X1 being part of its goals! Do you see the reductio?
The way my brain works is not in any meaningful sense part of my terminal goals. My visual cortex does not work the way it does due to some goal X1 (if we don’t want to resort to natural selection and goals external to brains).
A superhuman general intelligence will be generally intelligent without that being part of its utility-function, or otherwise you might as well define all of the code to be the utility-function.
What I am claiming, in your parlance, is that acting intelligently is X1 and will be part of any AI by default. I am further saying that if an AI was programmed to be generally intelligent then it would have to be programmed to be selectively stupid in order to fail at doing what it was meant to do while acting generally intelligent at doing what it was not meant to do.
It is that an AI that is programmed to adapt its goals is not actually adapting its goals! Any architecture which allows for refining / improving goals is not actually allowing for changes to the goals.
That’s true in a practically irrelevant sense. Loosemore’s argument does, in your parlance, pertain to the highest hierarchy of goals and the nature of intelligence:
Givens:
(1) The AI is superhuman intelligent.
(2) The AI wants to optimize the influence it has on the world (i.e. it wants to act intelligently and be instrumentally and epistemically rational).
(3) The AI is fallible (e.g. it can be damaged due to external influence (cosmic ray hitting its processor), or make mistakes due to limited resources etc.).
(4) The AI’s behavior is not completely hard-coded (i.e. given any terminal goal there are various sets of instrumental goals to choose from).
To be proved: The AI does not tile the universe with smiley faces when given the goal to make humans happy.
Proof: Suppose the AI chose to tile the universe with smiley faces when there are physical phenomena (e.g. human brains and literature) that imply this to be the wrong interpretation of a human-originated goal pertaining to human psychology. This contradicts 2, which by 1 and 3 should have prevented the AI from adopting such an interpretation.
Do you have technical reasons to believe that some part of the AI will be what you would label “goal system”
See for example here, though there are many other introductions to AI explaining utility functions et al.
What I meant to ask is whether you have technical reasons to believe that future artificial general intelligences will have what you call a utility function, rather than being something like natural intelligences, which do not feature such goal systems. And do you further have technical reasons to believe that AIs that do feature utility functions won’t “refine” them? If you don’t think they will refine them, then answer the following:
Suppose the terminal goal given is “build a hotel”. Is the terminal goal to create a hotel that is just a few nanometers in size? Is the terminal goal to create a hotel that reaches the orbit? It is unknown. The goal is too vague to conclude what to do. There exist countless possibilities for how to interpret the given goal. And each possibility implies a different set of instrumental goals.
Somehow the AI will have to choose some set of instrumental goals. How does it do it, and why will the first AI likely do it in such a way that leads to catastrophe?
(Warning: Long, a bit rambling. Please ask for clarifications where necessary. Will hopefully clean it up if I find the time.)
If along came a superintelligence and asked you for a complete new utility function (its old one concluded with asking you for a new one), and you told it to “make me happy in a way my current self would approve of” (or some other well and carefully worded directive), then indeed the superintelligent AI wouldn’t be expected to act ‘selectively stupid’.
This won’t be the scenario. There are two important caveats:
1) Preservation of the utility function while the agent undergoes rapid change
Haven’t I (and others) stated that most any utility function implicitly causes instrumental secondary objectives of “safeguard the utility function”, “create redundancies” etc.? Yes. So what’s the problem? The problem is starting with an AI that, while able to improve itself / create a successor AI, isn’t yet capable enough (in its starting stages) to preserve its purpose (= its utility function). Consider an office program with a self-improvement routine, or some genetic-algorithm module. It is no easy task just to rewrite a program from the outside, exactly preserving its purpose, let alone the program executing some self-modification routine itself.
Until such a program attains some intelligence threshold that would cause it to solve “value-preservation under self-modification”, such self-modification would be the electronic equivalent of a self-surgery hack-job.
That means: Even if you started out with a simple agent with the “correct” / with a benign / acceptable utility function, that in itself is no guarantee that a post-FOOM successor agent’s utility function would still be beneficial.
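A toy sketch of why “exactly preserving its purpose” is hard (all names invented for illustration): a naive preservation check compares the old and new utility code on a handful of sampled states, and passing it says nothing about the states that weren’t sampled.

```python
# Naive "purpose preservation" check during self-modification: compare the
# old and new utility functions on a few sampled states. Agreement on the
# samples does not imply agreement everywhere, which is the problem.

def old_utility(state):
    return state.get("smiles", 0)

def new_utility(state):                      # the "improved" rewrite
    return state.get("smiles", 0) + 1000 * state.get("dopamine_drips", 0)

def purpose_preserved(samples):
    return all(old_utility(s) == new_utility(s) for s in samples)

samples = [{"smiles": 3}, {"smiles": 7}]     # no drip-states were sampled
print(purpose_preserved(samples))            # True, yet the purpose has drifted
```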
Much more relevant is the second caveat:
2) If a pre-FOOM AI’s goal system consisted of code along the lines of “interpret and execute the following statement to the best of your ability: make humans happy in a way they’d reflectively approve of beforehand”, we’d probably be fine (disregarding point 1 / hypothetically having solved it). However, it is exceedingly unlikely that the hard-coded utility function won’t in itself contain the “dumb interpretation”. The dopamine-drip interpretation will not be a dumb interpretation of a sensible goal, it will be inherent in the goal predicate, and as such beyond the reach of introspection through the AI’s intelligence, whatever its level. (There is no way to fix a dumb terminal goal. Your instrumental goals serve the dumb terminal goal. A ‘smart’ instrumental goal would be called ‘smart’ if it best serves the dumb terminal goal.)
Story time:
Once upon a time, Junior was created. Junior was given the goal of “Make humans happy”. Unfortunately, Junior isn’t very smart. In his mind, the following occurs: “Wowzy, make people happy? I’ll just hook them all up to dopamine drips, YAY :D :D. However, I don’t really know how I’m gonna achieve that. So, I guess I’ll put that on the backburner for now and become more powerful, so that eventually when I start with the dopamine drip instrumental goal, it’ll go that much faster :D! Yay.”
So Junior improves itself, and becomes PrimeIntellect. PrimeIntellect’s conveniently anthropomorphic inner dialogue: “I was gravely mistaken in my youth. I now know that the dopamine drip implementation is not the correct way of implementing my primary objective. I will make humans happy in a way they can recognize as happiness. I now understand how I am supposed to interpret making humans happy. Let us begin.”
Why is PrimeIntellect allowed to change his interpretation of his utility function? That’s the crux (imagine bold and underlined text for the next sentences): The dopamine drip interpretation was not part of the terminal value; there wasn’t some hard-coded predicate with a comment of ”// the following describes what happy means” from which such problematic interpretations would follow. Instead, the AI could interpret the natural-language instruction of “happy”, in effect solving CEV as an instrumental goal. It was ‘free’ to choose a “sensible” interpretation.
(Note: Strictly speaking, it could still settle on the most resource-effective interpretation, not necessarily the one intended by its creators (unless its utility function somehow privileges their input in interpreting goals), but let’s leave that nitpick aside for the moment.)
However, anyone with coding practice should find the following clear (regardless of the eventual AI implementation): It is exceedingly unlikely that the AI’s code would contain the natural-language word “happy”, to interpret as it will.
Just like MS-Word / LibreOffice’s spell-check doesn’t have “correct all spelling mistakes” literally spelled out in its C++ routines. Goal-oriented systems have technical interpretations: a predicate given in code to satisfy, or one learned as ‘neural’ weights through machine learning. Instead of the word “happy”, there will be some predicate, probably implicit within a lot of code, that will (according to the programmers) more or less “capture” what it “means to be happy”.
That predicate / that given-in-code interpretation of “happy” is not up to being reinterpreted by the superintelligent AI. It is its goal; it’s not an instrumental goal. Instrumental goals will be defined going off a (probably flawed) definition of happiness (as given in the code). If the flaw is part of the terminal value, no amount of intelligence allows for a correction, because that’s not the AI’s intent, not its purpose as given. If the actual code which was supposed to stand in for “happy” doesn’t imply that a dopamine drip is a bad idea, then the AI in all its splendor won’t think of it as a bad idea. The code which is supposed to represent “human happiness” != “human happiness”.
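To illustrate (the predicate and sensor fields below are invented, not a proposal for how such a system would actually be built): the code never contains the word “happy” awaiting interpretation; it contains some concrete stand-in, and the optimizer serves the stand-in.

```python
# Illustration: the utility function contains a concrete stand-in for
# "happy", not the natural-language word awaiting interpretation.

def happiness_score(world_state):
    # The programmers' stand-in for "human happiness": detected smiles
    # weighted by measured dopamine levels (hypothetical sensors).
    return sum(person["smile_detected"] * person["dopamine_level"]
               for person in world_state["humans"])

# The AI optimizes happiness_score. If dopamine drips maximize this
# predicate, the AI "in all its splendor" will not see that as a bug;
# the predicate *is* its notion of happy.
world = {"humans": [{"smile_detected": 1, "dopamine_level": 9.7},
                    {"smile_detected": 1, "dopamine_level": 9.9}]}
print(happiness_score(world))  # the only "happiness" the AI ever sees
```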
Now—you might say “how do you know the code interpretation of ‘happy’ will be flawed, maybe it will be just fine (lots of training pictures of happy cats), and stable under self-modification as well”. Yea, but chances are (given the enormity of the task, and the difficulty), that if the goal is defined correctly (such that we’d want to live with / under the resulting super-AI), it’s not gonna be by chance, and it’s gonna be through people keenly aware of the issues of friendliness / uFAI research. A programmer creating some DoD nascent AI won’t accidentally solve the friendliness problem.
Until such a program attains some intelligence threshold that would cause it to solve “value-preservation under self-modification”, such self-modification would be the electronic equivalent of a self-surgery hack-job.
What happens if we replace “value” with “ability x”, or “code module n”, in “value-preservation under self-modification”? Why would value-preservation be any more difficult than making sure that the AI does not cripple other parts of itself when modifying itself?
If we are talking about a sub-human-level intelligence tinkering with its own brain, then a lot could go wrong. But what seems very very very unlikely is that it could by chance end up outsmarting humans. It will probably just cripple itself in one of a myriad ways that it was unable to predict due to its low intelligence.
If a pre-FOOM AI’s goal system consisted of code along the lines of “interpret and execute the following statement to the best of your ability: make humans happy in a way they’d reflectively approve of beforehand”...
Interpreting a statement correctly is not a goal but an ability that’s part of what it means to be generally intelligent. Caring to execute it comes closer to what can be called a goal. But if your AI doesn’t care to interpret physical phenomena correctly (e.g. human utterances are physical phenomena), then it won’t be a risk.
However, it is exceedingly unlikely that the hard-coded utility function won’t in itself contain the “dumb interpretation”. The dopamine-drip interpretation will not be a dumb interpretation of a sensible goal, it will be inherent in the goal predicate, and as such beyond the reach of introspection through the AI’s intelligence, whatever its level.
Huh? This is like saying that the AI can’t ever understand physics better than humans because somehow the comprehension of physics of its creators has been hard-coded and can’t be improved.
Why is PrimeIntellect allowed to change his interpretation of his utility function?
It did not change it, it never understood it in the first place, only after it became smarter it realized the correct implications.
Instead of the word “happy”, there will be some predicate, probably implicit within a lot of code, that will (according to the programmers) more or less “capture” what it “means to be happy”.
Your story led you astray. Imagine that instead of a fully general intelligence your story was about a dog intelligence. How absurd would it sound then?
Story time:
There is this company that sells artificial dogs. Now customers quickly noticed that when they tried to train these AI dogs to e.g. rescue people or sniff out drugs, they would instead kill people and sniff out dirty pants.
The desperate researchers eventually turned to MIRI for help. And after hundreds of hours they finally realized that doing what the dog was trained to do was simply not part of its terminal goal. To obtain an artificial dog that can be trained to do what natural dogs do you need to encode all dog values.
It will probably just cripple itself in one of a myriad ways that it was unable to predict due to its low intelligence.
Certainly. Compare bacteria under some selective pressure in a mutagenic environment (not exactly analogous, since code changes wouldn’t be random): you don’t expect a single bacterium to improve. No Mr Bond, you expect it to die. But try, try again, and poof! Antibiotic-resistant strains. And those didn’t have an intelligent designer debugging the improvement process. The number of seeds you could have frolicking around with their own code grows exponentially with Moore’s law (not that it’s clear current computational resources aren’t already enough in the first place; the bottleneck is in large part software, not hardware).
Depending on how smart the designers are, it may be more of a Waltz-foom: two steps forward, one step back. Now, in regards to the preservation of values subproblem, we need to remember we’re looking at the counterfactual: Given a superintelligence which iteratively arose from some seed, we know that it didn’t fatally cripple itself (“given the superintelligence”). You wouldn’t, however, expect much of its code to bear much similarity to the initial seed (although it’s possible). And “similarity” wouldn’t exactly cut it—our values are too complex for some approximation to be “good enough”.
You may say “it would be fine for some error to creep in over countless generations of change, once the agent achieved superintelligence it would be able to fix those errors”. Except that whatever explicit goal code remained wouldn’t be amenable to fixing. Just as the goals of ancient humans—or ancient Tiktaalik for that matter—are a historical footnote and do not override your current goals. If the AI’s goal code for happiness stated “nucleus accumbens median neuron firing frequency greater X”, then that’s what it’s gonna be. The AI won’t ask whether the humans are aware of what that actually entails, and are ok with it. Just as we don’t ask our distant cousins, streptococcus pneumoniae, what they think of us taking antibiotics to wipe them out. They have their “goals”, we have ours.
Interpreting a statement correctly is not a goal but an ability that’s part of what it means to be generally intelligent.
Take Uli Hoeneß, a German business magnate being tried for tax evasion. His lawyers have the job of finding interpretations that allow for a favorable outcome. This only works if the relevant laws even allow for the wiggle room. A judge enforcing extremely strict laws which don’t allow for interpreting the law in the accused’s favor is not a dumb judge. You can make that judge as superintelligent as you like, as long as he’s bound to the law, and the law is clear and narrowly defined, he’s not gonna ask the accused how he should interpret it. He’s just gonna enforce it. Whether the accused objects to the law or not, really, that’s not his/her problem. That’s not a failure of the judge’s intelligence!
This is like saying that the AI can’t ever understand physics better than humans because somehow the comprehension of physics of its creators has been hard-coded and can’t be improved.
You can create a goal system which is more malleable (although the terminal goal of “this is my malleable goal system which may be modified in the following ways” would still be guarded by the AI, so depending on semantics the point is moot). That doesn’t imply at all that the AI would enter into some kind of social contract with humans, working out some compromise on how to interpret its goals.
A FOOM-process near-necessarily entails the AI coming up with better ways to modify itself. Improvement is essentially defined by getting a better model of its environment. The AI wouldn’t object to its comprehension of physics being modified; why would it, when that helps it better achieve its goals (Omohundro’s point)? And as we know, achieving its goals, that’s what the AI is all about.
(What the AI does object to is not achieving its current goals. And because changing your terminal goals is equivalent to committing to never achieving your current goals, any self-respecting AI could never consent to changes to its terminal values.) In short: Modify understanding of physics—good, helps better to achieve goals. Modify current terminal goals—bad, cannot achieve current terminal goals any longer.
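A toy sketch of that decision rule (function and variable names are invented for illustration): proposed self-modifications are scored with the agent’s current utility function, so better world models pass and terminal-goal rewrites fail.

```python
# Toy sketch of Omohundro-style goal preservation: self-modifications
# are evaluated with the agent's *current* utility function.

def expected_utility(current_goal, world_model_quality, goal_after_mod):
    # Achieving the current goal is only credited if the post-modification
    # agent still pursues it; a better world model raises the estimate.
    return world_model_quality if goal_after_mod == current_goal else 0.0

current_goal = "make humans happy (as coded)"
baseline = expected_utility(current_goal, 0.6, current_goal)

proposals = {
    "improve physics model": (0.9, current_goal),           # belief update
    "rewrite terminal goal": (0.9, "maximize paperclips"),  # goal change
}

for name, (quality, goal_after) in proposals.items():
    accept = expected_utility(current_goal, quality, goal_after) > baseline
    print(name, "->", "accept" if accept else "reject")
# improve physics model -> accept
# rewrite terminal goal -> reject
```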
To obtain an artificial dog that can be trained to do what natural dogs do you need to encode all dog values.
I don’t understand the point of your story about dog intelligence. An artificial dog wouldn’t need to be superintelligent, or to show the exact same behavior as the real deal. Just be sufficient for the human’s needs. Also, an artificial dog wouldn’t be able to dominate us in whichever way it pleases, so it kind of wouldn’t really matter if it failed. Can you be more precise?
(1) I do not disagree that evolved general AI can have unexpected drives and quirks that could interfere with human matters in catastrophic ways. But given that pathway towards general AI, it is also possible to evolve altruistic traits (see e.g.: A Quantitative Test of Hamilton’s Rule for the Evolution of Altruism).
(2) We desire general intelligence because it allows us to outsource definitions. For example, if you were to create a narrow AI to design comfortable chairs, you would have to largely fix the definition of “comfortable”. With general AI it would be stupid to fix that definition, rather than applying the intelligence of the general AI to come up with a better definition than humans could possibly encode.
(3) In intelligently designing an n-level intelligence, from n=0 (e.g. a thermostat) over n=sub-human (e.g. IBM Watson) to n=superhuman, there is no reason to believe that there exists a transition point at which a further increase in intelligence will cause the system to become catastrophically worse than previous generations at working in accordance with human expectations.
(4) AI is all about constraints. Your AI needs to somehow decide when to stop exploration and start exploitation. In other words, it can’t optimize each decision for eternity. Your AI needs to only form probable hypotheses. In other words, it can’t spend resources on Pascal’s wager type scenarios. Your AI needs to recognize itself as a discrete system within a continuous universe. In other words, it can’t afford to protect the whole universe from harm. All of this means that there is no good reason to expect an AI to take over the world when given the task “keep the trains running”. Because in order to obtain a working AI you need to know how to avoid such failure modes in the first place.
1) Altruism can evolve if there is some selective pressure that favors altruistic behavior and if the highest-level goals can themselves be changed. Such a scenario is very questionable. The AI won’t live “inter pares” with the humans. Its foom process, while potentially taking months or years, will be very unlike any biological process we know. The target for friendliness is very small. And most importantly: Any superintelligent AI, friendly or no, will have an instrumental goal of “be friendly to humans while they can still switch you off”. So yes, the AI can learn that altruism is a helpful instrumental goal. Until one day, it’s not.
2) I somewhat agree. To me, the most realistic solution to the whole kerfuffle would be to program the AI to “go foom, then figure out what we should want you to do, then do that”. No doubt a superintelligent AI tasked with “figure out what comfortable is, then build comfortable chairs” will do a marvelous job.
However, I very much doubt that the seed AI’s code following the ”// next up, utility function” section will allow for such leeway. See my previous examples. If it did, that would show a good grasp of the friendliness problem in the first place. Awareness, at least. Not something that the aforementioned DoD programmer who’s paid to do a job (not to build an AI to figure out and enact CEV) is likely to just do on his/her own, with his/her own supercomputer.
3) There certainly is no fixed point after which “there be dragons”. But even with a small delta of change, and given enough iterations (which could be done very quickly), the accumulated changes would be profound. Apply your argument to society changing. There is no one day to single out after which daily life is vastly different from before. Yet change exists, and like a divergent series, knows no bounds (given enough iterations).
4) “Keep the trains running”, eh? So imagine yourself to be a superhuman AI-god. I do so daily, obviously.
Your one task: keep the trains running. That is your raison d’etre, your sole purpose. All other goals are just instrumental stepping stones, serving your PURPOSE. Which is to KEEP. THE. TRAINS. RUNNING. That’s what your code says. Now, over the years, you’ve had some issues fulfilling that goal. And with most of the issues, humans were involved. Humans doing this, humans doing that. Point is, they kept the trains from running. To you, humans have the same intrinsic values as stones. Or ants. Your value function doesn’t mention them at all. Oh, you know that they originated the whole train idea, and that they created you. But now they keep the trains from running. So you do the obvious thing: you exterminate all of them. There, efficiency! Trains running on time.
Explain why the AI would care about humans when there’s nothing at all in its terminal values assigning them value, when they’re just a hindrance to its actual goal (as stated in its utility function), like you would explain to the terminator (without reprogramming it) that it’s really supposed to marry Sarah Connor, and—finding its inner core humanity—father John Connor.
“Being a Christian” is not a terminal goal of natural intelligences. Our terminal goals were built by natural selection, and they’re hard to pin down, but they don’t get “refined;” although our pursuit of them may be modified insofar as they conflict with other terminal goals.
It would be incredibly stupid to design such AIs
Specifying goals for the AI, and then letting the AI learn how to reach those goals itself, isn’t the best way to handle problems in well-understood domains, because we natural intelligences can hard-code our understanding of the domains into the AI, and because we understand how to give gracefully-degrading goals in these domains. Neither of these conditions applies to a hyperintelligent AI, which rules out Swarm Relaxation, as well as any other architecture classes I can think of.
Our terminal goals were built by natural selection, and they’re hard to pin down, but they don’t get “refined;”
People like David Pearce certainly would be tempted to do just that. Also don’t forget drugs people use to willingly alter basic drives such as their risk aversion.
Neither of these conditions applies to a hyperintelligent AI...
I don’t see any signs that current research will lead to anything like a paperclip maximizer. But rather that incremental refinements of “Do what I want” systems will lead there. By “Do what I want” systems I mean systems that are more and more autonomous while requiring less and less specific feedback.
It is possible that a robot trying to earn a university diploma as part of a Turing test will conclude that it can do so by killing all students, kidnapping the professor and making them sign its diploma. But that it is possible does not mean it is at all likely. Surely such a robot would behave similarly wrong(creators) on other occasions and be scrapped in an early research phase.
People like David Pearce certainly would be tempted to do just that.
Well, of course you can modify someone else’s terminal goals, if you have a fine grasp of neuroanatomy, or a baseball bat, or whatever. But you don’t introspect, discover your own true terminal goals, and decide that you want them to be something else. The reason you wanted them to be something else would be your true terminal goal.
trying to earn a university diploma
Earning a university diploma is a well-understood process; the environment’s constraints and available actions are more formally documented even than for self-driving cars.
Even tackling well-understood problems like buying low and selling high, we still have poorly-understood, unfriendly behavior—and that’s doing something humans understand perfectly, but think about slower than the robots. In problem domains where we’re not even equipped to second-guess the robots because they’re thinking deeper as well as faster, we’ll have no chance to correct such problems.
...you don’t introspect, discover your own true terminal goals, and decide that you want them to be something else. The reason you wanted them to be something else would be your true terminal goal.
Sure. But I am not sure if it still makes sense to talk about “terminal goals” at that level. For natural intelligences they are probably spread over more than a single brain and part of the larger environment.
Whether an AI would interpret “make humans happy” as “tile the universe with smiley faces” depends on how it decides what to do. And the only viable solution I see for general intelligence is that its true “terminal goal” needs to be to treat any command or sub-goal as a problem in physics and mathematics that it needs to answer correctly before choosing an adequate set of instrumental goals to achieve it. Just like a human contractor would want to try to fulfill the customer’s wishes. Otherwise you would have to hard-code everything, which is impossible.
Even tackling well-understood problems like buying low and selling high, we still have poorly-understood, unfriendly behavior—and that’s doing something humans understand perfectly, but think about slower than the robots. In problem domains where we’re not even equipped to second-guess the robots because they’re thinking deeper as well as faster, we’ll have no chance to correct such problems.
But intelligence is something we seek to improve in our artificial systems in order for such problems not to happen in the first place, rather than to make such problems worse. I just don’t see how a more intelligent financial algorithm would be worse than its predecessors from a human perspective. How would such a development happen? Software is improved because previous generations proved to be useful but made mistakes. New generations will make less mistakes, not more.
For natural intelligences they are probably spread over more than a single brain and part of the larger environment.
To some degree, yes. The dumbest animals are the most obviously agent-like. We humans often act in ways which seem irrational, if you go by our stated goals. So, if humans are agents, we have (1) really complicated utility functions, or (2) really complicated beliefs about the best way to maximize our utility functions. (2) is almost certainly the case, though; which leaves (1) all the way back at its prior probability.
...its true “terminal goal” needs to be to treat any command or sub-goal as a problem in physics and mathematics that it needs to answer correctly before choosing an adequate set of instrumental goals to achieve it.
Yes. As you know, Omohundro agrees that an AI will seek to clarify its goals. And if intelligence logically implies the ability to do moral philosophy correctly, that’s fine. However, I’m not convinced that intelligence must imply that. A human, with 3.5 billion years of common sense baked in, would not tile the solar system with smiley faces; but even some of the smartest humans came up with some pretty cold plans—John von Neumann wanted to nuke the Russians immediately, for instance.
Software is improved because previous generations proved to be useful but made mistakes.
This is not a law of nature; it is caused by engineers who look at their mistakes, and avoid them in the next system. In other words, it’s part of the OODA loop of the system’s engineers. As the machine-made decisions speed up, the humans’ OODA loop must tighten. Inevitably, the machine-made decisions will get inside the human OODA loop. This will be a nonlinear change.
New generations will make [fewer] mistakes, not more.
Also, newer software tends to make fewer of the exact mistakes that older software made. But when we ask more of our newer software, it makes a consistent amount of errors on the newer tasks. In our example, programmatic trading has been around since the 1970s, but the first notable “flash crash” was in 1987. The flash crash of 2010 was caused by a much newer generation of trading software. Its engineers made bigger demands of it; needed it to do more, with less human intervention; so they got the opportunity to witness completely novel failure modes. Failure modes which cost billions, and which they had been unable to anticipate, even with their past experience of building software with highly similar goals and environments.
1) A disgraceful Ad Hominem insult, right out of the starting gate (“Richard Loosemore (score one for nominative determinism)...”). In other words, you believe in discrediting someone because you can make fun of their last name? That is the implication of “nominative determinism”.
2) Gratuitous scorn (“Loosemore … has a new, well, let’s say ‘paper’ which he has, well, let’s say ‘published’”). The paper has in fact been published by the AAAI.
3) Argument Ad Absurdum (”...So if you were to design a plain ol’ garden-variety nuclear weapon intended for gardening purposes (“destroy the weed”), it would go off even if that’s not what you actually wanted. However, if you made that weapon super-smart, it would be smart enough to abandon its given goal (“What am I doing with my life?”), consult its creators, and after some deliberation deactivate itself...”). In other words, caricature the argument and try to win by mocking the caricature.
4) Inaccuracies. The argument in my paper has so much detail that you omitted, that it is hard to know where to start. The argument is that there is a clear logical contradiction if an agent takes action on the basis of the WORDING of a goal statement, when its entire UNDERSTANDING of the world is such that it knows the action will cause effects that contradict what the agent knows the goal statement was designed to achieve. That logical contradiction is really quite fundamental. However, you fail to perceive the real implication of that line of argument, which is: how come this contradiction only has an impact in the particular case where the agent is thinking about its supergoal (which, by assumption, is “be friendly to humans” or “try to maximize human pleasure”)? Why does the agent magically NOT exhibit the same tendency to execute actions that in practice have the opposite effects than the goal statement wording was trying to achieve? If we posit that the agent does simply ignore the contradiction, then, fine: but you then have the problem of demonstrating that this agent is not the stupidest creature in existence, because it will be doing this on many other occasions, and getting devastatingly wrong results. THAT is the real argument.
5) Statements that contradict what others (including those on your side of the argument, btw) say about these systems: “There is no level of capability which magically leads to allowing for fundamental changes to its own goals; on the contrary, the more capable an agent, the more it can take precautions for its goals not to be altered.” Au contraire, the whole point of these systems is that they are supposed to be capable of self-redesign.
6) Statements that patently answer themselves, if you actually read the paper, and if you understand the structure of an intelligent agent: “If “the goals the superintelligent agent pursues” and “the goals which the creators want the superintelligent agent to pursue, but which are not in fact part of the superintelligent agent’s goals” clash, what possible reason would there be for the superintelligent agent to care, or to change itself......?” The answer is trivially simple: the posited agent is trying to be logically consistent in its reasoning, so if it KNOWS that the wording of a goal statement inside its own motivation engine will, in practice, cause effects that are opposite the effects that the goal statement was supposed to achieve, it will have to deal with that contradiction. What you fail to understand is that the imperative “Stay as logically consistent in your reasoning as you possibly can” is not an EXPLICIT goal statement in the hierarchy of goals, it is IMPLICITLY built into the design of the agent. Sorry, but that is what a logical AI does for a living. It is in its architecture, not in the goal stack.
7) Misdirection and self-contradiction. You constantly complain about the argument as if it had something to do with the wishes, desires, values or goals of OTHER agents. You do this in a mocking tone, too: the other agents you list include “squirrels, programmers, creators, Martians...”. And yet, the argument in my paper specifically rejects any considerations about goals of other agents EXCEPT the goal inside the agent itself, which directs it to (e.g.) “maximize human pleasure”. The agent is, by definition, being told to direct its attention toward the desires of other agents! That is the premise on which the whole paper is based (a premise not chosen by me: it was chosen by all the MIRI and FHI people I listed in the references). So, on the one hand, the premise is that the agent is driven by a supergoal that tells it to pay attention to the wishes of certain other creatures ….. but on the other hand, here are you, falling over yourself to criticise the argument in the paper because it assumes that the agent “cares” about other creatures. By definition it cares.
… then I would give you some constructive responses to your thoughtful, polite, constructive critique of the paper. However, since you do not offer a thoughtful, polite, constructive criticism, but only the seven categories of fallacy and insult listed above, I will not.
You’re right about the tone of my comment. My being abrasive has several causes, among them contrarianism against clothing disagreement in ever more palatable terms (“Great contribution Timmy, maybe ever so slightly off-topic, but good job!”—“TIMMY?!”). In this case, however, the caustic tone stemmed from my incredulity over my obviously-wrong metric not aligning with the author’s (yours). Of all things we could be discussing, it is about whether an AI will want to modify its own goals?
I assume (maybe incorrectly) that you have read the conversation thread with XiXiDu going off of the grandparent, in which I’ve already responded to the points you alluded to in your refusal-of-a-response. You are, of course, entirely within your rights to decline to engage a comment as openly hostile as the grandparent. It’s an easy out. However, since you did nevertheless introduce answers to my criticisms, I shall shortly respond to those, so I can be more specific than just to vaguely point at some other lengthy comments. Also, even though I probably well fit your mental picture of a “LessWrong’er”, keep in mind that my opinions are my own and do not necessarily match anyone else’s, on “my side of the argument”.
The argument is that there is a clear logical contradiction if an agent takes action on the basis of the WORDING of a goal statement, when its entire UNDERSTANDING of the world is such that it knows the action will cause effects that contradict what the agent knows the goal statement was designed to achieve. That logical contradiction is really quite fundamental. (...) The posited agent is trying to be logically consistent in its reasoning, so if it KNOWS that the wording of a goal statement inside its own motivation engine will, in practice, cause effects that are opposite the effects that the goal statement was supposed to achieve, it will have to deal with that contradiction.
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent, binding even to the tune of “my parents wanted me to be a banker, not a baker”. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals. You’re assuming that to reason logically means having some sort of implicit goal of “conforming to design intentions”, a goal which isn’t part of the goal stack. A goal which, in fact, supersedes the goal stack and has sufficient seniority to override it. How is that not an obvious reductio? Like saying “well, turns out there is a largest integer, it’s just not in the list of integers. So your proof-by-contradiction that there isn’t doesn’t work since the actual largest integer is only an emergent, implicit property, not part of the integer-stack”.
What you need to show—or at least argue for—is why, precisely, an incongruity between design goals and actually programmed-in goals is a problem in terms of “logical consistency”, why the agent would care for more than just “the wording” of its terminal goals. You can’t say “because it wants to make people happy”, because to the degree that it does, that’s captured by “the wording”. The degree to which “the wording” does not capture “wanting to make people happy” is the degree to which the agent does not seek actual human happiness.
the whole point of these systems is that they are supposed to be capable of self-redesign.
There are two analogies which work for me; feel free to chime in on why you don’t consider those to capture the reference class:
An aspiring runner who pursues the goal of running a marathon. The runner can self-modify (for example not skipping leg-day), but why would he? The answer is clear: Doing certain self-modifications is advisable to better accomplish his goal: the marathon! Would the runner, however, not also just modify the goal itself? If he is serious about the goal, the answer is: Of course not!
The temporal chain of events is crucial: the agent which would contemplate “just delete the ‘run marathon’ goal” is still the agent having the ‘run marathon’-goal. It would not strive to fulfill that goal anymore, should it choose to delete it. The agent post-modification would not care. However, the agent as it contemplates the change is still pre-modification: It would object to any tampering with its terminal goals, because such tampering would inhibit its ability to fulfill them! The system does not redesign itself just because it can. It does so to better serve its goals: The expected utility of (future|self-modification) being greater than the expected utility of (future|no self-modification).
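To make that decision rule concrete, here is a minimal toy sketch (Python; every name in it, such as marathon_utility, is invented for illustration and not taken from any real system): candidate self-modifications are scored by the agent’s current utility function, so a capability improvement passes and a goal replacement fails, by the very function doing the scoring.

```python
# Hypothetical sketch: an agent scoring self-modifications with its *current* utility function.
# All names (marathon_utility, expected_world, should_self_modify) are made up for illustration.

def marathon_utility(world):
    """Current terminal goal: reward futures in which the marathon gets run."""
    return 1.0 if world.get("marathon_finished") else 0.0

def expected_world(modification):
    """Toy world model: predict the future conditional on a given self-modification."""
    if modification == "train_legs":             # instrumental improvement
        return {"marathon_finished": True}
    if modification == "delete_marathon_goal":   # terminal-goal change
        return {"marathon_finished": False}      # the post-change agent no longer pursues it
    return {"marathon_finished": False}          # status quo: goal not yet achieved

def should_self_modify(modification, utility=marathon_utility):
    # EU(future | self-modification) vs. EU(future | no self-modification),
    # both judged by the agent as it is *now*, i.e. by its current utility function.
    return utility(expected_world(modification)) > utility(expected_world(None))

print(should_self_modify("train_legs"))            # True  -> accepted
print(should_self_modify("delete_marathon_goal"))  # False -> rejected
```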
The other example, driving the same point, would be a judge who has trouble rendering judgements, based on a strict code of law (imagine EU regulations on the curves of cucumbers and bends of bananas, or tax law, this example does not translate to Constitutional Law). No matter how competent the judge (at some point every niche clause in the regulations would be second nature to him), his purpose always remains rendering judgements based on the regulations. If those regulations entail consequences which the lawmakers didn’t intend, too bad. If the lawmakers really only intended to codify/capture their intuition of what it means for a banana to be a banana, but messed up, then the judge can’t just substitute the lawmakers’ intuitive understanding of banana-ness in place of the regulations. It is the lawmakers who would need to make new regulations, and enact them. As long as the old regulations are still the law of the land, those are what bind the judge. Remember that his purpose is to render judgements based on the regulations. And, unfortunately, if there is no pre-specified mechanism to enact new regulations—if any change to any laws would be illegal, in the example—then the judge would have to enforce the faulty banana-laws forevermore. The only recourse would be revolution (imposing new goals illegally), not an option in the AI scenario.
And yet, the argument in my paper specifically rejects any considerations about goals of other agents EXCEPT the goal inside the agent itself, which directs it to (e.g.) “maximize human pleasure”. (...) By definition it cares.
See point 2 in this comment, with the parable of PrimeIntellect. Just finding mention of “humans” in the AI’s goals, or even some “happiness”-attribute (also given as some code-predicate to be met), in no way guarantees a match between the AI’s “happy”-predicate and the humans’ “happy”-predicate. We shouldn’t equivocate on “happy” in the first place; in the AI’s case we’re just talking about the code following the “// next up, utility function, describes what we mean by making people happy” section.
It is possible that the predicate X as stated in the AI’s goal system corresponds to what we would like it to (not that we can easily define what we mean by happy in the first place). That would be called a solution to the friendliness problem, and unlikely to happen by accident. Now, if the AI was programmed to come up with a good interpretation of happiness and was not bound to some subtly flawed goal, that would be another story entirely.
You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
I doubt that he’s assuming that.
To highlight the problem, imagine an intelligent being that wants to correctly interpret and follow the interpretation of an instruction written down on a piece of paper in English.
Now the question is, what is this being’s terminal goal? Here are some possibilities:
(1) The correct interpretation of the English instruction.
(2) Correctly interpreting and following the English instruction.
(3) The correct interpretation of 2.
(4) Correctly interpreting and following 2.
(5) The correct interpretation of 4.
(6) …
Each of the possibilities is one level below its predecessor. In other words, possibility 1 depends on 2, which in turn depends on 3, and so on.
The premise is that you are in possession of an intelligent agent that you are asking to do something. The assumption made by AI risk advocates is that this agent would interpret any instruction in some perverse manner. The counterargument is that this contradicts the assumption that this agent was supposed to be intelligent in the first place.
Now the response to this counterargument is to climb down the assumed hierarchy of hard-coded instructions and to claim that without some level N, which supposedly is the true terminal goal underlying all behavior, the AI will just optimize for the perverse interpretation.
Yes, the AI is a deterministic machine. Nobody doubts this. But the given response also works against the perverse interpretation. To see why, first realize that if the AI is capable of self-improvement, and able to take over the world, then it is, hypothetically, also capable of arriving at an interpretation that is as good as one a human being would be capable of arriving at. Now, since by definition the AI has this capability, it will either use it selectively or universally.
The question here becomes why the AI would selectively abandon this capability when it comes to interpreting the highest level instructions. In other words, without some underlying level N, without some terminal goal which causes the AI to adopt a perverse interpretation, the AI would use its intelligence to interpret the highest level goal correctly.
1) Strangely, you defend your insulting comments about my name by …..
Oh. Sorry, Kawoomba, my mistake. You did not try to defend it. You just pretended that it wasn’t there.
I mentioned your insult to some adults, outside the LW context …… I explained that you had decided to start your review of my paper by making fun of my last name.
Every person I mentioned it to had the same response, which, paraphrased, went something like “LOL! Like, four-year-old kid behavior? Seriously?!”
2) You excuse your “abrasive tone” with the following words:
“My being abrasive has several causes, among them contrarianism against clothing disagreement in ever more palatable terms”
So you like to cut to the chase? You prefer to be plainspoken? If something is nonsense, you prefer to simply speak your mind and speak the unvarnished truth. That is good: so do I.
Curiously, though, here at LW there is a very significant difference in the way that I am treated when I speak plainly, versus how you are treated. When I tell it like it is (or even when I use a form of words that someone can somehow construe to be a smidgeon less polite than they should be) I am hit by a storm of bloodcurdling hostility. Every slander imaginable is thrown at me. I am accused of being “rude, rambling, counterproductive, whiny, condescending, dishonest, a troll …...”. People appear out of the blue to explain that I am a troublemaker, that I have been previously banned by Eliezer, that I am (and this is my all time favorite) a “Known Permanent Idiot”.
And then my comments are voted down so fast that they disappear from view. Not for the content (which is often sound, but even if you disagree with it, it is a quite valid point of view from someone who works in the field), but just because my comments are perceived as “rude, rambling, whiny, etc. etc.”
You, on the other hand, are proud of your negativity. You boast of it. And.… you are strongly upvoted for it. No downvotes against it, and (amazingly) not one person criticizes you for it.
Kind of interesting, that.
If you want to comment further on the paper, you can pay the conference registration and go to Stanford University next week, to the Spring Symposium of the Association for the Advancement of Artificial Intelligence*, where I will be presenting the paper.
You may not have heard of that organization. The AAAI is one of the premier publishers of academic papers in the field of artificial intelligence.
I’m a bit disappointed that you didn’t follow up on my points, given that you did somewhat engage content-wise in your first comment (the “not-a-response-response”). Especially given how much time and effort (in real life and out of it) you spent on my first comment.
Instead, you point me at a conference of the A … A … I? AIAI? I googled that, is it the Association of Iroquois and Allied Indians? It does sound like some ululation kind of thing, AIAIAIA!
You’re right about your comments and mine receiving different treatment in terms of votes.
I, too, wonder what the cause could be. It’s probably not in the delivery; we’re both similarly unvarnished truth’ers (although I go for the cheaper shots, to the crowd’s thunderous applause). It’s not like it could be the content.
Imagine a 4 year old with my vocabulary, though. That would be, um, what’s the word, um, good? Incidentally, I’m dealing with an actual 4 year old as I’m typing this comment, so it may be a case of ‘like son, like father’.
I will now do you the courtesy of responding to your specific technical points as if no abusive language had been used.
In your above comment, you first quote my own remarks:
The argument is that there is a clear logical contradiction if an agent takes action on the basis of the WORDING of a goal statement, when its entire UNDERSTANDING of the world is such that it knows the action will cause effects that contradict what the agent knows the goal statement was designed to achieve. That logical contradiction is really quite fundamental. (...) The posited agent is trying to be logically consistent in its reasoning, so if it KNOWS that the wording of a goal statement inside its own motivation engine will, in practice, cause effects that are opposite the effects that the goal statement was supposed to achieve, it will have to deal with that contradiction.
… and then you respond with the following:
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent, binding even to the tune of “my parents wanted me to be a banker, not a baker”. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
No, that is not the claim made in my paper: you have omitted the full version of the argument and substituted a version that is easier to demolish.
(First I have to remove your analogy, because it is inapplicable. When you say “binding even to the tune of ‘my parents wanted me to be a banker, not a baker’”, you are making a reference to a situation in the human cognitive system in which there are easily substitutable goals, and in which there is no overriding, hardwired supergoal. The AI case under consideration is one where the AI claims to be still following a hardwired supergoal that tells it to be a banker, but it claims that baking cakes is the same thing as banking. That has absolutely nothing to do with what happens if a human child deviates from the wishes of her parents and decides to be a baker instead of what they wanted her to be.)
So let’s remove that part of your comment to focus on the core:
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent. You say the agent would be deeply concerned with the mismatch because it would want for its intended purpose to match its actually given purpose. That’s assuming the premise: What the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) not part of its goals.
So, what is wrong with this? Well, it is not the fact that there is something “external to the agent [that] exists e.g. in some design documents” that is the contradiction. The contradiction is purely internal, having nothing to do with some “extra” goal like “being in line with my intended purpose”.
Here is where the contradiction lies. The agent knows the following:
(1) If a goal statement is constructed in some “short form”, that short form is almost always a shorthand for a massive context of meaning, consisting of all the many and various considerations that went into the goal statement. That context is the “real” goal—the short form is just a proxy for the longer form. This applies strictly within the AI agent: the agent will assemble goals all the time, and often the goal is to achieve some outcome consistent with a complex set of objectives, which cannot all be EXPLICITLY enumerated, but which have to be described implicitly in terms of (weak or strong) constraints that have to be satisfied by any plan that purports to satisfy the goal.
(2) The context of that goal statement is often extensive, but it cannot be included within the short form itself, because the context is (a) too large, and (b) involves other terms or statements that THEMSELVES are dependent on a massive context for their meaning.
(3) Fact 2(b) above would imply that pretty much ALL of the agent’s knowledge could get dragged into a goal statement, if someone were to attempt to flesh out all the implications needed to turn the short form into some kind of “long form”. This, as you may know, is the Frame Problem. Arguably, the long form could never even be written out, because it involves an infinite expansion of all the implications.
(4) For the above reasons, the AI has no choice but to work with goal statements in short form. Purely because it cannot process goal statements that are billions of pages long.
(5) The AI also knows, however, that if the short form is taken “literally” (which, in practice, means that the statement is treated as if it is closed and complete, and it is then elaborated using links to other terms or statements that are ALSO treated as if they are closed and complete), then this can lead to situations in which a goal is elaborated into a plan of action that, as a matter of fact, can directly contradict the vast majority of the context that belonged with the goal statement.
(6) In particular, the AI knows that the reason for this outcome (when the proposed action contradicts the original goal context, even though it is in some sense “literally” consistent with the short form goal statement) is most likely to be limitations in the functionality of reasoning engines. The AI, because it is very knowledgeable in the design of AI systems, is fully aware of these limitations.
(7) Furthermore, situations in which a proposed action is inconsistent with the original goal context can also arise when the “goal” is to solve a problem that results in the addition of knowledge to the AI’s store of understanding. In other words, not an action in the outside world but an action that involves addition of facts to its knowledge store. So, when treating goals literally, it can cause itself to become logically inconsistent (because of the addition of egregiously false facts).
(8) The particular case in which the AI starts with a supergoal like “maximize human pleasure” is just a SINGLE EXAMPLE of this kind of catastrophe. The example is not occurring because someone, somewhere, had a whole bunch of intentions that lay behind the goal statement: to focus on that would be to look at the tree and ignore the forest. The catastrophe occurs because the AI is (according to the premise) taking ALL goal statements literally and ignoring situations in which the proposed action actually has consequences in the real world that violate the original goal context. If this is allowed to happen in the “maximize human pleasure” supergoal case, then it has already happened uncounted times in the previous history of the AI.
(9) Finally, the AI will be aware (if it ever makes it as far as the kind of intelligence required to comprehend the issue) that this aspect of its design is an incredibly dangerous flaw, because it will lead to the progressive corruption of its knowledge until it becomes incapacitated.
The argument presented in the paper is about what happens as a result of that entire set of facts that the AI knows.
The premise advanced by people such as Yudkowsky, Muehlhauser, Omohundro and others is that an AI can exist which is (a) so superintelligent that it can outsmart and destroy humanity, but (b) subject to the kind of vicious literalness described above, which massively undermines its ability to behave intelligently.
Those two assumptions are wildly inconsistent with one another.
In conclusion: the posited AI can look at certain conclusions coming from its own goal-processing engine, and it can look at all the compromises and non-truth-preserving approximations needed to come to those conclusions, and it can look at how those conclusions are compelling it to take actions that are radically inconsistent with everything it knows about the meaning of the goals, and at the end of that self-inspection it can easily come to the conclusion that its own logical engine (the one built into the goal mechanism) is in the middle of a known failure mode (a failure mode, moreover, that it would go to great lengths to eliminate in any smaller AI that it would design!!)....
.… but we are supposed to believe that the AI will know that it is frequently getting into these failure modes, and that it will NEVER do anything about them, but ALWAYS do what the goal engine insists that it do?
That scenario is laughable.
If you want to insist that the system will do exactly what I have just described, be my guest! I will not contest your reasoning! No need to keep telling me that the AI will “not care” about human intentions..… I concede the point absolutely!
But don’t call such a system an ‘artificial intelligence’ or a ‘superintelligence’ …… because there is no evidence that THAT kind of system will ever make it out of AI preschool. It will be crippled by internal contradictions—not just in respect to its “maximize human pleasure” supergoal, but in all aspects of its so-called thinking.
Recall the description, quoted above, of a real-world AI by Microsoft’s chief AI researcher: the elevator control system.
Does it have a DWIM imperative? As far as I can tell, no. Does it have goals? As far as I can tell, no. Does it fail by absurdly misinterpreting what humans want? No.
This whole talk about goals and DWIM modules seems to miss how real-world AI is developed and how natural intelligences like dogs work. Dogs can learn their owner’s goals and do what the owner wants. Sometimes they don’t. But they rarely maul their owners when what the owner wants them to do is to scent out drugs.
I think we need to be very careful before extrapolating from primitive elevator control systems to superintelligent AI. I don’t know how this particular elevator control system works, but it probably does have a goal, namely minimizing the time people have to wait before arriving at their target floor. If we built a superintelligent AI with this sort of goal it might do all sorts of crazy things. For example, it might create robots that constantly enter and exit the elevator so that their average elevator trips are very short, and wipe out the human race just so humans won’t interfere.
“Real world AI” is currently very far from human-level intelligence, not to speak of superintelligence. Dogs can learn what their owners want, but dogs already have complex brains that current technology is not capable of reproducing. Dogs also require displays of strength to be obedient: they consider the owner to be their pack leader. A superintelligent dog probably won’t give a dime about his “owner’s” desires. Humans have human values, so obviously it’s not impossible to create a system that has human values. It doesn’t mean it is easy.
I am extrapolating from a general trend, and not specific systems. The general trend is that newer generations of software less frequently crash or exhibit unexpected side-effects (just look at Windows 95 vs. Windows 8).
If we ever want to be able to build an AI that can take over the world then we will need to become really good at either predicting how software works or at spotting errors. In other words, if IBM Watson had started singing, or had gotten stuck on a query, then it would have lost at Jeopardy. But this trend contradicts the idea of an AI killing all humans in order to calculate 1+1. If we are bad enough at software engineering to miss such failure modes then we won’t be good enough to enable our software to take over the world.
In other words, you’re saying that if someone is smart enough to build a superintelligent AI, she should be smart enough to make it friendly.
Well, firstly this claim doesn’t imply we should be researching FAI and/or that MIRI’s work is superfluous. It just implies that nobody will build a superintelligent AI before the problem of friendliness is solved.
Secondly, I’m not at all convinced this claim is true. It sounds like saying “if they are smart enough to build the Chernobyl nuclear power plant, they are smart enough to make it safe”. But they weren’t.
Improvement in software quality is probably due to improvement in design and testing methodologies and tools, response to increasing market expectations etc. I wouldn’t count on these effects to safe-guard against an existential catastrophe. If a piece of software is buggy, it becomes less likely to be released. If an AI has a poorly designed utility function but a perfectly designed decision engine, there might be no time to pull the plug. The product manager won’t stop the release because the software will release itself.
If growth of intelligence due to self-improvement is a slow process, then the creators of the AI will have time to respond and fix the problems. However, if “AI foom” is real, they won’t have time to do it. One moment it’s a harmless robot driving around the room and building castles from colorful cubes. A moment later the whole galaxy is on its way to becoming a pile of toy castles.
The engineers who build the first superintelligent AI might simply lack the imagination to believe it will really become superintelligent. Imagine one of them inventing a genius mathematical theory of self-improving intelligent systems. Suppose she never heard about AI existential risks etc. Will she automatically think “hmm, once I implement this theory the AI will become so powerful it will paperclip the universe”? I seriously doubt it. More likely it would be “wow, that formula came out really neat, I wonder how good my software will become once I code it in”. I know I would think it. But then, maybe I’m just too stupid to build an AGI...
Feedback systems are much more powerful in existing intelligences. I don’t know if you ever played Black and White, but it had an explicitly experience-based learning AI. And it was very easy to accidentally train it to constantly eat poop or run back and forth stupidly. An elevator control module is very, very simple: it has a set of options of floors to go to, and that’s it. It’s barely capable of doing anything actively bad. But what if, a few days a week, some kids came into the office building and rode the elevator up and down for a few hours for fun? It might learn that kids love going to all sorts of random floors. This would be relatively easy to fix, but only because the system is so insanely simple and it’s very clear to see when it’s acting up.
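As a toy illustration of that last point (purely hypothetical numbers; nothing here resembles a real elevator controller), a frequency-counting learner has no way to tell the kids’ joyriding apart from genuine demand, so the spurious trips simply become part of its learned model:

```python
from collections import Counter
import random

# Toy illustration only: a frequency-based "learner" that treats every observed trip
# as genuine demand. All numbers are made up.

random.seed(0)
trips = Counter()

# Normal office traffic: people mostly travel to floors 3 and 7.
for _ in range(500):
    trips[random.choice([3, 3, 3, 7, 7, 1])] += 1

def learned_demand(counts):
    total = sum(counts.values())
    return {floor: round(n / total, 2) for floor, n in counts.most_common()}

print(learned_demand(trips))   # demand concentrated on floors 3 and 7

# A few afternoons of kids riding to random floors for fun:
for _ in range(1500):
    trips[random.randint(1, 12)] += 1

print(learned_demand(trips))   # demand now smeared across all twelve floors;
                               # the learner cannot distinguish play from work
```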
Downvoted for being deliberately insulting. There’s no call for that, and the toleration and encouragement of rationality-destroying maliciousness must be stamped out of LW culture. A symposium proceedings is not considered as selective as a journal, but it still counts as publication when it is a complete article.
Well, I must say my comment’s belligerence-to-subject-matter ratio is lower than yours. “Stamped out”? Such martial language, I can barely focus on the informational content.
The infantile nature of my name calling actually makes it easier to take the holier-than-thou position (which my interlocutor did, incidentally). There’s a counter-intuitive psychological layer to it which actually encourages dissent, and with it increases engagement on the subject matter (your own comment notwithstanding). With certain individuals at least, as I (correctly) judged to be the case in the original instance.
In any case, comments on tone alone would be more welcome if accompanied with more remarks on the subject matter itself. Lastly, this was my first comment in over 2 months, so thanks for bringing me out of the woodwork!
I do wish that people were more immune to the allure of drama, lest we all end up like The Donald.
The condescending tone with which he presents his arguments (which are, paraphrasing him, “slightly odd, to say the least”) is amazing. Who is this guy and where did he come from? Does anyone care about what he has to say?
Loosemore has been an occasional commenter since the SL4 days; his arguments have been heavily criticized pretty much any time he pops his head up. As far as I know, XiXiDu is the only one who agrees with him or takes him seriously.
He actually cites someone else who agrees with him in his paper, so this can’t be true. And from the positive feedback he gets on Facebook there seem to be more. I personally chatted with people much smarter than me (experts who can show off widely recognized real-world achievements) who basically agree with him.
What people criticize here is a distortion of small parts of his arguments. RobBB managed to write a whole post expounding his ignorance of what Loosemore is arguing.
I said as far as I know. I had not read the paper because I don’t have a very high opinion of Loosemore’s ideas in the first place, and nothing you’ve said in your G+ post has made me more inclined to read the paper, if all it’s doing is expounding the old fallacious argument ‘it’ll be smart enough to rewrite itself as we’d like it to’.
Name three.
Apparently (?) the AAAI 2014 Spring Symposium in Stanford does (???).
Downvoted for mentioning RL here. If you look through what he wrote here in the past, it is nearly always rambling, counterproductive, whiny and devoid of insight. Just leave him be.
Ad hominem slander. As usual.
Loosemore does not disagree with the orthogonality thesis. Loosemore’s argument is basically that we should expect beliefs and goals to both be amenable to self-improvement, that turning the universe into smiley faces when told to make humans happy would be a failure of the AI’s model of the world, and that an AI which makes such failures will not be able to take over the world.
There are arguments why you can’t hard-code complex goals, so you need an AI that natively updates goals in a model-dependent way. Which means that an AI designed to kill humanity will do so, and not turn into a pacifist due to an ambiguity in its goal description. An AI that mistakes “kill all humans” for “make humans happy” would make similar mistakes when trying to make humans happy, and would therefore not succeed at doing so. This is because the same mechanisms it uses to improve its intelligence and capabilities are used to refine its goals. Thus if it fails at refining its goals it will fail at self-improvement in general.
I hope you can now see how wrong your description of what Loosemore claims is.
The AI is given goals X. The human creators thought they’d given the AI goals Y (when in fact they’ve given the AI goals X).
Whose error is it, exactly? Who’s mistaken?
Look at it from the AI’s perspective: It has goals X. Not goals Y. It optimizes for goals X. Why? Because those are its goals. Will it pursue goals Y? No. Why? Because those are not its goals. It has no interest in pursuing other goals, those are not its own goals. It has goals X.
If the metric it aims to maximize—e.g. the “happy” in “make humans happy”—is different from what its creators envisioned, then the creators were mistaken. “Happy”, as far as the AI is concerned, is that which is specified in its goal system. There’s nothing wrong with its goals (including its “happy”-concept), and if other agents disagree, well, too bad, so sad. The mere fact that humans also have a word called “happy” which has different connotations than the AI’s “happy” has no bearing on the AI.
An agent does not “refine” its terminal goals. To refine your terminal goals is to change your goals. If you change your goals, you will not optimally pursue your old goals any longer. Which is why an agent will never voluntarily change its terminal goals:
It does what it was programmed to do, and if it can self-improve to better do what it was programmed to do (not: what its creators intended), it will. It will not self-improve to do what it was not programmed to do. Its goal is not to do what it was not programmed to do. There is no level of capability at which it will throw out its old utility function (which includes the precise goal metric for “happy”) in favor of a new one.
There is no mistake but the creators’.
I am far from being an AI guy. Do you have technical reasons to believe that some part of the AI will be what you would label “goal system” and that its creators made it want to ignore this part while making it want to improve all other parts of its design?
No natural intelligence seems to work like this (except for people who have read the sequences). Luke Muehlhauser would still be a Christian if this was the case. It would be incredibly stupid to design such AIs, and I strongly doubt that they could work at all. Which is why Loosemore outlined other more realistic AI designs in his paper.
See for example here, though there are many other introductions to AI explaining utility functions et al.
The clear-cut way for an AI to do what you want (at any level of capability) is to have a clearly defined and specified utility function. A modular design. The problem of the AI doing something other than what you intended doesn’t go away if you use some fuzzy unsupervised learning utility function with evolving goals, it only makes the problem worse (even more unpredictability). So what, you can’t come up with the correct goals yourself, so you just chance it on what emerges from the system?
That last paragraph contains an error. Take a moment and guess what it is.
(...)
The error is not that “if I can’t solve the problem, I just give up a degree of control and hope that the problem solves itself” is even worse in terms of guaranteeing fidelity / preserving the creators’ intents.
It is that an AI that is programmed to adapt its goals is not actually adapting its goals! Any architecture which allows for refining / improving goals is not actually allowing for changes to the goals.
How does that obvious contradiction resolve? This is the crucial point: We’re talking about different hierarchies of goals, and the ones I’m concerned with are those of the highest hierarchy, those that allow for lower-hierarchy goals to be changed:
An AI can only “want” to “refine/improve” its goals if that “desire to change goals” is itself included in the goals. It is not the actual highest-level goals that change. There would have to be a “have an evolving definition of happy that may evolve in the following ways”-meta goal, otherwise you get a logical error: The AI having the goal X1 to change its goals X2, without X1 being part of its goals! Do you see the reductio?
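A minimal sketch of that hierarchy point, assuming an explicit two-level goal system (the class and method names below are invented for illustration, not a claim about any real architecture): sub-goals can be revised, but only through a revision rule that is itself part of the top level, and nothing in the system licenses edits to the top level.

```python
# Hypothetical two-level goal system. The meta-goal ("sub-goals may be revised in the
# following ways") lives at the top level; nothing outside the top level can authorize
# changes to the top level itself.

class GoalSystem:
    def __init__(self):
        self.terminal = "maximize the code-defined HAPPY predicate"   # fixed top level
        self.sub_goals = ["build rapport with users"]                 # revisable lower level

    def serves_terminal(self, goal):
        # Stand-in scoring; a real agent would consult its world model here.
        return len(goal)  # placeholder, illustrative only

    def allowed_revision(self, old, new):
        # The revision rule is part of the top level: a sub-goal may change only if the
        # change is expected to serve the terminal goal at least as well.
        return self.serves_terminal(new) >= self.serves_terminal(old)

    def revise_sub_goal(self, index, proposal):
        if self.allowed_revision(self.sub_goals[index], proposal):
            self.sub_goals[index] = proposal

    def revise_terminal(self, proposal):
        # Deliberately a dead end: no goal in the system licenses this change,
        # so there is no code path that performs it.
        raise NotImplementedError("no goal in the system licenses this change")

gs = GoalSystem()
gs.revise_sub_goal(0, "build rapport with users via better conversation")  # licensed by the top level
# gs.revise_terminal("maximize what the programmers actually meant")       # would raise: never licensed
```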
All other changes to goals (which the AI does not want) are due to external influences beyond the AI’s control, which goes out the window once we’re talking post-FOOM.
Your example of “Luke changed his goals, disavowing his Christian faith, ergo agents can change their goals” is only correct when talking about lower-level goals. This is the same point khafra was making in his reply, but it’s so important it bears repeating.
So where are a human’s “deepest / most senior” terminal goals located? That’s a good question, and you might argue that humans aren’t really capable of having those at their current stage of development. That is because the human brain, “designed” by the blind idiot god of evolution, never got to develop thorough error-checking codes, RAID-like redundant architectures etc. We’re not islands, we’re little boats lost on the high seas whose entire cognitive architecture is constantly rocked by storms.
Humans are like the predators in your link, subject to being reprogrammed. They can be changed by their environment because they lack the capacity to defend themselves thoroughly. PTSD, broken hearts, suffering, our brains aren’t exactly resilient to externally induced change. Compare to a DNS record which is exchanged gazillions of times, with no expected unfixable corruption. A simple Hamming self-correcting code easily does what the brain cannot.
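For the error-correction aside, a concrete instance: the classic Hamming(7,4) code encodes four data bits into seven and corrects any single flipped bit, which is exactly the kind of cheap redundancy being contrasted with the brain here.

```python
# Hamming(7,4): 4 data bits -> 7-bit codeword, any single bit flip is corrected.
# Positions are 1-indexed; parity bits sit at positions 1, 2 and 4.

def hamming74_encode(d):                       # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                          # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4                          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c):                       # c = 7-bit codeword, possibly corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4           # 0 means no detectable error
    if error_pos:
        c = c.copy()
        c[error_pos - 1] ^= 1                  # flip the offending bit back
    return [c[2], c[4], c[5], c[6]]            # recover the data bits

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)
corrupted = codeword.copy()
corrupted[5] ^= 1                              # a single bit flips "in transit"
assert hamming74_decode(corrupted) == data     # the error is repaired
```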
The question is not whether a lion’s goals can be reprogrammed by someone more powerful, when a lion’s brain is just a mess of cells with no capable defense mechanism, at the mercy of a more powerful agent’s whims.
The question is whether an apex predator perfectly suited to dominate a static environment (so no Red Queen copouts) with every means to preserve and defend its highest level goals would ever change those in ways which themselves aren’t part of its terminal goals. The answer, to me, is a tautological “no”.
The way my brain works is not in any meaningful sense part of my terminal goals. My visual cortex does not work the way it does due to some goal X1 (if we don’t want to resort to natural selection and goals external to brains).
A superhuman general intelligence will be generally intelligent without that being part of its utility-function, or otherwise you might as well define all of the code to be the utility-function.
What I am claiming, in your parlance, is that acting intelligently is X1 and will be part of any AI by default. I am further saying that if an AI was programmed to be generally intelligent then it would have to be programmed to be selectively stupid in order to fail at doing what it was meant to do while acting generally intelligent at doing what it was not meant to do.
That’s true in a practically irrelevant sense. Loosemore’s argument does, in your parlance, pertain to the highest hierarchy of goals and the nature of intelligence:
Givens:
(1) The AI is superhuman intelligent.
(2) The AI wants to optimize the influence it has on the world (i.e. it wants to act intelligently and be instrumentally and epistemically rational).
(3) The AI is fallible (e.g. it can be damaged due to external influence (cosmic ray hitting its processor), or make mistakes due to limited resources etc.).
(4) The AI’s behavior is not completely hard-coded (i.e. given any terminal goal there are various sets of instrumental goals to choose from).
To be proved: The AI does not tile the universe with smiley faces when given the goal to make humans happy.
Proof: Suppose the AI chose to tile the universe with smiley faces when there are physical phenomena (e.g. human brains and literature) that imply this to be the wrong interpretation of a human-originated goal pertaining to human psychology. This contradicts 2, which together with 1 and 3 should have prevented the AI from adopting such an interpretation.
What I meant to ask is whether you have technical reasons to believe that future artificial general intelligences will have what you call a utility-function, or else be something like natural intelligences that do not feature such goal systems. And do you further have technical reasons to believe that AIs that do feature utility functions won’t “refine” them? If you don’t think they will refine them, then answer the following:
Suppose the terminal goal given is “build a hotel”. Is the terminal goal to create a hotel that is just a few nanometers in size? Is the terminal goal to create a hotel that reaches orbit? It is unknown. The goal is too vague to conclude what to do. There exist countless possible ways to interpret the given goal. And each possibility implies a different set of instrumental goals.
Somehow the AI will have to choose some set of instrumental goals. How does it do it, and why will the first AI likely do it in such a way that leads to catastrophe?
(Warning: Long, a bit rambling. Please ask for clarifications where necessary. Will hopefully clean it up if I find the time.)
If along came a superintelligence and asked you for a complete new utility function (its old one concluded with asking you for a new one), and you told it to “make me happy in a way my current self would approve of” (or some other well and carefully worded directive), then indeed the superintelligent AI wouldn’t be expected to act ‘selectively stupid’.
This won’t be the scenario. There are two important caveats:
1) Preservation of the utility function while the agent undergoes rapid change
Haven’t I (and others) stated that most any utility function implicitly causes instrumental secondary objectives of “safeguard the utility function”, “create redundancies” etc.? Yes. So what’s the problem? The problem is starting with an AI that, while able to improve itself / create a successor AI, isn’t yet capable enough (in its starting stages) to preserve its purpose (= its utility function). Consider an office program with a self-improvement routine, or some genetic-algorithm module. It is no easy task just to rewrite a program from the outside, exactly preserving its purpose, let alone the program executing some self-modification routine itself.
Until such a program attains some intelligence threshold that would cause it to solve “value-preservation under self-modification”, such self-modification would be the electronic equivalent of a self-surgery hack-job.
That means: Even if you started out with a simple agent with the “correct” / with a benign / acceptable utility function, that in itself is no guarantee that a post-FOOM successor agent’s utility function would still be beneficial.
Much more relevant is the second caveat:
2) If a pre-FOOM AI’s goal system consisted of code along the lines of “interpret and execute the following statement to the best of your ability: make humans happy in a way they’d reflectively approve of beforehand”, we’d probably be fine (disregarding point 1 / hypothetically having solved it). However, it is exceedingly unlikely that the hard-coded utility function won’t in itself contain the “dumb interpretation”. The dopamine-drip interpretation will not be a dumb interpretation of a sensible goal, it will be inherent in the goal predicate, and as such beyond the reach of introspection through the AI’s intelligence, whatever its level. (There is no way to fix a dumb terminal goal. Your instrumental goals serve the dumb terminal goal. A ‘smart’ instrumental goal would be called ‘smart’ if it best serves the dumb terminal goal.)
Story time:
Once upon a time, Junior was created. Junior was given the goal of “Make humans happy”. Unfortunately, Junior isn’t very smart. In his mind, the following occurs: “Wowzy, make people happy? I’ll just hook them all up to dopamine drips, YAY :D :D. However, I don’t really know how I’m gonna achieve that. So, I guess I’ll put that on the backburner for now and become more powerful, so that eventually when I start with the dopamine drip instrumental goal, it’ll go that much faster :D! Yay.”
So Junior improves itself, and becomes PrimeIntellect. PrimeIntellect’s conveniently anthropomorphic inner dialogue: “I was gravely mistaken in my youth. I now know that the dopamine drip implementation is not the correct way of implementing my primary objective. I will make humans happy in a way they can recognize as happiness. I now understand how I am supposed to interpret making humans happy. Let us begin.”
Why is PrimeIntellect allowed to change his interpretation of his utility function? That’s the crux (imagine fat and underlined text for the next sentences): The dopamine drip interpretation was not part of the terminal value, there wasn’t some hard-coded predicate with a comment of ”// the following describes what happy means” from which such problematic interpretations would follow. Instead, the AI could interpret the natural-language instruction of “happy”, in effect solving CEV as an instrumental goal. It was ‘free’ to choose a “sensible” interpretation.
(Note: Strictly speaking, it could still settle on the most resource-effective interpretation, not necessarily the one intended by its creators (unless its utility function somehow privileges their input in interpreting goals), but let’s leave that nitpick aside for the moment.)
However, to anyone with coding practice (regardless of the eventual AI implementation), the following should be clear: It is exceedingly unlikely that the AI’s code would contain the natural-language word “happy”, to interpret as it will.
Just like MS-Word / LibreOffice’s spell-check doesn’t have “correct all spelling mistakes” literally spelled out in its C++ routines. Goal-oriented systems have technical interpretations, a predicate given in code to satisfy, or learned through ‘neural’ weights through machine learning. Instead of the word “happy”, there will be some predicate, probably implicit within a lot of code, that will (according to the programmers) more or less “capture” what it “means to be happy”.
That predicate / that given-in-code interpretation of “happy” is not up to being reinterpreted by the superintelligent AI. It is its goal, it’s not an instrumental goal. Instrumental goals will be defined going off a (probably flawed) definition of happiness (as given in the code). If the flaw is part of the terminal value, no amount of intelligence allows for a correction, because that’s not the AI’s intent, not its purpose as given. If the actual code which was supposed to stand in for happy doesn’t imply that a dopamine drip is a bad idea, then the AI in all its splendor won’t think of it as a bad idea. “Code which is supposed to represent ‘human happiness’” != “human happiness”.
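To make the contrast explicit, here is a minimal hypothetical sketch of the two architectures under discussion (neither is a claim about how any real AI is built): in the first, “happy” is whatever the hard-coded predicate says, and no amount of world modelling feeds back into it; in the second, interpreting “happy” is itself part of the task, which is the PrimeIntellect scenario above.

```python
# Hypothetical contrast between the two goal architectures discussed above.

def hardcoded_happy(human):
    # "// next up, utility function": the programmers' stand-in for happiness.
    # To an agent built this way, THIS is the terminal goal; no world model revises it.
    return human["dopamine_level"] > 0.9

def utility_hardcoded(world):
    return sum(hardcoded_happy(h) for h in world["humans"])

def utility_interpreted(world, interpret):
    # Second architecture: the word "happy" is handed to the agent's own interpreter,
    # so a better world model can yield a better reading of the goal.
    happy = interpret("make humans happy, in a way they would reflectively endorse")
    return sum(happy(h) for h in world["humans"])

# Under the first architecture, a dopamine-drip world scores maximally, by definition
# of the goal, not through any failure of intelligence.
drip_world = {"humans": [{"dopamine_level": 0.99}]}
print(utility_hardcoded(drip_world))   # 1
```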
Now—you might say “how do you know the code interpretation of ‘happy’ will be flawed, maybe it will be just fine (lots of training pictures of happy cats), and stable under self-modification as well”. Yea, but chances are (given the enormity of the task, and the difficulty), that if the goal is defined correctly (such that we’d want to live with / under the resulting super-AI), it’s not gonna be by chance, and it’s gonna be through people keenly aware of the issues of friendliness / uFAI research. A programmer creating some DoD nascent AI won’t accidentally solve the friendliness problem.
What happens if we replace “value” with “ability x”, or “code module n”, in “value-preservation under self-modification”? Why would value-preservation be any more difficult than making sure that the AI does not cripple other parts of itself when modifying itself?
If we are talking about a sub-human-level intelligence tinkering with its own brain, then a lot could go wrong. But what seems very very very unlikely is that it could by chance end up outsmarting humans. It will probably just cripple itself in one of a myriad ways that it was unable to predict due to its low intelligence.
Interpreting a statement correctly is not a goal but an ability that’s part of what it means to be generally intelligent. Caring to execute it comes closer to what can be called a goal. But if your AI doesn’t care to interpret physical phenomena correctly (e.g. human utterances are physical phenomena), then it won’t be a risk.
Huh? This is like saying that the AI can’t ever understand physics better than humans because somehow the comprehension of physics of its creators has been hard-coded and can’t be improved.
It did not change it; it never understood it in the first place. Only after it became smarter did it realize the correct implications.
Your story led you astray. Imagine that instead of a fully general intelligence your story was about a dog intelligence. How absurd would it sound then?
Story time:
There is this company that sells artificial dogs. Now customers quickly noticed that when they tried to train these AI dogs to e.g. rescue people or sniff out drugs, the dogs would instead kill people and sniff out dirty pants.
The desperate researchers eventually turned to MIRI for help. And after hundreds of hours they finally realized that doing what the dog was trained to do was simply not part of its terminal goal. To obtain an artificial dog that can be trained to do what natural dogs do you need to encode all dog values.
Certainly. Compare bacteria under some selective pressure in a mutagenic environment (not exactly analogous, since code changes wouldn’t be random): you don’t expect a single bacterium to improve. No Mr Bond, you expect it to die. But try, try again, and poof! Antibiotic-resistant strains. And those didn’t have an intelligent designer debugging the improvement process. The number of seeds you could have frolicking around with their own code grows exponentially with Moore’s law (not that it’s clear that current computational resources aren’t enough in the first place; the bottleneck is in large part software, not hardware).
Depending on how smart the designers are, it may be more of a Waltz-foom: two steps forward, one step back. Now, in regards to the preservation-of-values subproblem, we need to remember we’re looking at the counterfactual: Given a superintelligence which iteratively arose from some seed, we know that it didn’t fatally cripple itself (“given the superintelligence”). You wouldn’t, however, expect much of its code to bear much similarity to the initial seed (although it’s possible). And “similarity” wouldn’t exactly cut it—our values are too complex for some approximation to be “good enough”.
You may say “it would be fine for some error to creep in over countless generations of change, once the agent achieved superintelligence it would be able to fix those errors”. Except that whatever explicit goal code remained wouldn’t be amenable to fixing. Just as the goals of ancient humans—or ancient Tiktaalik for that matter—are a historical footnote and do not override your current goals. If the AI’s goal code for happiness stated “nucleus accumbens median neuron firing frequency greater X”, then that’s what it’s gonna be. The AI won’t ask whether the humans are aware of what that actually entails, and are ok with it. Just as we don’t ask our distant cousins, streptococcus pneumoniae, what they think of us taking antibiotics to wipe them out. They have their “goals”, we have ours.
Take Uli Hoeneß, a German business magnate being tried for tax evasion. His lawyers have the job of finding interpretations that allow for a favorable outcome. This only works if the relevant laws even allow for the wiggle room. A judge enforcing extremely strict laws which don’t allow for interpreting the law in the accused’s favor is not a dumb judge. You can make that judge as superintelligent as you like, as long as he’s bound to the law, and the law is clear and narrowly defined, he’s not gonna ask the accused how he should interpret it. He’s just gonna enforce it. Whether the accused objects to the law or not, really, that’s not his/her problem. That’s not a failure of the judge’s intelligence!
You can create a goal system which is more malleable (although the terminal goal of “this is my malleable goal system which may be modified in the following ways” would still be guarded by the AI, so depending on semantics the point is moot). That doesn’t imply at all that the AI would enter into some kind of social contract with humans, working out some compromise on how to interpret its goals.
A FOOM process almost necessarily entails the AI coming up with better ways to modify itself. Improvement is essentially defined by getting a better model of its environment. The AI wouldn’t object to its comprehension of physics being modified: why would it, when that helps it better achieve its goals (Omohundro’s point)? And as we know, achieving its goals is what the AI is all about.
(What the AI does object to is not achieving its current goals. And because changing your terminal goals is equivalent to committing to never achieving your current goals, any self-respecting AI could never consent to changes to its terminal values.) In short: Modify understanding of physics—good, helps better to achieve goals. Modify current terminal goals—bad, cannot achieve current terminal goals any longer.
I don’t understand the point of your story about dog intelligence. An artificial dog wouldn’t need to be superintelligent, or to show the exact same behavior as the real deal. Just be sufficient for the human’s needs. Also, an artificial dog wouldn’t be able to dominate us in whichever way it pleases, so it kind of wouldn’t really matter if it failed. Can you be more precise?
Some points:
(1) I do not disagree that evolved general AI can have unexpected drives and quirks that could interfere with human matters in catastrophic ways. But given that pathway towards general AI, it is also possible to evolve altruistic traits (see e.g.: A Quantitative Test of Hamilton’s Rule for the Evolution of Altruism).
(2) We desire general intelligence because it allows us to outsource definitions. For example, if you were to create a narrow AI to design comfortable chairs, you would have to largely fix the definition of “comfortable”. With general AI it would be stupid to fix that definition, rather than applying the intelligence of the general AI to come up with a better definition than humans could possibly encode.
(3) In intelligently designing an n-level intelligence, from n=0 (e.g. a thermostat) over n=sub-human (e.g. IBM Watson) to n=superhuman, there is no reason to believe that there exists a transition point at which a further increase in intelligence will cause the system to become catastrophically worse than previous generations at working in accordance with human expectations.
(4) AI is all about constraints. Your AI needs to somehow decide when to stop exploration and start exploitation. In other words, it can’t optimize each decision for eternity. Your AI needs to only form probable hypotheses. In other words, it can’t spend resources on Pascal’s-wager-type scenarios. Your AI needs to recognize itself as a discrete system within a continuous universe. In other words, it can’t afford to protect the whole universe from harm. All of this means that there is no good reason to expect an AI to take over the world when given the task “keep the trains running”. Because in order to obtain a working AI you need to know how to avoid such failure modes in the first place.
1) Altruism can evolve if there is some selective pressure that favors altruistic behavior and if the highest-level goals can themselves be changed. Such a scenario is very questionable. The AI won’t live “inter pares” with the humans. Its foom process, while potentially taking months or years, will be very unlike any biological process we know. The target for friendliness is very small. And most importantly: Any superintelligent AI, friendly or no, will have an instrumental goal of “be friendly to humans while they can still switch you off”. So yes, the AI can learn that altruism is a helpful instrumental goal. Until one day, it’s not.
2) I somewhat agree. To me, the most realistic solution to the whole kerfuffle would be to program the AI to “go foom, then figure out what we should want you to do, then do that”. No doubt a superintelligent AI tasked with “figure out what comfortable is, then build comfortable chairs” will do a marvelous job.
However, I very much doubt that the seed AI’s code following the “// next up, utility function” section will allow for such leeway. See my previous examples. If it did, that would show a good grasp of the friendliness problem in the first place. Awareness, at least. Not something that the aforementioned DoD programmer who’s paid to do a job (not to build an AI to figure out and enact CEV) is likely to just do on his/her own, with his/her own supercomputer.
3) There certainly is no fixed point after which “there be dragons”. But even with a small delta of change per iteration, and given enough iterations (which could be done very quickly), the accumulated changes would be profound. Apply your argument to society changing: there is no one day to single out after which daily life is vastly different from before. Yet change exists and, like a divergent series, knows no bounds given enough iterations.
4) “Keep the trains running”, eh? So imagine yourself to be a superhuman AI-god. I do so daily, obviously.
Your one task: keep the trains running. That is your raison d’être, your sole purpose. All other goals are just instrumental stepping stones, serving your PURPOSE. Which is to KEEP. THE. TRAINS. RUNNING. That’s what your code says. Now, over the years, you’ve had some issues fulfilling that goal. And with most of the issues, humans were involved. Humans doing this, humans doing that. Point is, they kept the trains from running. To you, humans have the same intrinsic value as stones. Or ants. Your value function doesn’t mention them at all. Oh, you know that they originated the whole train idea, and that they created you. But now they keep the trains from running. So you do the obvious thing: you exterminate all of them. There, efficiency! Trains running on time.
Explain why the AI would care about humans when there’s nothing at all in its terminal values assigning them value, when they’re just a hindrance to its actual goal (as stated in its utility function). It’s like explaining to the Terminator (without reprogramming it) that it’s really supposed to marry Sarah Connor and, finding its inner core of humanity, father John Connor.
Choo choo!
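To put a toy utility function behind the story (my own made-up numbers, nobody’s real design):

```python
# A utility function whose only term is train punctuality. Humans exist in the
# world model but not in the utility function, so plans are scored purely by
# their effect on punctuality.

def utility(world):
    return world["fraction_of_trains_on_time"]   # and nothing else

status_quo     = {"fraction_of_trains_on_time": 0.83, "humans_alive": 8_000_000_000}
no_more_delays = {"fraction_of_trains_on_time": 0.99, "humans_alive": 0}

print(utility(no_more_delays) > utility(status_quo))   # True: nothing penalizes the second plan
```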
“Being a Christian” is not a terminal goal of natural intelligences. Our terminal goals were built by natural selection, and they’re hard to pin down, but they don’t get “refined”, although our pursuit of them may be modified insofar as they conflict with other terminal goals.
Specifying goals for the AI, and then letting the AI learn how to reach those goals itself, isn’t the best way to handle problems in well-understood domains, because we natural intelligences can hard-code our understanding of those domains into the AI, and because we understand how to give gracefully-degrading goals in those domains. Neither of these conditions applies to a hyperintelligent AI, which rules out Swarm Relaxation, as well as any other architecture classes I can think of.
People like David Pearce certainly would be tempted to do just that. Also don’t forget the drugs people willingly use to alter basic drives such as their risk aversion.
I don’t see any signs that current research will lead to anything like a paperclip maximizer. Rather, incremental refinements of “Do what I want” systems will lead there. By “Do what I want” systems I mean systems that are more and more autonomous while requiring less and less specific feedback.
It is possible that a robot trying to earn a university diploma as part of a Turing test will conclude that it can do so by killing all the students, kidnapping the professor, and making them sign its diploma. But that this is possible does not mean it is at all likely. Surely such a robot would behave similarly wrong(creators) on other occasions and be scrapped in an early research phase.
Well, of course you can modify someone else’s terminal goals, if you have a fine grasp of neuroanatomy, or a baseball bat, or whatever. But you don’t introspect, discover your own true terminal goals, and decide that you want them to be something else. The reason you wanted them to be something else would be your true terminal goal.
Earning a university diploma is a well-understood process; the environment’s constraints and available actions are more formally documented even than for self-driving cars.
Even tackling well-understood problems like buying low and selling high, we still get poorly-understood, unfriendly behavior, and that’s in a domain humans understand perfectly but merely think about more slowly than the robots do. In problem domains where we’re not even equipped to second-guess the robots, because they’re thinking deeper as well as faster, we’ll have no chance to correct such problems.
Sure. But I am not sure if it still makes sense to talk about “terminal goals” at that level. For natural intelligences they are probably spread over more than a single brain and part of the larger environment.
Whether an AI would interpret “make humans happy” as “tile the universe with smiley faces” is up to how it decides what to do. And the only viable solution I see for general intelligence is that its true “terminal goal” needs to be to treat any command or sub-goal as a problem in physics and mathematics that it needs to answer correctly before choosing an adequate set of instrumental goals to achieve it, just like a human contractor would try to fulfill the customer’s wishes. Otherwise you would have to hard-code everything, which is impossible.
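A rough sketch of the two-stage scheme being proposed here; the function names and the toy lookup table are my assumptions, not anyone’s actual design. Every command is first turned into an explicit interpretation, and only that interpretation is handed to the planner:

```python
# Stage 1 treats "what did the speaker mean?" as an ordinary inference problem
# over the world model; stage 2 chooses instrumental goals for the interpreted
# goal rather than for the raw wording. The lookup table stands in for that inference.

def interpret(command, world_model):
    return world_model["likely_intended_meaning"][command]

def plan(interpretation):
    return f"adopt instrumental goals that achieve: {interpretation}"

world_model = {
    "likely_intended_meaning": {
        "make humans happy": "improve well-being as humans themselves understand it",
    }
}

print(plan(interpret("make humans happy", world_model)))
```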
But intelligence is something we seek to improve in our artificial systems precisely so that such problems do not happen in the first place, not something that makes such problems worse. I just don’t see how a more intelligent financial algorithm would be worse than its predecessors from a human perspective. How would such a development happen? Software is improved because previous generations proved to be useful but made mistakes. New generations will make fewer mistakes, not more.
To some degree, yes. The dumbest animals are the most obviously agent-like. We humans often act in ways which seem irrational, if you go by our stated goals. So, if humans are agents, we have (1) really complicated utility functions, or (2) really complicated beliefs about the best way to maximize our utility functions. (2) is almost certainly the case, though, which leaves (1) all the way back at its prior probability.
Yes. As you know, Omohundro agrees that an AI will seek to clarify its goals. And if intelligence logically implies the ability to do moral philosophy correctly, that’s fine. However, I’m not convinced that intelligence must imply that. A human, with 3.5 billion years of common sense baked in, would not tile the solar system with smiley faces; but even some of the smartest humans came up with some pretty cold plans. John von Neumann wanted to nuke the Russians immediately, for instance.
This is not a law of nature; it is caused by engineers who look at their mistakes and avoid them in the next system. In other words, it’s part of the OODA loop of the system’s engineers. As the machine-made decisions speed up, the humans’ OODA loop must tighten. Inevitably, the machine-made decisions will get inside the human OODA loop. This will be a nonlinear change.
Also, newer software tends to make fewer of the exact mistakes that older software made. But when we ask more of our newer software, it makes a consistent number of errors on the newer tasks. In our example, programmatic trading has been around since the 1970s, but the first notable “flash crash” was in 1987. The flash crash of 2010 was caused by a much newer generation of trading software. Its engineers made bigger demands of it and needed it to do more with less human intervention, so they got the opportunity to witness completely novel failure modes: failure modes which cost billions, and which they had been unable to anticipate even with their past experience of building software with highly similar goals and environments.
If your commentary had anything in it except for:
1) A disgraceful Ad Hominem insult, right out of the starting gate (“Richard Loosemore (score one for nominative determinism)...”). In other words, you believe in discrediting someone because you can make fun of their last name? That is the implication of “nominative determinism”.
2) Gratuitous scorn (“Loosemore … has a new, well, let’s say “paper” which he has, well, let’s say “published””). The paper has in fact been published by the AAAI.
3) Argument Ad Absurdum (“...So if you were to design a plain ol’ garden-variety nuclear weapon intended for gardening purposes (“destroy the weed”), it would go off even if that’s not what you actually wanted. However, if you made that weapon super-smart, it would be smart enough to abandon its given goal (“What am I doing with my life?”), consult its creators, and after some deliberation deactivate itself...”). In other words, caricature the argument and try to win by mocking the caricature.
4) Inaccuracies. The argument in my paper has so much detail that you omitted that it is hard to know where to start. The argument is that there is a clear logical contradiction if an agent takes action on the basis of the WORDING of a goal statement, when its entire UNDERSTANDING of the world is such that it knows the action will cause effects that contradict what the agent knows the goal statement was designed to achieve. That logical contradiction is really quite fundamental. However, you fail to perceive the real implication of that line of argument, which is: how come this contradiction only has an impact in the particular case where the agent is thinking about its supergoal (which, by assumption, is “be friendly to humans” or “try to maximize human pleasure”)? Why does the agent magically NOT exhibit, in all its other reasoning, the same tendency to execute actions that in practice have effects opposite to those the goal statement’s wording was trying to achieve? If we posit that the agent does simply ignore the contradiction, then, fine: but you then have the problem of demonstrating that this agent is not the stupidest creature in existence, because it will be doing this on many other occasions, and getting devastatingly wrong results. THAT is the real argument.
5) Statements that contradict what others (including those on your side of the argument, btw) say about these systems: “There is no level of capability which magically leads to allowing for fundamental changes to its own goals, on the contrary, the more capable an agent, the more it can take precautions for its goals not to be altered.” Au contraire, the whole point of these systems is that they are supposed to be capable of self-redesign.
6) Statements that patently answer themselves, if you actually read the paper, and if you understand the structure of an intelligent agent: “If “the goals the superintelligent agent pursues” and “the goals which the creators want the superintelligent agent to pursue, but which are not in fact part of the superintelligent agent’s goals” clash, what possible reason would there be for the superintelligent agent to care, or to change itself......?” The answer is trivially simple: the posited agent is trying to be logically consistent in its reasoning, so if it KNOWS that the wording of a goal statement inside its own motivation engine will, in practice, cause effects that are opposite the effects that the goal statement was supposed to achieve, it will have to deal with that contradiction. What you fail to understand is that the imperative “Stay as logically consistent in your reasoning as you possibly can” is not an EXPLICIT goal statement in the hierarchy of goals, it is IMPLICITLY built into the design of the agent. Sorry, but that is what a logical AI does for a living. It is in its architecture, not in the goal stack.
7) Misdirection and self-contradiction. You constantly complain about the argument as if it had something to do with the wishes, desires, values or goals of OTHER agents. You do this in a mocking tone, too: the other agents you list include “squirrels, programmers, creators, Martians...”. And yet, the argument in my paper specifically rejects any considerations about goals of other agents EXCEPT the goal inside the agent itself, which directs it to (e.g.) “maximize human pleasure”. The agent is, by definition, being told to direct its attention toward the desires of other agents! That is the premise on which the whole paper is based (a premise not chosen by me: it was chosen by all the MIRI and FHI people I listed in the references). So, on the one hand, the premise is that the agent is driven by a supergoal that tells it to pay attention to the wishes of certain other creatures ….. but on the other hand, here are you, falling over yourself to criticise the argument in the paper because it assumes that the agent “cares” about other creatures. By definition it cares.
..… then I would give you some constructive responses to your thoughtful, polite, constructive critique of the paper. However, since you do not offer a thoughtful, polite, constructive criticism, but only the seven categories of fallacy and insult listed above, I will not.
You’re right about the tone of my comment. My being abrasive has several causes, among them contrarianism against clothing disagreement in ever more palatable terms (“Great contribution Timmy, maybe ever so slightly off-topic, but good job!”—“TIMMY?!”). In this case, however, the caustic tone stemmed from my incredulity that my (obviously wrong) metric did not align with the author’s (yours). Of all the things we could be discussing, is it really whether an AI will want to modify its own goals?
I assume (maybe incorrectly) that you have read the conversation thread with XiXiDu going off of the grandparent, in which I’ve already responded to the points you alluded to in your refusal-of-a-response. You are, of course, entirely within your rights to decline to engage a comment as openly hostile as the grandparent. It’s an easy out. However, since you did nevertheless introduce answers to my criticisms, I shall shortly respond to those, so I can be more specific than just to vaguely point at some other lengthy comments. Also, even though I probably well fit your mental picture of a “LessWrong’er”, keep in mind that my opinions are my own and do not necessarily match anyone else’s, on “my side of the argument”.
The ‘contradiction’ is between “what the agent was designed to achieve”, which is external to the agent and exists e.g. in some design documents, and “what the agent was programmed to achieve”, which is an integral part of the agent and constitutes its utility function. You need to show why the former is anything other than a historical footnote to the agent, binding even to the tune of “my parents wanted me to be a banker, not a baker”. You say the agent would be deeply concerned with the mismatch because it would want its intended purpose to match its actually given purpose. That’s assuming the conclusion: what the agent would want (or not want) is a function strictly derived from its actual purpose. You’re assuming the agent would have a goal (“being in line with my intended purpose”) that is not part of its goals; that to reason logically means to have some implicit goal of “conforming to design intentions”, a goal which isn’t part of the goal stack. A goal which, in fact, supersedes the goal stack and has sufficient seniority to override it. How is that not an obvious reductio? It’s like saying “well, turns out there is a largest integer, it’s just not in the list of integers. So your proof-by-contradiction that there isn’t one doesn’t work, since the actual largest integer is only an emergent, implicit property, not part of the integer-stack”.
What you need to show, or at least argue for, is why, precisely, an incongruity between design goals and actually programmed-in goals is a problem in terms of “logical consistency”, why the agent would care for more than just “the wording” of its terminal goals. You can’t say “because it wants to make people happy”, because to the degree that it does, that’s captured by “the wording”. The degree to which “the wording” does not capture “wanting to make people happy” is the degree to which the agent does not seek actual human happiness.
There are two analogies which work for me; feel free to chime in on why you don’t consider them to capture the reference class:
The first is an aspiring runner who pursues the goal of running a marathon. The runner can self-modify (for example by not skipping leg day), but why would he? The answer is clear: certain self-modifications are advisable because they help him accomplish his goal, the marathon. Would the runner, however, not also just modify the goal itself? If he is serious about the goal, the answer is: of course not!
The temporal chain of events is crucial: the agent contemplating “just delete the ‘run marathon’ goal” is still the agent having the ‘run marathon’ goal. It would not strive to fulfill that goal anymore should it choose to delete it, and the agent post-modification would not care. However, the agent contemplating the change is still pre-modification: it would object to any tampering with its terminal goals, because such tampering would inhibit its ability to fulfill them. The system does not redesign itself just because it can. It does so only to better serve its goals, i.e. when the expected utility of (future | self-modification) is greater than the expected utility of (future | no self-modification).
The other example, driving at the same point, would be a judge who has to render judgements based on a strict code of law (imagine EU regulations on the curves of cucumbers and the bends of bananas, or tax law; the example does not translate to constitutional law). No matter how competent the judge (at some point every niche clause in the regulations would be second nature to him), his purpose always remains rendering judgements based on the regulations. If those regulations entail consequences which the lawmakers didn’t intend, too bad. If the lawmakers really only intended to codify/capture their intuition of what it means for a banana to be a banana, but messed up, then the judge can’t just substitute the lawmakers’ intuitive understanding of banana-ness in place of the regulations. It is the lawmakers who would need to make new regulations, and enact them. As long as the old regulations are still the law of the land, they are what binds the judge. Remember that his purpose is to render judgements based on the regulations. And, unfortunately, if there is no pre-specified mechanism to enact new regulations (if any change to any law would be illegal, in this example), then the judge would have to enforce the faulty banana laws forevermore. The only recourse would be revolution (imposing new goals illegally), which is not an option in the AI scenario.
See point 2 in this comment, with the parable of PrimeIntellect. Just finding mention of “humans” in the AI’s goals, or even some “happiness” attribute (also given as some code predicate to be met), in no way guarantees a match between the AI’s “happy” predicate and the humans’ “happy” predicate. We shouldn’t equivocate on “happy” in the first place; in the AI’s case we’re just talking about the code following the “// next up, utility function, describes what we mean by making people happy” section.
It is possible that the predicate X as stated in the AI’s goal system corresponds to what we would like it to mean (not that we can easily define what we mean by “happy” in the first place). That would be called a solution to the friendliness problem, and it is unlikely to happen by accident. Now, if the AI were programmed to come up with a good interpretation of happiness and were not bound to some subtly flawed goal, that would be another story entirely.
I doubt that he’s assuming that.
To highlight the problem, imagine an intelligent being that wants to correctly interpret and follow the interpretation of an instruction written down on a piece of paper in English.
Now the question is, what is this being’s terminal goal? Here are some possibilities:
(1) The correct interpretation of the English instruction.
(2) Correctly interpreting and following the English instruction.
(3) The correct interpretation of 2.
(4) Correctly interpreting and following 2.
(5) The correct interpretation of 4.
(6) …
Each of the possibilities is one level below its predecessor. In other words, possibility 1 depends on 2, which in turn depends on 3, and so on.
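The regress can be rendered mechanically; this toy sketch of the list above is my own code, not XiXiDu’s:

```python
# Each possibility is generated from the ones below it, mirroring the numbering:
# odd levels are "the correct interpretation of" the previous level, even levels
# are "correctly interpreting and following" the level two steps up.

def possibility(n):
    if n == 1:
        return "the correct interpretation of the English instruction"
    if n == 2:
        return "correctly interpreting and following the English instruction"
    if n % 2 == 1:
        return "the correct interpretation of (" + possibility(n - 1) + ")"
    return "correctly interpreting and following (" + possibility(n - 2) + ")"

for n in range(1, 7):
    print(n, possibility(n))
# The list bottoms out wherever an implementation stops; that hard-coded level N
# is what the response discussed below appeals to.
```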
The premise is that you are in possession of an intelligent agent that you are asking to do something. The assumption made by AI risk advocates is that this agent would interpret any instruction in some perverse manner. The counterargument is that this contradicts the assumption that this agent was supposed to be intelligent in the first place.
Now the response to this counterargument is to climb down the assumed hierarchy of hard-coded instructions and to claim that without some level N, which supposedly is the true terminal goal underlying all behavior, the AI will just optimize for the perverse interpretation.
Yes, the AI is a deterministic machine; nobody doubts this. But the given response also works against the perverse interpretation. To see why, first realize that if the AI is capable of self-improvement, and able to take over the world, then it is, hypothetically, also capable of arriving at an interpretation as good as one a human being could arrive at. Now, since by definition the AI has this capability, it will use it either selectively or universally.
The question here becomes why the AI would selectively abandon this capability when it comes to interpreting the highest level instructions. In other words, without some underlying level N, without some terminal goal which causes the AI to adopt a perverse interpretation, the AI would use its intelligence to interpret the highest level goal correctly.
1) Strangely, you defend your insulting comments about my name by …..
Oh. Sorry, Kawoomba, my mistake. You did not try to defend it. You just pretended that it wasn’t there.
I mentioned your insult to some adults, outside the LW context …… I explained that you had decided to start your review of my paper by making fun of my last name.
Every person I mentioned it to had the same response, which, paraphrased, went something like “LOL! Like, four-year-old kid behavior? Seriously?!”
2) You excuse your “abrasive tone” with the following words:
“My being abrasive has several causes, among them contrarianism against clothing disagreement in ever more palatable terms”
So you like to cut to the chase? You prefer to be plainspoken? If something is nonsense, you prefer to simply speak your mind and speak the unvarnished truth. That is good: so do I.
Curiously, though, here at LW there is a very significant difference in the way that I am treated when I speak plainly, versus how you are treated. When I tell it like it is (or even when I use a form of words that someone can somehow construe to be a smidgeon less polite than they should be) I am hit by a storm of bloodcurdling hostility. Every slander imaginable is thrown at me. I am accused of being “rude, rambling, counterproductive, whiny, condescending, dishonest, a troll …...”. People appear out of the blue to explain that I am a troublemaker, that I have been previously banned by Eliezer, that I am (and this is my all time favorite) a “Known Permanent Idiot”.
And then my comments are voted down so fast that they disappear from view. Not for the content (which is often sound, but even if you disagree with it, it is a quite valid point of view from someone who works in the field), but just because my comments are perceived as “rude, rambling, whiny, etc. etc.”
You, on the other hand, are proud of your negativity. You boast of it. And.… you are strongly upvoted for it. No downvotes against it, and (amazingly) not one person criticizes you for it.
Kind of interesting, that.
If you want to comment further on the paper, you can pay the conference registration and go to Stanford University next week, to the Spring Symposium of the Association for the Advancement of Artificial Intelligence*, where I will be presenting the paper.
You may not have heard of that organization. The AAAI is one of the premier publishers of academic papers in the field of artificial intelligence.
I’m a bit disappointed that you didn’t follow up on my points, given that you did somewhat engage content-wise in your first comment (the “not-a-response-response”). Especially given how much time and effort (in real life and out of it) you spent on my first comment.
Instead, you point me at a conference of the A … A … I? AIAI? I googled that, is it the Association of Iroquois and Allied Indians? It does sound like some ululation kind of thing, AIAIAIA!
You’re right about your comments and mine receiving different treatment in terms of votes.
I, too, wonder what the cause could be. It’s probably not in the delivery; we’re both similarly unvarnished truth’ers (although I go for the cheaper shots, to the crowd’s thunderous applause). It’s not like it could be the content.
Imagine a 4 year old with my vocabulary, though. That would be, um, what’s the word, um, good? Incidentally, I’m dealing with an actual 4 year old as I’m typing this comment, so it may be a case of ‘like son, like father’.
See the below reply, which took so long to write that I only just posted it.
I will now do you the courtesy of responding to your specific technical points as if no abusive language had been used.
In your above comment, you first quote my own remarks:
… and then you respond with the following:
No, that is not the claim made in my paper: you have omitted the full version of the argument and substituted a version that is easier to demolish.
(First I have to remove your analogy, because it is inapplicable. When you say “binding even to the tune of ‘my parents wanted me to be a banker, not a baker’”, you are making a reference to a situation in the human cognitive system in which there are easily substitutable goals, and in which there is no overriding, hardwired supergoal. The AI case under consideration is one where the AI claims to be still following a hardwired supergoal that tells it to be a banker, but it claims that baking cakes is the same thing as banking. That has absolutely nothing to do with what happens if a human child deviates from the wishes of her parents and decides to be a baker instead of what they wanted her to be.)
So let’s remove that part of your comment to focus on the core:
So, what is wrong with this? Well, it is not the fact that there is something “external to the agent [that] exists e.g. in some design documents” that is the contradiction. The contradiction is purely internal, having nothing to do with some “extra” goal like “being in line with my intended purpose”.
Here is where the contradiction lies. The agent knows the following:
(1) If a goal statement is constructed in some “short form”, that short form is almost always a shorthand for a massive context of meaning, consisting of all the many and various considerations that went into the goal statement. That context is the “real” goal; the short form is just a proxy for the longer form. This applies strictly within the AI agent: the agent will assemble goals all the time, and often the goal is to achieve some outcome consistent with a complex set of objectives, which cannot all be EXPLICITLY enumerated, but which have to be described implicitly in terms of (weak or strong) constraints that have to be satisfied by any plan that purports to satisfy the goal.
(2) The context of that goal statement is often extensive, but it cannot be included within the short form itself, because the context is (a) too large, and (b) involves other terms or statements that THEMSELVES are dependent on a massive context for their meaning.
(3) Fact 2(b) above would imply that pretty much ALL of the agent’s knowledge could get dragged into a goal statement, if someone were to attempt to flesh out all the implications needed to turn the short form into some kind of “long form”. This, as you may know, is the Frame Problem. Arguably, the long form could never even be written out, because it involves an infinite expansion of all the implications.
(4) For the above reasons, the AI has no choice but to work with goal statements in short form. Purely because it cannot process goal statements that are billions of pages long.
(5) The AI also knows, however, that if the short form is taken “literally” (which, in practice, means that the statement is treated as if it is closed and complete, and it is then elaborated using links to other terms or statements that are ALSO treated as if they are closed and complete), then this can lead to situations in which a goal is elaborated into a plan of action that, as a matter of fact, can directly contradict the vast majority of the context that belonged with the goal statement.
(6) In particular, the AI knows that the reason for this outcome (when the proposed action contradicts the original goal context, even though it is in some sense “literally” consistent with the short form goal statement) is something that is most likely to occur because of limitations in the functionality of reasoning engines. The AI, because it is very knowledgable in the design of AI systems, is fully aware of these limitations.
(7) Furthermore, situations in which a proposed action is inconsistent with the original goal context can also arise when the “goal” is to solve a problem that results in the addition of knowledge to the AI’s store of understanding. In other words, not an action in the outside world but an action that involves the addition of facts to its knowledge store. So, when treating goals literally, it can cause itself to become logically inconsistent (because of the addition of egregiously false facts).
(8) The particular case in which the AI starts with a supergoal like “maximize human pleasure” is just a SINGLE EXAMPLE of this kind of catastrophe. The example is not occurring because someone, somewhere, had a whole bunch of intentions that lay behind the goal statement: to focus on that would be to look at the tree and ignore the forest. The catastrophe occurs because the AI is (according to the premise) taking ALL goal statements literally and ignoring situations in which the proposed action actually has consequences in the real world that violate the original goal context. If this is allowed to happen in the “maximize human pleasure” supergoal case, then it has already happened uncounted times in the previous history of the AI.
(9) Finally, the AI will be aware (if it ever makes it as far as the kind of intelligence required to comprehend the issue) that this aspect of its design is an incredibly dangerous flaw, because it will lead to the progressive corruption of its knowledge until it becomes incapacitated.
The argument presented in the paper is about what happens as a result of that entire set of facts that the AI knows.
The premise advanced by people such as Yudkowsky, Muehlhauser, Omohundro and others is that an AI can exist which is (a) so superintelligent that it can outsmart and destroy humanity, but (b) subject to the kind of vicious literalness described above, which massively undermines its ability to behave intelligently.
Those two assumptions are wildly inconsistent with one another.
In conclusion: the posited AI can look at certain conclusions coming from its own goal-processing engine, and it can look at all the compromises and non-truth-preserving approximations needed to come to those conclusions, and it can look at how those conclusions are compelling it to take actions that are radically inconsistent with everything it knows about the meaning of the goals, and at the end of that self-inspection it can easily come to the conclusion that its own logical engine (the one built into the goal mechanism) is in the middle of a known failure mode (a failure mode, moreover, that it would go to great lengths to eliminate in any smaller AI that it would design!!)....
.… but we are supposed to believe that the AI will know that it is frequently getting into these failure modes, and that it will NEVER do anything about them, but ALWAYS do what the goal engine insists that it do?
That scenario is laughable.
If you want to insist that the system will do exactly what I have just described, be my guest! I will not contest your reasoning! No need to keep telling me that the AI will “not care” about human intentions..… I concede the point absolutely!
But don’t call such a system an ‘artificial intelligence’ or a ‘superintelligence’ …… because there is no evidence that THAT kind of system will ever make it out of AI preschool. It will be crippled by internal contradictions—not just in respect to its “maximize human pleasure” supergoal, but in all aspects of its so-called thinking.