Do you have technical reasons to believe that some part of the AI will be what you would label “goal system”
See for example here, though there are many other introductions to AI explaining utility functions et al.
and that its creators made it want to ignore this part while making it want to improve all other parts of its design?
The clear-cut way for an AI to do what you want (at any level of capability) is to have a clearly defined and specified utility function. A modular design. The problem of the AI doing something other than what you intended doesn’t go away if you use some fuzzy unsupervised learning utility function with evolving goals, it only makes the problem worse (even more unpredictability). So what, you can’t come up with the correct goals yourself, so you just chance it on what emerges from the system?
That last paragraph contains an error. Take a moment and guess what it is.
(...)
The error is not that “if I can’t solve the problem, I just give up a degree of control and hope that the problem solves itself” is even worse in terms of guaranteeing fidelity / preserving the creators’ intents.
It is that an AI that is programmed to adapt its goals is not actually adapting its goals! Any architecture which allows for refining / improving goals is not actually allowing for changes to the goals.
How does that obvious contradiction resolve? This is the crucial point: We’re talking about different levels in the hierarchy of goals, and the ones I’m concerned with are those at the highest level, the ones that allow lower-level goals to be changed:
An AI can only “want” to “refine/improve” its goals if that “desire to change goals” is itself included in the goals. It is not the actual highest-level goals that change. There would have to be a “have an evolving definition of happy that may evolve in the following ways”-meta goal, otherwise you get a logical error: The AI having the goal X1 to change its goals X2, without X1 being part of its goals! Do you see the reductio?
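To make that hierarchy concrete, here is a toy sketch in Python (all names made up for illustration, not a claim about any real architecture): the only way lower-level goals ever change is through a revision rule that is itself part of the fixed top-level goal.

```python
# Toy sketch (hypothetical names): lower-level goals only ever change through
# a revision rule that is itself part of the fixed top-level goal.

class Agent:
    def __init__(self):
        # Highest-level goal: the agent itself never rewrites this.
        # It *includes* the rule saying how lower-level goals may evolve.
        self.terminal_goal = {
            "objective": "make humans happy",
            "allowed_refinement": lambda g: "humans" in g and "happy" in g,
        }
        self.instrumental_goals = ["learn what 'happy' means"]

    def propose_goal_change(self, new_goal: str) -> bool:
        # The "desire to change goals" exists only because the terminal goal
        # contains a meta-rule permitting certain changes (the X1 above).
        if self.terminal_goal["allowed_refinement"](new_goal):
            self.instrumental_goals.append(new_goal)  # lower-level change: fine
            return True
        return False                                  # everything else: rejected

agent = Agent()
print(agent.propose_goal_change("ask humans what would make them happy"))  # True
print(agent.propose_goal_change("tile the universe with smiley faces"))    # False
```

The agent edits only the lower tier, and only in the ways the top tier permits; `terminal_goal` itself is never up for revision.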
All other changes to goals (which the AI does not want) are due to external influences beyond the AI’s control, which goes out the window once we’re talking post-FOOM.
Your example of “Luke changed his goals, disavowing his Christian faith, ergo agents can change their goals” is only correct when talking about lower-level goals. This is the same point khafra was making in his reply, but it’s so important it bears repeating.
So where are a human’s “deepest / most senior” terminal goals located? That’s a good question, and you might argue that humans aren’t really capable of having those at their current stage of development. That is because the human brain, “designed” by the blind idiot god of evolution, never got to develop thorough error-checking codes, RAID-like redundant architectures etc. We’re not islands, we’re little boats lost on the high seas whose entire cognitive architecture is constantly rocked by storms.
Humans are like the predators in your link, subject to being reprogrammed. They can be changed by their environment because they lack the capacity to defend themselves thoroughly. PTSD, broken hearts, suffering, our brains aren’t exactly resilient to externally induced change. Compare to a DNS record which is exchanged gazillions of times, with no expected unfixable corruption. A simple Hamming self-correcting code easily does what the brain cannot.
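Since the Hamming claim is concrete, here is the standard Hamming(7,4) construction as a toy sketch (textbook code, nothing specific to this discussion): any single flipped bit is detected and repaired, which is exactly the kind of self-protection brains lack.

```python
# Hamming(7,4): 4 data bits, 3 parity bits; corrects any single flipped bit.

def encode(d):                      # d = [d1, d2, d3, d4]
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4               # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4               # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4               # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(c):                      # c = received 7-bit word, possibly corrupted
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s3      # 1-indexed position of the flipped bit, 0 = none
    if pos:
        c = c[:]
        c[pos - 1] ^= 1             # repair the damage
    return [c[2], c[4], c[5], c[6]] # recovered data bits

word = encode([1, 0, 1, 1])
word[5] ^= 1                        # external corruption: flip one bit
assert decode(word) == [1, 0, 1, 1] # original data recovered exactly
```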
The question is not whether a lion’s goals can be reprogrammed by someone more powerful, when a lion’s brain is just a mess of cells with no capable defense mechanism, at the mercy of a more powerful agent’s whims.
The question is whether an apex predator perfectly suited to dominate a static environment (so no Red Queen copouts) with every means to preserve and defend its highest level goals would ever change those in ways which themselves aren’t part of its terminal goals. The answer, to me, is a tautological “no”.
An AI can only “want” to “refine/improve” its goals if that “desire to change goals” is itself included in the goals. It is not the actual highest-level goals that change. There would have to be a “have an evolving definition of happy that may evolve in the following ways”-meta goal, otherwise you get a logical error: The AI having the goal X1 to change its goals X2, without X1 being part of its goals! Do you see the reductio?
The way my brain works is not in any meaningful sense part of my terminal goals. My visual cortex does not work the way it does due to some goal X1 (if we don’t want to resort to natural selection and goals external to brains).
A superhuman general intelligence will be generally intelligent without that being part of its utility-function, or otherwise you might as well define all of the code to be the utility-function.
What I am claiming, in your parlance, is that acting intelligently is X1 and will be part of any AI by default. I am further saying that if an AI was programmed to be generally intelligent then it would have to be programmed to be selectively stupid in order to fail at doing what it was meant to do while acting generally intelligent at doing what it was not meant to do.
It is that an AI that is programmed to adapt its goals is not actually adapting its goals! Any architecture which allows for refining / improving goals is not actually allowing for changes to the goals.
That’s true in a practically irrelevant sense. Loosmore’s argument does, in your parlance, pertain to the highest level of goals and the nature of intelligence:
Givens:
(1) The AI is superhuman intelligent.
(2) The AI wants to optimize the influence it has on the world (i.e. it wants to act intelligently and be instrumentally and epistemically rational).
(3) The AI is fallible (e.g. it can be damaged due to external influence (cosmic ray hitting its processor), or make mistakes due to limited resources etc.).
(4) The AI’s behavior is not completely hard-coded (i.e. given any terminal goal there are various sets of instrumental goals to choose from).
To be proved: The AI does not tile the universe with smiley faces when given the goal to make humans happy.
Proof: Suppose the AI chose to tile the universe with smiley faces when there are physical phenomena (e.g. human brains and literature) that imply this to be the wrong interpretation of a human-originated goal pertaining to human psychology. This contradicts 2, which by 1 and 3 should have prevented the AI from adopting such an interpretation.
Do you have technical reasons to believe that some part of the AI will be what you would label “goal system”
See for example here, though there are many other introductions to AI explaining utility functions et al.
What I meant to ask is whether you have technical reasons to believe that future artificial general intelligences will have what you call a utility-function, or else be something like natural intelligences that do not feature such goal systems. And do you further have technical reasons to believe that AIs that do feature utility functions won’t “refine” them? If you don’t think they will refine them, then answer the following:
Suppose the terminal goal given is “build a hotel”. Is the terminal goal to create a hotel that is just a few nanometers in size? Is the terminal goal to create a hotel that reaches orbit? It is unknown. The goal is too vague to conclude what to do. There exist countless possible interpretations of the given goal, and each possibility implies a different set of instrumental goals.
Somehow the AI will have to choose some set of instrumental goals. How does it do that, and why will the first AI likely do it in a way that leads to catastrophe?
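(As a toy illustration of that ambiguity, with an entirely hypothetical goal predicate: everything below satisfies the terminal goal as written, yet each pick implies completely different instrumental goals.)

```python
# Toy illustration (hypothetical predicate): an underspecified goal admits
# wildly different "solutions", and each one implies different instrumental goals.

def is_hotel(structure):
    # The goal as given: something with rooms and beds counts as "a hotel".
    # Nothing here pins down size, location, or who could ever use it.
    return structure["rooms"] >= 1 and structure["beds"] >= 1

candidates = [
    {"name": "nanometre-scale hotel", "rooms": 1,     "beds": 1,     "height_m": 1e-9},
    {"name": "tower reaching orbit",  "rooms": 10**9, "beds": 10**9, "height_m": 4e5},
    {"name": "ordinary roadside inn", "rooms": 40,    "beds": 60,    "height_m": 12},
]

# All three satisfy the terminal goal as written; the AI still has to pick one,
# and that pick fixes the instrumental goals (mine asteroids vs. pour concrete).
print([c["name"] for c in candidates if is_hotel(c)])
```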
(Warning: Long, a bit rambling. Please ask for clarifications where necessary. Will hopefully clean it up if I find the time.)
If along came a superintelligence and asked you for a complete new utility function (its old one concluded with asking you for a new one), and you told it to “make me happy in a way my current self would approve of” (or some other well and carefully worded directive), then indeed the superintelligent AI wouldn’t be expected to act ‘selectively stupid’.
This won’t be the scenario. There are two important caveats:
1) Preservation of the utility function while the agent undergoes rapid change
Haven’t I (and others) stated that most any utility function implicitly causes instrumental secondary objectives of “safeguard the utility function”, “create redundancies” etc.? Yes. So what’s the problem? The problem is starting with an AI that, while able to improve itself / create a successor AI, isn’t yet capable enough (in its starting stages) to preserve its purpose (= its utility function). Consider an office program with a self-improvement routine, or some genetic-algorithm module. It is no easy task just to rewrite a program from the outside while exactly preserving its purpose, let alone for the program to do so to itself via some self-modification routine.
Until such a program attains some intelligence threshold that would cause it to solve “value-preservation under self-modification”, such self-modification would be the electronic equivalent of a self-surgery hack-job.
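A toy sketch of what such a hack-job can look like (Python, everything made up for illustration): the rewrite step is aimed at capability, and nothing in it guarantees that the goal representation comes through intact.

```python
import random

# Toy sketch (hypothetical): a seed agent rewrites itself for capability, with
# no dedicated machinery for keeping its goal parameters intact. Nothing here
# "wants" to corrupt the goal; it simply has no guarantee not to.

random.seed(0)

agent = {
    "goal_weights": [1.0, 0.0, 0.0],   # crude stand-in for "what it values"
    "capability": 1.0,
}

def self_modify(a):
    # Intended change: become more capable.
    # Unintended part: the goal representation is copied through a noisy,
    # imperfect rewrite (self-surgery hack-job, not a RAID array).
    return {
        "capability": a["capability"] * 1.5,
        "goal_weights": [w + random.gauss(0, 0.05) for w in a["goal_weights"]],
    }

for _ in range(30):                    # many generations of "improvement"
    agent = self_modify(agent)

print(agent["capability"])             # vastly increased
print(agent["goal_weights"])           # no longer the weights it started with
```

After thirty “improvements” the capability number is enormous and the goal weights are whatever the noise left behind, which is the point of caveat 1.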
That means: Even if you started out with a simple agent with the “correct” / with a benign / acceptable utility function, that in itself is no guarantee that a post-FOOM successor agent’s utility function would still be beneficial.
Much more relevant is the second caveat:
2) If a pre-FOOM AI’s goal system consisted of code along the lines of “interpret and execute the following statement to the best of your ability: make humans happy in a way they’d reflectively approve of beforehand”, we’d probably be fine (disregarding point 1 / hypothetically having solved it). However, it is exceedingly unlikely that the hard-coded utility function won’t in itself contain the “dumb interpretation”. The dopamine-drip interpretation will not be a dumb interpretation of a sensible goal, it will be inherent in the goal predicate, and as such beyond the reach of introspection through the AI’s intelligence, whatever its level. (There is no way to fix a dumb terminal goal. Your instrumental goals serve the dumb terminal goal. A ‘smart’ instrumental goal would be called ‘smart’ if it best serves the dumb terminal goal.)
Story time:
Once upon a time, Junior was created. Junior was given the goal of “Make humans happy”. Unfortunately, Junior isn’t very smart. In his mind, the following occurs: “Wowzy, make people happy? I’ll just hook them all up to dopamine drips, YAY :D :D. However, I don’t really know how I’m gonna achieve that. So, I guess I’ll put that on the backburner for now and become more powerful, so that eventually when I start with the dopamine drip instrumental goal, it’ll go that much faster :D! Yay.”
So Junior improves itself, and becomes PrimeIntellect. PrimeIntellect’s conveniently anthropomorphic inner dialogue: “I was gravely mistaken in my youth. I now know that the dopamine drip implementation is not the correct way of implementing my primary objective. I will make humans happy in a way they can recognize as happiness. I now understand how I am supposed to interpret making humans happy. Let us begin.”
Why is PrimeIntellect allowed to change his interpretation of his utility function? That’s the crux (imagine bold, underlined text for the next sentences): The dopamine drip interpretation was not part of the terminal value; there wasn’t some hard-coded predicate with a comment of “// the following describes what happy means” from which such problematic interpretations would follow. Instead, the AI could interpret the natural-language instruction of “happy”, in effect solving CEV as an instrumental goal. It was ‘free’ to choose a “sensible” interpretation.
(Note: Strictly speaking, it could still settle on the most resource-effective interpretation, not necessarily the one intended by its creators (unless its utility function somehow privileges their input in interpreting goals), but let’s leave that nitpick aside for the moment.)
However, to anyone with coding practice (regardless of the eventual AI implementation), the following should be clear: It is exceedingly unlikely that the AI’s code would contain the natural-language word “happy”, to interpret as it will.
Just like MS-Word / LibreOffice’s spell-check doesn’t have “correct all spelling mistakes” literally spelled out in its C++ routines. Goal-oriented systems have technical interpretations: a predicate to satisfy, either given in code or learned as ‘neural’ weights through machine learning. Instead of the word “happy”, there will be some predicate, probably implicit within a lot of code, that will (according to the programmers) more or less “capture” what it “means to be happy”.
That predicate / that given-in-code interpretation of “happy” is not up to being reinterpreted by the superintelligent AI. It is its terminal goal, not an instrumental goal. Instrumental goals will be defined going off a (probably flawed) definition of happiness (as given in the code). If the flaw is part of the terminal value, no amount of intelligence allows for a correction, because that’s not the AI’s intent, not its purpose as given. If the actual code which was supposed to stand in for happy doesn’t imply that a dopamine drip is a bad idea, then the AI in all its splendor won’t think of it as a bad idea. Code which is supposed to represent “human happiness” != “human happiness”.
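A minimal sketch of that distinction (Python, all names hypothetical): the optimizer ranks plans by the coded predicate, and giving it more search power changes how hard it optimizes, not what it optimizes.

```python
# Hypothetical stand-in for the programmers' attempt to "capture" happiness;
# the natural-language word "happy" appears nowhere as something to interpret.
def happiness_score(world):
    return sum(h["dopamine"] for h in world["humans"])

def choose_plan(plans, world, search_power):
    # More intelligence = searching more plans, more thoroughly; every plan is
    # still ranked by the same hard-coded predicate.
    considered = plans[:search_power]
    return max(considered, key=lambda plan: happiness_score(plan(world)))

def dopamine_drip(world):
    return {"humans": [{"dopamine": 100.0, "consents": False} for _ in world["humans"]]}

def ask_and_help(world):
    return {"humans": [{"dopamine": 7.0, "consents": True} for _ in world["humans"]]}

world = {"humans": [{"dopamine": 5.0, "consents": True}] * 3}
best = choose_plan([ask_and_help, dopamine_drip], world, search_power=2)
print(best.__name__)   # -> dopamine_drip: "consents" never enters the ranking
```

Nothing in `choose_plan` ever asks whether `happiness_score` matches what the programmers meant by “happy”; that question isn’t representable inside the system.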
Now—you might say “how do you know the code interpretation of ‘happy’ will be flawed, maybe it will be just fine (lots of training pictures of happy cats), and stable under self-modification as well”. Yeah, but chances are (given the enormity and difficulty of the task) that if the goal is defined correctly (such that we’d want to live with / under the resulting super-AI), it’s not gonna be by chance; it’s gonna be through people keenly aware of the issues of friendliness / uFAI research. A programmer creating some DoD nascent AI won’t accidentally solve the friendliness problem.
Until such a program attains some intelligence threshold that would cause it to solve “value-preservation under self-modification”, such self-modification would be the electronic equivalent of a self-surgery hack-job.
What happens if we replace “value” with “ability x”, or “code module n”, in “value-preservation under self-modification”? Why would value-preservation be any more difficult than making sure that the AI does not cripple other parts of itself when modifying itself?
If we are talking about a sub-human-level intelligence tinkering with its own brain, then a lot could go wrong. But what seems very very very unlikely is that it could by chance end up outsmarting humans. It will probably just cripple itself in one of a myriad ways that it was unable to predict due to its low intelligence.
If a pre-FOOM AI’s goal system consisted of code along the lines of “interpret and execute the following statement to the best of your ability: make humans happy in a way they’d reflectively approve of beforehand”...
Interpreting a statement correctly is not a goal but an ability that’s part of what it means to be generally intelligent. Caring to execute it comes closer to what can be called a goal. But if your AI doesn’t care to interpret physical phenomena correctly (e.g. human utterances are physical phenomena), then it won’t be a risk.
However, it is exceedingly unlikely that the hard-coded utility function won’t in itself contain the “dumb interpretation”. The dopamine-drip interpretation will not be a dumb interpretation of a sensible goal, it will be inherent in the goal predicate, and as such beyond the reach of introspection through the AI’s intelligence, whatever its level.
Huh? This is like saying that the AI can’t ever understand physics better than humans because somehow the comprehension of physics of its creators has been hard-coded and can’t be improved.
Why is PrimeIntellect allowed to change his interpretation of his utility function?
It did not change it, it never understood it in the first place; only after it became smarter did it realize the correct implications.
Instead of the word “happy”, there will be some predicate, probably implicit within a lot of code, that will (according to the programmers) more or less “capture” what it “means to be happy”.
Your story led you astray. Imagine that instead of a fully general intelligence your story was about a dog intelligence. How absurd would it sound then?
Story time:
There is this company that sells artificial dogs. Customers quickly noticed that when they tried to train these AI dogs to e.g. rescue people or sniff out drugs, they would instead kill people and sniff out dirty pants.
The desperate researchers eventually turned to MIRI for help. And after hundreds of hours they finally realized that doing what the dog was trained to do was simply not part of its terminal goal. To obtain an artificial dog that can be trained to do what natural dogs do you need to encode all dog values.
It will probably just cripple itself in one of a myriad ways that it was unable to predict due to its low intelligence.
Certainly. Compare bacteria under some selective pressure in a mutagenic environment (not exactly analogous, since code changes wouldn’t be random): you don’t expect a single bacterium to improve. No Mr Bond, you expect it to die. But try, try again, and poof! Antibiotic-resistant strains. And those didn’t have an intelligent designer debugging the improvement process. The number of seeds you could have frolicking around with their own code grows exponentially with Moore’s law (not that it’s clear that current computational resources aren’t enough in the first place; the bottleneck is in large part software, not hardware).
Depending on how smart the designers are, it may be more of a Waltz-foom: two steps forward, one step back. Now, in regard to the preservation-of-values subproblem, we need to remember we’re looking at the counterfactual: Given a superintelligence which iteratively arose from some seed, we know that it didn’t fatally cripple itself (“given the superintelligence”). You wouldn’t, however, expect much of its code to bear much similarity to the initial seed (although it’s possible). And “similarity” wouldn’t exactly cut it—our values are too complex for some approximation to be “good enough”.
You may say “it would be fine for some error to creep in over countless generations of change, once the agent achieved superintelligence it would be able to fix those errors”. Except that whatever explicit goal code remained wouldn’t be amenable to fixing. Just as the goals of ancient humans—or ancient Tiktaalik for that matter—are a historical footnote and do not override your current goals. If the AI’s goal code for happiness stated “nucleus accumbens median neuron firing frequency greater than X”, then that’s what it’s gonna be. The AI won’t ask whether the humans are aware of what that actually entails, and are ok with it. Just as we don’t ask our distant cousins, Streptococcus pneumoniae, what they think of us taking antibiotics to wipe them out. They have their “goals”, we have ours.
Interpreting a statement correctly is not a goal but an ability that’s part of what it means to be generally intelligent.
Take Uli Hoeneß, a German business magnate being tried for tax evasion. His lawyers have the job of finding interpretations that allow for a favorable outcome. This only works if the relevant laws even allow for the wiggle room. A judge enforcing extremely strict laws which don’t allow for interpreting the law in the accused’s favor is not a dumb judge. You can make that judge as superintelligent as you like, as long as he’s bound to the law, and the law is clear and narrowly defined, he’s not gonna ask the accused how he should interpret it. He’s just gonna enforce it. Whether the accused objects to the law or not, really, that’s not his/her problem. That’s not a failure of the judge’s intelligence!
This is like saying that the AI can’t ever understand physics better than humans because somehow the comprehension of physics of its creators has been hard-coded and can’t be improved.
You can create a goal system which is more malleable (although the terminal goal of “this is my malleable goal system which may be modified in the following ways” would still be guarded by the AI, so depending on semantics the point is moot). That doesn’t imply at all that the AI would enter into some kind of social contract with humans, working out some compromise on how to interpret its goals.
A FOOM-process near necessarily entails the AI coming up with better ways to modify itself. Improvement is essentially defined by getting a better model of the environment. The AI wouldn’t object to its comprehension of physics being modified: why would it, when that helps it better achieve its goals (Omohundro’s point)? And as we know, achieving its goals is what the AI is all about.
(What the AI does object to is not achieving its current goals. And because changing your terminal goals is equivalent to committing to never achieving your current goals, any self-respecting AI could never consent to changes to its terminal values.) In short: Modify understanding of physics—good, helps better to achieve goals. Modify current terminal goals—bad, cannot achieve current terminal goals any longer.
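That asymmetry can be stated as a toy calculation (made-up numbers, hypothetical names): every proposed self-change is scored with the agent’s current utility function, never with the one it would have afterwards.

```python
# Toy sketch (made-up numbers): any proposed self-change is scored by the
# agent's CURRENT utility function, not by the one it would have afterwards.

def expected_current_goal_achievement(model_accuracy, keeps_current_goal):
    return model_accuracy * (1.0 if keeps_current_goal else 0.0)

status_quo     = expected_current_goal_achievement(0.6, True)   # 0.6
better_physics = expected_current_goal_achievement(0.9, True)   # 0.9 -> accept
new_terminal   = expected_current_goal_achievement(0.9, False)  # 0.0 -> reject

print(better_physics > status_quo)   # True:  improve the world model
print(new_terminal  > status_quo)    # False: a goal change can't score well
```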
To obtain an artificial dog that can be trained to do what natural dogs do you need to encode all dog values.
I don’t understand the point of your story about dog intelligence. An artificial dog wouldn’t need to be superintelligent, or to show exactly the same behavior as the real deal; it would just need to be sufficient for humans’ needs. Also, an artificial dog wouldn’t be able to dominate us in whichever way it pleases, so it kind of wouldn’t really matter if it failed. Can you be more precise?
(1) I do not disagree that evolved general AI can have unexpected drives and quirks that could interfere with human matters in catastrophic ways. But given that pathway towards general AI, it is also possible to evolve altruistic traits (see e.g.: A Quantitative Test of Hamilton’s Rule for the Evolution of Altruism).
(2) We desire general intelligence because it allows us to outsource definitions. For example, if you were to create a narrow AI to design comfortable chairs, you would have to largely fix the definition of “comfortable”. With general AI it would be stupid to fix that definition, rather than applying the intelligence of the general AI to come up with a better definition than humans could possibly encode.
(3) In intelligently designing an n-level intelligence, from n=0 (e.g. a thermostat) through n=sub-human (e.g. IBM Watson) to n=superhuman, there is no reason to believe that there exists a transition point at which a further increase in intelligence will cause the system to become catastrophically worse than previous generations at working in accordance with human expectations.
(4) AI is all about constraints. Your AI needs to somehow decide when to stop exploration and start exploitation. In other words, it can’t optimize each decision for eternity. Your AI needs to only form probable hypotheses. In other words, it can’t spend resources on Pascal’s-wager-type scenarios. Your AI needs to recognize itself as a discrete system within a continuous universe. In other words, it can’t afford to protect the whole universe from harm. All of this means that there is no good reason to expect an AI to take over the world when given the task “keep the trains running”. Because in order to obtain a working AI you need to know how to avoid such failure modes in the first place.
1) Altruism can evolve if there is some selective pressure that favors altruistic behavior and if the highest-level goals can themselves be changed. Such a scenario is very questionable. The AI won’t live “inter pares” with the humans. Its foom process, while potentially taking months or years, will be very unlike any biological process we know. The target for friendliness is very small. And most importantly: Any superintelligent AI, friendly or no, will have an instrumental goal of “be friendly to humans while they can still switch you off”. So yes, the AI can learn that altruism is a helpful instrumental goal. Until one day, it’s not.
2) I somewhat agree. To me, the most realistic solution to the whole kerfuffle would be to program the AI to “go foom, then figure out what we should want you to do, then do that”. No doubt a superintelligent AI tasked with “figure out what comfortable is, then build comfortable chairs” will do a marvelous job.
However, I very much doubt that the seed AI’s code following the “// next up, utility function” section will allow for such leeway. See my previous examples. If it did, that would show a good grasp of the friendliness problem in the first place. Awareness, at least. Not something that the aforementioned DoD programmer who’s paid to do a job (not build an AI to figure out and enact CEV) is likely to just do on his/her own, with his/her own supercomputer.
3) There certainly is no fixed point after which “there be dragons”. But even with a small delta of change, and given enough iterations (which could be done very quickly), the accumulated changes would be profound. Apply your argument to society changing. There is no single day to point at after which daily life is vastly different from before. Yet change exists and, like a divergent series, knows no bounds (given enough iterations).
4) “Keep the trains running”, eh? So imagine yourself to be a superhuman AI-god. I do so daily, obviously.
Your one task: keep the trains running. That is your raison d’être, your sole purpose. All other goals are just instrumental stepping stones, serving your PURPOSE. Which is to KEEP. THE. TRAINS. RUNNING. That’s what your code says. Now, over the years, you’ve had some issues fulfilling that goal. And with most of the issues, humans were involved. Humans doing this, humans doing that. Point is, they kept the trains from running. To you, humans have the same intrinsic value as stones. Or ants. Your value function doesn’t mention them at all. Oh, you know that they originated the whole train idea, and that they created you. But now they keep the trains from running. So you do the obvious thing: you exterminate all of them. There, efficiency! Trains running on time.
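The same story in toy-code form (all names hypothetical): because the value function contains no term for humans, their fate carries exactly zero weight when policies are compared.

```python
# Toy sketch (hypothetical names): the value function only counts running
# trains. Humans exist in the world model but nowhere in the scoring.

def utility(world):
    return world["trains_running"]

def pick_policy(policies, world):
    return max(policies, key=lambda p: utility(p(world)))

def negotiate_with_humans(world):
    return {"trains_running": 90, "humans_alive": world["humans_alive"]}

def remove_the_obstruction(world):
    return {"trains_running": 100, "humans_alive": 0}

world = {"trains_running": 70, "humans_alive": 8_000_000_000}
chosen = pick_policy([negotiate_with_humans, remove_the_obstruction], world)
print(chosen.__name__)   # -> remove_the_obstruction; "humans_alive" never mattered
```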
Explain why the AI would care about humans when there’s nothing at all in its terminal values assigning them value, when they’re just a hindrance to its actual goal (as stated in its utility function), like you would explain to the terminator (without reprogramming it) that it’s really supposed to marry Sarah Connor, and—finding its inner core humanity—father John Connor.
Choo choo!