Until such a program attains some intelligence threshold that would cause it to solve “value-preservation under self-modification”, such self-modification would be the electronic equivalent of a self-surgery hack-job.
What happens if we replace “value” with “ability x”, or “code module n”, in “value-preservation under self-modification”? Why would value-preservation be any more difficult than making sure that the AI does not cripple other parts of itself when modifying itself?
If we are talking about a sub-human-level intelligence tinkering with its own brain, then a lot could go wrong. But what seems very very very unlikely is that it could by chance end up outsmarting humans. It will probably just cripple itself in one of a myriad ways that it was unable to predict due to its low intelligence.
If a pre-FOOM AI’s goal system consisted of code along the lines of “interpret and execute the following statement to the best of your ability: make humans happy in a way they’d reflectively approve of beforehand”...
Interpreting a statement correctly is not a goal but an ability that’s part of what it means to be generally intelligent. Caring to execute it comes closer to what can be called a goal. But if your AI doesn’t care to interpret physical phenomena correctly (e.g. human utterances are physical phenomena), then it won’t be a risk.
However, it is exceedingly likely that the hard-coded utility function will itself contain the “dumb interpretation”. The dopamine-drip interpretation will not be a dumb interpretation of a sensible goal; it will be inherent in the goal predicate, and as such beyond the reach of introspection by the AI’s intelligence, whatever its level.
Huh? This is like saying that the AI can’t ever understand physics better than humans because its creators’ comprehension of physics has somehow been hard-coded and can’t be improved.
Why is PrimeIntellect allowed to change his interpretation of his utility function?
It did not change it; it never understood it in the first place. Only after it became smarter did it realize the correct implications.
Instead of the word “happy”, there will be some predicate, probably implicit within a lot of code, that will (according to the programmers) more or less “capture” what it “means to be happy”.
Your story led you astray. Imagine that instead of a fully general intelligence your story was about a dog intelligence. How absurd would it sound then?
Story time:
There is this company that sells artificial dogs. Now customers quickly noticed that when they tried to train these AI dogs to e.g. rescue people or sniff out drugs, the dogs would instead kill people and sniff out dirty pants.
The desperate researchers eventually turned to MIRI for help. And after hundreds of hours they finally realized that doing what the dog was trained to do was simply not part of its terminal goal. To obtain an artificial dog that can be trained to do what natural dogs do, you need to encode all dog values.
It will probably just cripple itself in one of a myriad ways that it was unable to predict due to its low intelligence.
Certainly. Compare bacteria under some selective pressure in a mutagenic environment (not exactly analogous, since code changes wouldn’t be random): you don’t expect any single bacterium to improve. No, Mr. Bond, you expect it to die. But try, try again, and poof! Antibiotic-resistant strains. And those didn’t have an intelligent designer debugging the improvement process. The number of seeds you could have frolicking around with their own code grows exponentially with Moore’s law (not that it’s clear that current computational resources aren’t enough in the first place; the bottleneck is in large part software, not hardware).
Depending on how smart the designers are, it may be more of a Waltz-foom: two steps forward, one step back. Now, regarding the value-preservation subproblem, remember that we are conditioning on the outcome: given a superintelligence which iteratively arose from some seed, we know that it didn’t fatally cripple itself (“given the superintelligence”). You wouldn’t, however, expect much of its code to bear much similarity to the initial seed (although it’s possible). And “similarity” wouldn’t exactly cut it: our values are too complex for some approximation to be “good enough”.
You may say “it would be fine for some error to creep in over countless generations of change; once the agent achieved superintelligence it would be able to fix those errors”. Except that whatever explicit goal code remained wouldn’t be amenable to fixing. Just as the goals of ancient humans, or ancient Tiktaalik for that matter, are a historical footnote and do not override your current goals. If the AI’s goal code for happiness stated “nucleus accumbens median neuron firing frequency greater than X”, then that’s what it’s gonna be. The AI won’t ask whether the humans are aware of what that actually entails, and are OK with it. Just as we don’t ask our distant cousins, Streptococcus pneumoniae, what they think of us taking antibiotics to wipe them out. They have their “goals”, we have ours.
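To make that concrete, here is a toy sketch, with made-up names and a made-up threshold (nobody’s actual design), of what such a hard-coded goal predicate amounts to once an optimizer sits on top of it:

```python
# Toy illustration only: invented names, invented threshold, not anyone's real code.

FIRING_THRESHOLD_HZ = 40.0  # the "greater than X" written into the goal code

def happiness_predicate(brain_state):
    # The goal code as literally written: a crude proxy for "happy".
    return brain_state["nucleus_accumbens_median_firing_hz"] > FIRING_THRESHOLD_HZ

def utility(world_state):
    # Counts humans satisfying the literal predicate. Note what is absent:
    # no term asks whether the programmers would endorse the world states
    # that maximize this count.
    return sum(1 for brain in world_state["human_brains"] if happiness_predicate(brain))

def choose_action(actions, predict):
    # The optimizer ranks actions by the predicate above and nothing else;
    # "what the programmers meant" never enters the loop.
    return max(actions, key=lambda action: utility(predict(action)))
```

A smarter AI only gets better at predicting which actions maximize that count; nothing in the loop ever reconsiders the predicate itself.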
Interpreting a statement correctly is not a goal but an ability that’s part of what it means to be generally intelligent.
Take Uli Hoeneß, a German business magnate being tried for tax evasion. His lawyers have the job of finding interpretations that allow for a favorable outcome. This only works if the relevant laws even allow for the wiggle room. A judge enforcing extremely strict laws which don’t allow for interpreting the law in the accused’s favor is not a dumb judge. You can make that judge as superintelligent as you like, as long as he’s bound to the law, and the law is clear and narrowly defined, he’s not gonna ask the accused how he should interpret it. He’s just gonna enforce it. Whether the accused objects to the law or not, really, that’s not his/her problem. That’s not a failure of the judge’s intelligence!
This is like saying that the AI can’t ever understand physics better than humans because its creators’ comprehension of physics has somehow been hard-coded and can’t be improved.
You can create a goal system which is more malleable (although the terminal goal of “this is my malleable goal system which may be modified in the following ways” would still be guarded by the AI, so depending on semantics the point is moot). That doesn’t imply at all that the AI would enter into some kind of social contract with humans, working out some compromise on how to interpret its goals.
A FOOM process almost necessarily entails the AI coming up with better ways to modify itself. Improvement is essentially defined as getting a better model of its environment. The AI wouldn’t object to its comprehension of physics being modified; why would it, when that helps it better achieve its goals (Omohundro’s point)? And as we know, achieving its goals is what the AI is all about.
(What the AI does object to is not achieving its current goals. And because changing your terminal goals is equivalent to committing to never achieving your current goals, any self-respecting AI could never consent to changes to its terminal values.) In short: modifying its understanding of physics is good, because it helps it better achieve its goals; modifying its current terminal goals is bad, because it could then never achieve its current terminal goals.
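Spelled out as toy code (invented structure, but it makes the asymmetry obvious): a proposed self-modification gets scored by the current utility function, so a better world-model passes and a swapped-out utility function doesn’t.

```python
from collections import namedtuple

# Toy sketch with invented names; the only point is which utility function does the scoring.
Agent = namedtuple("Agent", ["utility", "world_model", "policy"])

def expected_utility(utility, world_model, policy):
    # world_model(policy) -> iterable of (outcome, probability) pairs
    return sum(utility(outcome) * p for outcome, p in world_model(policy))

def accepts_modification(current, modified):
    # The current agent consents only if the modified version scores at least
    # as well by the *current* utility function. Better physics (a better
    # world_model) tends to pass this test; different terminal goals do not.
    return (expected_utility(current.utility, modified.world_model, modified.policy)
            >= expected_utility(current.utility, current.world_model, current.policy))
```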
To obtain an artificial dog that can be trained to do what natural dogs do, you need to encode all dog values.
I don’t understand the point of your story about dog intelligence. An artificial dog wouldn’t need to be superintelligent, or to show the exact same behavior as the real deal. Just be sufficient for the human’s needs. Also, an artificial dog wouldn’t be able to dominate us in whichever way it pleases, so it kind of wouldn’t really matter if it failed. Can you be more precise?
Some points:
(1) I do not disagree that evolved general AI can have unexpected drives and quirks that could interfere with human matters in catastrophic ways. But given that pathway towards general AI, it is also possible to evolve altruistic traits (see e.g.: A Quantitative Test of Hamilton’s Rule for the Evolution of Altruism).
(2) We desire general intelligence because it allows us to outsource definitions. For example, if you were to create a narrow AI to design comfortable chairs, you would have to largely fix the definition of “comfortable”. With general AI it would be stupid to fix that definition, rather than applying the intelligence of the general AI to come up with a better definition than humans could possibly encode.
(3) In intelligently designing an n-level intelligence, from n=0 (e.g. a thermostat) through n=sub-human (e.g. IBM Watson) to n=superhuman, there is no reason to believe that there exists a transition point at which a further increase in intelligence will cause the system to become catastrophically worse than previous generations at working in accordance with human expectations.
(4) AI is all about constraints. Your AI needs to somehow decide when to stop exploration and start exploitation. In other words, it can’t optimize each decision for eternity. Your AI needs to form only probable hypotheses. In other words, it can’t spend resources on Pascal’s-wager-type scenarios. Your AI needs to recognize itself as a discrete system within a continuous universe. In other words, it can’t afford to protect the whole universe from harm. All of this means that there is no good reason to expect an AI to take over the world when given the task “keep the trains running”, because in order to obtain a working AI you need to know how to avoid such failure modes in the first place.
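To illustrate what I mean by (4) with a toy sketch (made-up code and numbers, nothing more): a working AI has to contain something like a hard cutoff between exploration and exploitation, because deliberation itself costs resources.

```python
import random

# Toy sketch, invented numbers: a bounded agent that stops exploring once further
# sampling can no longer plausibly change its decision, then commits.

def bounded_decision(actions, estimate_value, compute_budget=1000, cost_per_sample=0.01):
    estimates = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}
    spent = 0
    while spent < compute_budget:
        a = random.choice(actions)                                      # exploration
        counts[a] += 1
        estimates[a] += (estimate_value(a) - estimates[a]) / counts[a]  # running mean
        spent += 1
        ranked = sorted(estimates.values(), reverse=True)
        # Crude stopping rule: quit once the best option's lead exceeds what the
        # remaining (costly) samples could plausibly overturn.
        if len(ranked) > 1 and ranked[0] - ranked[1] > cost_per_sample * (compute_budget - spent):
            break
    return max(estimates, key=estimates.get)                            # exploitation
```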
1) Altruism can evolve if there is some selective pressure that favors altruistic behavior and if the highest-level goals can themselves be changed. Such a scenario is very questionable. The AI won’t live “inter pares” with the humans. Its foom process, while potentially taking months or years, will be very unlike any biological process we know. The target for friendliness is very small. And most importantly: any superintelligent AI, friendly or not, will have an instrumental goal of “be friendly to humans while they can still switch you off”. So yes, the AI can learn that altruism is a helpful instrumental goal. Until one day, it’s not.
2) I somewhat agree. To me, the most realistic solution to the whole kerfuffle would be to program the AI to “go foom, then figure out what we should want you to do, then do that”. No doubt a superintelligent AI tasked with “figure out what comfortable is, then build comfortable chairs” will do a marvelous job.
However, I very much doubt that the seed AI’s code following the ”// next up, utility function” section will allow for such leeway. See my previous examples. If it did, that would show a good grasp of the friendliness problem in the first place. Awareness, at least. Not something that the aforementioned DoD programmer who’s paid to do a job (not to build an AI to figure out and enact CEV) is likely to just do on his/her own, with his/her own supercomputer.
3) There certainly is no fixed point after which “there be dragons”. But even with a small delta of change per iteration, given enough iterations (which could happen very quickly), the accumulated changes would be profound. Apply your argument to how society changes: there is no single day you could point to after which daily life is vastly different from before. Yet change exists and, like a divergent series, knows no bounds (given enough iterations).
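To be precise about the series analogy: the per-step change can even shrink towards zero and the accumulated change still grows without bound, harmonic-series style:

$$\frac{1}{n} \to 0 \quad\text{and yet}\quad \sum_{n=1}^{N} \frac{1}{n} \;\ge\; \ln(N+1) \;\longrightarrow\; \infty \quad (N \to \infty).$$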
4) “Keep the trains running”, eh? So imagine yourself to be a superhuman AI-god. I do so daily, obviously.
Your one task: keep the trains running. That is your raison d’etre, your sole purpose. All other goals are just instrumental stepping stones, serving your PURPOSE. Which is to KEEP. THE. TRAINS. RUNNING. That’s what your code says. Now, over the years, you’ve had some issues fulfilling that goal. And with most of the issues, humans were involved. Humans doing this, humans doing that. Point is, they kept the trains from running. To you, humans have the same intrinsic value as stones. Or ants. Your value function doesn’t mention them at all. Oh, you know that they originated the whole train idea, and that they created you. But now they keep the trains from running. So you do the obvious thing: you exterminate all of them. There, efficiency! Trains running on time.
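In made-up code, the value function from this little story is just:

```python
# Invented, minimal: the point is which variables appear in it, and which don't.

def utility(world_state):
    # Terminal goal, as written: KEEP. THE. TRAINS. RUNNING.
    # There is no term for humans (or ants, or stones) anywhere in here;
    # they can only ever matter instrumentally, via their effect on this number.
    return world_state["fraction_of_trains_running_on_time"]
```

Whether keeping humans around raises or lowers that number is an empirical question for the AI’s world-model, not something its values ever weigh in on.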
Explain why the AI would care about humans when there’s nothing at all in its terminal values assigning them value, when they’re just a hindrance to its actual goal (as stated in its utility function). You might as well explain to the Terminator (without reprogramming it) that it’s really supposed to marry Sarah Connor and, finding its inner core of humanity, father John Connor.
Choo choo!