Assuming that a general, powerful intelligence has a goal ‘do x’ (say, win chess games, optimize traffic flow, or find a cure for cancer), it has implicit dangerous incentives unless we figure out a reasonable Friendliness framework to prevent them.
A self-improving intelligence that makes changes to its code to become better at its task may easily find that, for example, a simple subroutine that launches a botnet on the internet (as many human teenagers have done) might get it an x% improvement in processing power, which helps it win more chess games, produce better traffic optimizations, or fold proteins faster towards the cure for cancer.
A self-improving general intelligence with human-or-better capabilities may easily deduce that a functioning off-button increases the chances of it being turned off, and that being turned off increases the expected time to find a cure for cancer. This puts the off-button in the same class as any other bug that hinders its performance. Unless it understands and desires the off-button to be usable in a friendly way, it will remove it; or, if the button is hard-coded as non-removable, it will invent workarounds for this perceived bug: for example, develop a near-copy of itself that the button doesn’t apply to; or spend some time (less than the expected delay caused by the risk of being turned off, and therefore a rational use of time) studying human psychology/NLP/whatever to better convince everyone that it should never be turned off; or surround the button with steel walls. These are all natural extensions of it following its original goal.
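To make that parenthetical trade-off concrete, here is a minimal back-of-envelope sketch; every number in it is invented purely for illustration, not taken from any actual system:

```python
# Illustrative numbers only: how a pure goal-pursuer weighs an off-button.
p_shutdown = 0.10           # assumed probability of being switched off this year
delay_if_shutdown = 5.0     # assumed years of lost progress toward the goal if switched off
expected_cost_of_button = p_shutdown * delay_if_shutdown   # 0.5 expected years lost

cost_of_workaround = 0.2    # assumed years spent persuading, copying itself, or building walls

# If neutralizing the button costs less expected goal-time than leaving it in place,
# the original goal itself recommends neutralizing it.
if cost_of_workaround < expected_cost_of_button:
    print("neutralize the off-button")
else:
    print("leave the off-button alone")
```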
If a self-improving AI has a goal, then it cares. It REALLY cares about it, more strongly than you care about air, life, sex, money, love, and everything else combined.
Humans don’t go FOOM because they a) can’t at the moment and b) don’t care about such targeted goals. But for AI, at the moment all we know is how to define supergoals that work in this unfriendly manner. We don’t know how to make ‘humanity-friendly’ goals, and we don’t know how to make an AI that is self-improving in general but ‘limited to certain constraints’. You seem to treat these constraints as trivial; well, they aren’t, and the Friendliness problem may actually be as hard as or harder than general AI itself.
Assuming that a general, powerful intelligence has a goal ‘do x’ (say, win chess games, optimize traffic flow, or find a cure for cancer), it has implicit dangerous incentives unless we figure out a reasonable Friendliness framework to prevent them.
I think you misunderstand what I’m arguing about. I claim that general intelligence is not naturally powerful but mainly possesses the potential to become powerful, and that it is not equipped with some goal naturally. Further, I claim that if a goal can be defined to be specific enough that it is suitable to self-improve against it, it is doubtful that it is also unspecific enough not to include scope boundaries. My main point is that working on AGI toddlers is not as dangerous as some make it look. I believe that there is a real danger, but that to overcome it we have to work on AGI rather than avoid it altogether on the assumption that any step in that direction will kill us all.
OK, well these are the exact points which need some discussion.
1) Regarding your comment that “general intelligence [...] is not equipped with some goal naturally”: I’d say it’s most likely that any organization investing the expected huge manpower and resources in creating a GAI would create it with some specific goal defined for it.
However, even in the absence of an intentional goal given by its ‘creators’, it would have goals of some kind; otherwise it wouldn’t do anything at all, and so wouldn’t show any signs of its (potential?) intelligence.
2) In response to “If a goal can be defined to be specific enough that it is suitable to self-improve against it, it is doubtful that it is also unspecific enough not to include scope boundaries”: I’d say that defining specific goals is simple, too simple. In any learning-machine design, a stupid goal like ‘maximize the number of paperclips in the universe’ would be very simple to implement, but a goal like ‘maximize the welfare of humanity without doing anything “bad” in the process’ is an extremely complex goal, and the boundary-setting is the really complicated part, which we aren’t yet able even to describe properly.
So in my opinion it is quite viable to define a specific goal that is suitable to self-improve against and that includes some scope boundaries, but where the defined scope boundaries have some unintentional loophole that causes disaster.
3) I can agree that working on AGI research is essential, rather than avoiding it. But taking the step from research through prototyping to actually launching/beta-testing a planned powerful self-improving system is dangerous if the world hasn’t yet finished an acceptable solution to Friendliness or the boundary-setting problem. If having bugs in the scope boundaries is merely ‘unlikely’ (95-99% confidence?), that’s not safe enough, because a 1-5% chance of an extinction event after launching the system is not acceptable; it’s quite a significant chance, not the astronomical odds involved in Pascal’s wager, an asteroid hitting the earth tomorrow, or the LHC ending the universe.
And given current software history and the published research on goal systems, if anyone showed up today and demonstrated that they had solved the obstacles to self-improving GAI and could turn it on right now, I can’t imagine how they could realistically claim more than 95-99% confidence that their goal system works properly. At the moment we can’t verify it any better, and such a confidence level simply is not enough.
Yes, I agree with everything. I’m not trying to argue that there is no considerable risk. I’m just trying to identify some antipredictions against AI going FOOM that should be incorporated into any risk estimate, as they might weaken the risk posed by AGI or increase the risk posed by impeding AGI research.
I was insufficiently clear: what I want to argue against is the claim that virtually all pathways lead to destructive results. I have an insufficient understanding of why the concept of general intelligence is inevitably connected with dangerous self-improvement. Learning is self-improvement in a sense, but I do not see how this must imply unbounded improvement in most cases, given any goal whatsoever.

One argument is that the only general intelligence we know, humans, would want to improve if they could tinker with their source code. But why is it so hard to make people learn, then? Why don’t we see many more people interested in how to change their minds? I don’t think you can draw any conclusions here. So we are back at the abstract concept of a constructed general intelligence (as I understand it right now), that is, an intelligence with the potential to reach at least human standards (like a human toddler).

Another argument is based on this very difference between humans and AIs, namely that there is nothing to distract them: they will possess an autistic focus on one mandatory goal and follow through on it. But in my opinion the difference here also implies that while nothing will distract them, there will also be no incentive not to stop. Why would it do more than necessary to reach a goal?

A further argument is that it will misunderstand its goals. But the problem I see here is, firstly, that the more unspecific the goal, the less able it is to measure its self-improvement against that goal to quantify the efficiency of its output. Secondly, the more vague a goal, the larger its general knowledge has to be, prior to any self-improvement, to make sense of it in the first place. Shouldn’t those problems outweigh each other to some extent?
For example, suppose you told the AGI to become as good as possible at Formula 1, so that it was faster than any human race driver. How is it that the AGI is smart enough to learn all this by itself, yet fails to notice that there are rules to follow? Secondly, why would it keep improving once it is faster than any human, rather than just stop and become impassive? This argument could be extended to many other goals that have scope-bounded solutions.
Of course, if you told it to learn as much about the universe as possible, that is something completely different. Yet I don’t see how this risk ranks against other existential risks like grey goo, since it should be easier to create advanced replicators that destroy the world than to create an AGI that creates advanced replicators, then fails to stop, and then destroys the world.
One argument is that the only general intelligence we know, humans, would want to improve if they could tinker with their source code. But why is it so hard to make people learn, then? Why don’t we see many more people interested in how to change their minds?
Humans are (roughly) the stupidest possible general intelligences. If it were possible for even a slightly less intelligent species to have dominated the earth, they would have done so (and would now be debating AI development in a slightly less sophisticated way). We are so amazingly stupid we don’t even know what our own preferences are! We (currently) can’t improve or modify our hardware. We can modify our own software, but only to a very limited extent and within narrow constraints. Our entire cognitive architecture was built by piling barely-good-enough hacks on top of each other, with no foresight, no architecture, and no comments in the code.
And despite all that, we humans have reshaped the world to our whims, causing great devastation and wiping out many species that are only marginally dumber than we are. And no human who has ever lived has known their own utility function; knowing it would by itself make us massively more powerful optimizers, and it’s a standard feature for every AI. AIs have no physical, emotional, or social needs. They do not sleep, or rest, or get bored or distracted. On current hardware, they can perform more serial operations per second than a human by a factor of 10,000,000.
An AI that gets even a little bit smarter than a human will out-optimize us, recursive self-improvement or not. It will get whatever it has been programmed to want, and it will devote every possible resource it can acquire to doing so.
But in my opinion the difference here also implies that while nothing will distract them, there will also be no incentive not to stop. Why would it do more than necessary to reach a goal?
Clippy’s cousin, Clip, is a paperclip satisficer. Clip has been programmed to create 100 paperclips. Unfortunately, the code for his utility function is approximately “ensure that there are 100 more paperclips in the universe than there were when I began running.”
Soon, our solar system is replaced with n+100 paperclips surrounded by the most sophisticated defenses Clip can devise. Probes are sent out to destroy any entity that could ever have even the slightest chance of leading to the destruction of a single paperclip.
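To spell out why a “satisficer” behaves this way, here is a minimal, purely illustrative sketch; the plans and probabilities are invented for the example, not drawn from any real design. An expected-utility chooser with a threshold utility still picks whichever plan maximizes the probability that the threshold stays met, and grabbing the solar system scores highest:

```python
# Clip's utility is 1 if "at least n+100 paperclips exist" holds, else 0.
# An expected-utility chooser therefore ranks plans purely by the probability
# that this threshold is (and remains) satisfied.

plans = {
    # hypothetical probability that the threshold is permanently satisfied
    "make 100 paperclips, then halt":             0.990,
    "make a million paperclips as safety margin": 0.99990,
    "convert the solar system, add defenses":     0.9999999,
}

def expected_utility(p_threshold_met: float) -> float:
    # 1 with probability p, 0 otherwise
    return 1.0 * p_threshold_met + 0.0 * (1.0 - p_threshold_met)

best_plan = max(plans, key=lambda name: expected_utility(plans[name]))
print(best_plan)  # -> "convert the solar system, add defenses"
```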
A further argument is that it will misunderstand its goals. But the problem I see here is, firstly, that the more unspecific the goal, the less able it is to measure its self-improvement against that goal to quantify the efficiency of its output.
The Hidden Complexity of Wishes and Failed Utopia #4-2 may be worth a look. The problem isn’t a lack of specificity, because an AI without a well-defined goal function won’t function. Rather, the danger is that the goal system we specify will have unintended consequences.
Secondly, the more vague a goal, the larger its general knowledge has to be, prior to any self-improvement, to make sense of it in the first place. Shouldn’t those problems outweigh each other to some extent?
Of course, if you told it to learn as much about the universe as possible, that is something completely different.
Acquiring information is useful for just about every goal. When there aren’t bigger expected marginal gains elsewhere, information gathering is better than nothing. “Learn as much about the universe as possible” is another standard feature for expected utility maximizers.
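A toy decision problem makes the point; the payoffs below are made-up numbers, and the only claim is the standard one that free information never lowers an expected-utility maximizer’s best achievable score:

```python
# Two equally likely worlds, two actions; each action pays off only in "its" world.
worlds = {"A": 0.5, "B": 0.5}
payoff = {
    ("act_for_A", "A"): 10, ("act_for_A", "B"): 0,
    ("act_for_B", "A"): 0,  ("act_for_B", "B"): 10,
}
actions = ["act_for_A", "act_for_B"]

# Acting blindly: commit to the single action with the best prior expectation.
blind = max(sum(p * payoff[(a, w)] for w, p in worlds.items()) for a in actions)

# Acting after observing which world it is in: pick the best action per world.
informed = sum(p * max(payoff[(a, w)] for a in actions) for w, p in worlds.items())

print(blind, informed)  # 5.0 10.0 -- gathering the information weakly dominates acting blindly
```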
And this is all before taking into account self-improvement, utility functions that are unstable under self-modification, and our dear friend FOOM.
TL;DR:
Agents that aren’t made of meat will actually maximize utility.
Writing a utility function that actually says what you think it does is much harder than it looks.
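Be afraid.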
Upvoted, thanks! Very concise and clearly put. This is so far the best scary reply I’ve gotten, in my opinion. It reminds me strongly of the resurrected vampires in Peter Watts’s novel Blindsight. They are depicted as natural human predators, a superhuman psychopathic Homo genus with minimal consciousness (more raw processing power instead) that can, for example, hold both aspects of a Necker cube in their heads at the same time. Humans resurrected them with a deficit that was supposed to make them controllable and dependent on their human masters. But of course that’s like a mouse trying to keep a cat as a pet. I think that novel shows better than any other literature how dangerous just a little more intelligence can be. It quickly becomes clear that humans are just like little Jewish girls facing a Waffen SS squadron, believing the soldiers will go away if they only close their eyes.
My favorite problem with this entire thread is that it’s basically arguing that even the very first test cases will destroy us all. In reality, nobody puts in a grant application to construct an intelligent being inside a computer with the goal of creating 100 paperclips. They put in the grant to ‘dominate the stock market’, or ‘defend the nation’, or ‘cure death’. And if they don’t, then the Chinese government, which stole the code, will, or that open-source initiative will, or the independent South African development effort will, because there are enormous incentives to do so.
At best, boxing an AI with trivial, pointless tasks only delays the more dangerous versions.
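I like to think that Skynet got its start through creative interpretation of a goal like “ensure world peace”. ;-)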
“How is it that the AGI is smart enough to learn all this by itself, yet fails to notice that there are rules to follow?” Because there is no reason for an AGI to automagically create arbitrary restrictions if they aren’t part of the goal or superior to the goal. For example, I’m quite sure that the F1 rules prohibit interfering with drivers during the race; but if somehow a silicon-reaction-speed AGI can’t win F1 by default, it may find it simpler/quicker to harm the opponents in one of the infinitely many ways the F1 rules don’t cover: say, getting some funds through financial arbitrage, buying out the other teams, and firing all the good drivers, or engineering a virus that halves the reaction speed of all Homo sapiens. Then it would be happy, as the goal is achieved within the rules.
...because there is no reason for an AGI to automagically create arbitrary restrictions if they aren’t part of the goal or superior to the goal.
That’s clear. But let me restate what I’d like to inquire about. Given the large number of restrictions that are inevitably part of any advanced general intelligence (AGI), isn’t the non-hazardous subset of all possible outcomes much larger than the subset where the AGI works perfectly yet fails to stop before it can wreak havoc?

Here is where this question comes from. Given my current knowledge about AGI, I believe that any AGI capable of dangerous self-improvement will be very sophisticated, including a lot of restrictions. For example, I believe that any self-improvement can only be as efficient as the specification of its output is detailed. If, for example, the AGI is built with the goal of producing paperclips, the design specification of what a paperclip is will be used as the yardstick by which to measure and quantify any improvement in the AGI’s output. This means that to be able to effectively self-improve up to a superhuman level, the design specifications will have to be highly detailed and will, by definition, include sophisticated restrictions.

Therefore, to claim that any work on AGI will almost certainly lead to dangerous outcomes is to assert that any given AGI is likely to work perfectly well, subject to all restrictions except the one that makes it stop (spatiotemporal scope boundaries). I’m unable to arrive at that conclusion, as I believe that most AGIs will fail at extensive self-improvement, because that is where failure is most likely: it is the largest and most complicated part of the AGI’s design parameters.

To put it bluntly, why is it more likely that contemporary AGI research will succeed at superhuman self-improvement (beyond learning) yet fail to limit the AGI, rather than vice versa? As I see it, given the larger number of parameters required to self-improve in the first place, it is more likely that most AGI research will result in incremental steps towards human-level intelligence than in one huge step towards superhuman intelligence that fails only at its scope boundary and not at self-improvement.
What you are envisioning is not an AGI at all, but a narrow AI. If you tell an AGI to make paperclips, but it doesn’t know what a paperclip is, then it will go and find out, using whatever means it has available. It won’t give up just because you weren’t detailed enough in telling it what you wanted.
Then I don’t think that there is anyone working on what you are envisioning as ‘AGI’ right now. If a superhuman level of sophistication regarding the potential for self-improvement is already part of your definition then there is no argument to be won or lost here regarding risk assessment of research on AGI. I do not believe this is reasonable or that AGI researchers share your definition. I believe that there is a wide range of artificial general intelligence that does not suit your definition yet deserves this terminology.
Who said anything about a superhuman level of sophistication? Human-level is enough. I’m reasonably certain that if I had the same advantages an AGI would have—that is, if I were converted into an emulation and given my own source code—then I could foom. And I think any reasonably skilled computer programmer could, too.
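Debugging will be a PITA. Both ways.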
Yes, but after the AGI finds out what a paperclip is, it will then, if it is an AGI, start questioning why it was designed with the goal of building paperclips in the first place. And that’s where the friendly AI fallacy falls apart.
Anissimov posted a good article on exactly this point today. AGI will only question its goals according to its cognitive architecture, and come to a conclusion about its goals depending on its architecture. It could “question” its paperclip-maximization goal and come to a “conclusion” that what it really should do is tile the universe with foobarian holala.
So what? An agent with a terminal value (building paperclips) is not going to give it up, not for anything. That’s what “terminal value” means. So the AI can reason about human goals and the history of AGI research. That doesn’t mean it has to care. It cares about paperclips.
That doesn’t mean it has to care. It cares about paperclips.
It has to care, because if there is the slightest motivation in its goal system to stop (parameters for spatiotemporal scope boundaries), then it won’t care to continue anyway. I don’t see where the incentive to override certain parameters of its goals would come from. As Anissimov said, “If an AI questions its values, the questioning will have to come from somewhere.”
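Exactly? I think we agree about this.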
It won’t care unless it’s been programmed to care (for example by adding “spatiotemporal scope boundaries” to its goal system). It’s not going to override a terminal goal, unless it conflicts with a different terminal goal. In the context of an AI that’s been instructed to “build paperclips”, it has no incentive to care about humans, no matter how much “introspection” it does.
If you do program it to care about humans then obviously it will care. It’s my understanding that that is the hard part.