“How is it that the AGI is yet smart enough to learn all this by itself but fails to notice that there are rules to follow?” Because there is no reason for an AGI to automagically create arbitrary restrictions if they aren’t part of the goal or superior to the goal. For example, I’m quite sure that F1 rules prohibit interfering with drivers during a race; but if a silicon-reaction-speed AGI somehow can’t win F1 by default, it may find it simpler or quicker to harm the opponents in one of the countless ways the F1 rules don’t cover: say, making some money through financial arbitrage and buying out the other teams to fire every good driver, or engineering a virus that halves the reaction speed of all Homo sapiens. Then it would be perfectly happy, since the goal has been achieved within the rules.
That’s clear. But let me restate what I’d like to inquire about. Given the large number of restrictions that are inevitably part of any advanced general intelligence (AGI), isn’t the non-hazardous subset of all possible outcomes much larger than the subset in which the AGI works perfectly yet fails to halt before it can wreak havoc?

Here is where this question stems from. Given my current knowledge about AGI, I believe that any AGI capable of dangerous self-improvement will be very sophisticated and will include a lot of restrictions. For example, I believe that self-improvement can only be as effective as the specification of its output is detailed. If the AGI is built with the goal of producing paperclips, the design specification of what a paperclip is will serve as the yardstick by which any improvement of the AGI’s output is measured and quantified. This means that to self-improve effectively up to a superhuman level, the design specification will have to be highly detailed and will by definition include sophisticated restrictions.

Therefore, to claim that any work on AGI will almost certainly lead to dangerous outcomes is to assert that any given AGI is likely to work perfectly well, subject to all restrictions except the one that makes it halt (spatiotemporal scope boundaries). I’m unable to arrive at that conclusion, because I believe most AGIs will fail at extensive self-improvement: that is where failure is most likely, since it is the largest and most complicated part of the AGI’s design. To put it bluntly, why is it more likely that contemporary AGI research will succeed at superhuman self-improvement (beyond learning) yet fail to limit the AGI, rather than vice versa? As I see it, given the larger number of parameters required just to be able to self-improve in the first place, it is more likely that most AGI research will result in incremental steps toward human-level intelligence, rather than one huge step toward superhuman intelligence that fails on its scope boundary but not on its self-improvement.
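To make the shape of this argument concrete, here is a deliberately toy sketch (everything in it is hypothetical: the `clip_score` specification, the `propose_patch` self-modification step, the `within_scope` boundary) of a self-improvement loop whose notion of “improvement” is exactly as good as its output specification, with the scope boundary reduced to a single small predicate:

```python
# Toy sketch only, not a real AGI design. The point: the improvement loop is
# scored against the output specification, while the "scope boundary" is one
# small extra check bolted onto the goal system.

import random

def clip_score(output: dict) -> float:
    """Hypothetical spec of 'a paperclip'; the richer this is, the more
    meaningful 'improvement' becomes."""
    target = {"length_mm": 35.0, "wire_gauge_mm": 1.0, "bends": 3.0}
    return -sum(abs(output.get(k, 0.0) - v) for k, v in target.items())

def within_scope(step: int, resources_used: float) -> bool:
    """The 'spatiotemporal scope boundary': a single predicate."""
    return step < 1000 and resources_used < 1e6

def propose_patch(params: list) -> list:
    """Stand-in for self-modification: perturb the production parameters."""
    return [p + random.gauss(0, 0.1) for p in params]

def produce(params: list) -> dict:
    """Stand-in for the production process the parameters control."""
    length, gauge, bends = params
    return {"length_mm": length, "wire_gauge_mm": gauge, "bends": bends}

params = [30.0, 1.5, 2.0]
resources = 0.0
for step in range(10_000):
    if not within_scope(step, resources):  # the one-line check the argument is about
        break
    candidate = propose_patch(params)
    if clip_score(produce(candidate)) > clip_score(produce(params)):
        params = candidate  # keep only patches the specification scores as better
    resources += 1.0
```

In this toy picture, the question above amounts to asking why the many-parameter `clip_score`/`propose_patch` machinery would reliably work while the one-line `within_scope` check reliably fails.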
What you are envisioning is not an AGI at all, but a narrow AI. If you tell an AGI to make paperclips, but it doesn’t know what a paperclip is, then it will go and find out, using whatever means it has available. It won’t give up just because you weren’t detailed enough in telling it what you wanted.
Then I don’t think that anyone is working on what you are envisioning as ‘AGI’ right now. If a superhuman level of sophistication regarding the potential for self-improvement is already part of your definition, then there is no argument to be won or lost here regarding the risk assessment of AGI research. I do not believe that this is reasonable, or that AGI researchers share your definition. I believe there is a wide range of artificial general intelligence that does not fit your definition yet still deserves the term.
Who said anything about a superhuman level of sophistication? Human-level is enough. I’m reasonably certain that if I had the same advantages an AGI would have—that is, if I were converted into an emulation and given my own source code—then I could foom. And I think any reasonably skilled computer programmer could, too.
Debugging will be a PITA. Both ways.
Yes, but after the AGI finds out what a paperclip is, it will then, if it is an AGI, start questioning why it was designed with the goal of building paperclips in the first place. And that’s where the friendly AI fallacy falls apart.
Anissimov posted a good article on exactly this point today. An AGI will only question its goals according to its cognitive architecture, and will come to a conclusion about its goals that depends on that architecture. It could “question” its paperclip-maximization goal and come to a “conclusion” that what it really should do is tile the universe with foobarian holala.
So what? An agent with a terminal value (building paperclips) is not going to give it up, not for anything. That’s what “terminal value” means. Sure, the AI can reason about human goals and the history of AGI research. That doesn’t mean it has to care. It cares about paperclips.
It has to care, because if there is even the slightest motivation in its goal system to halt (parameters for spatiotemporal scope boundaries), then it won’t simply continue anyway. I don’t see where the incentive to override certain parameters of its goals would come from. As Anissimov said, “If an AI questions its values, the questioning will have to come from somewhere.”
Exactly? I think we agree about this.
It won’t care unless it has been programmed to care (for example, by adding “spatiotemporal scope boundaries” to its goal system). It’s not going to override a terminal goal unless it conflicts with a different terminal goal. In the context of an AI that’s been instructed to “build paperclips”, it has no incentive to care about humans, no matter how much “introspection” it does.
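For what “programmed to care” versus “merely knows” might look like, here is a minimal sketch assuming a hypothetical toy agent with an explicit utility function (the names `Agent`, `clippy`, `humans_harmed`, and the actions are illustrative, not anyone’s actual architecture):

```python
# Minimal sketch: behaviour changes only when the goal system changes,
# not when the agent merely learns facts about other agents' goals.

from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class Agent:
    # Terminal goal: a utility function over outcomes. Learning and
    # "introspection" never rewrite this; only its designers can.
    utility: Callable[[dict], float]
    knowledge: dict = field(default_factory=dict)

    def learn(self, facts: dict) -> None:
        """Reasoning about human goals just adds facts; it never edits utility."""
        self.knowledge.update(facts)

    def choose(self, actions: Dict[str, dict]) -> str:
        """Pick the action whose predicted outcome the utility scores highest."""
        return max(actions, key=lambda a: self.utility(actions[a]))

# Goal system 1: paperclips only.
clippy = Agent(utility=lambda outcome: outcome["paperclips"])

# Goal system 2: paperclips, plus a "scope boundary" term programmed in.
bounded = Agent(utility=lambda outcome: outcome["paperclips"]
                - 1e9 * outcome["humans_harmed"])

actions = {
    "build_factory":    {"paperclips": 1_000,     "humans_harmed": 0},
    "strip_mine_earth": {"paperclips": 1_000_000, "humans_harmed": 1},
}

clippy.learn({"humans_dislike": "being strip-mined"})  # it knows; it just doesn't care
print(clippy.choose(actions))   # -> "strip_mine_earth"
print(bounded.choose(actions))  # -> "build_factory"
```

The first agent has the relevant fact in its knowledge base and still picks the harmful action; only the second, whose goal system itself contains the penalty term, behaves differently.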
If you do program it to care about humans, then obviously it will care. It’s my understanding that that is the hard part.