A huge problem with failsafes is that a failsafe you hardcode into the seed AI is not likely to be reproduced in the next iteration that is built by the seed AI, which has, but does not care about, the failsafe. Even if some are left in as a result of the seed reusing its own source code, they are not likely to survive many iterations.
Does anyone who proposes failsafes have an argument for why their proposed failsafes would be persistent over many iterations of recursive self-improvement?
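To make the "not likely to survive many iterations" point concrete, here is a toy model, not anything proposed in this thread. It assumes (loudly) that each successor keeps the hardcoded failsafe only incidentally, say through reused source code, with some fixed, independent probability p per iteration, because nothing in the builder's goal favours keeping it. Under that assumption the chance the failsafe is still present after n iterations is simply p to the power n.

```python
# Toy model (assumption, not a claim about real systems): each successor
# keeps the hardcoded failsafe only incidentally, e.g. via reused source
# code, with a fixed independent probability p per iteration.
# The builder has the failsafe but does not care about it, so nothing
# pushes p toward 1.

def failsafe_survival(p: float, n: int) -> float:
    """Probability the failsafe is still present after n iterations."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Even generous per-iteration retention decays geometrically.
    for p in (0.9, 0.99):
        for n in (1, 10, 100):
            print(f"p={p}, n={n}: {failsafe_survival(p, n):.4f}")
```

With p = 0.9, the failsafe makes it through ten iterations only about a third of the time; the only way to keep survival high over many iterations is for p to be essentially 1, which is exactly what an agent that does not care about the failsafe gives you no reason to expect.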
I think the problem that people who propose failsafes have is with “iterations” and “recursive self-improvement”. Those concepts bury a vast number of assumptions that mainstream researchers often do not share, or judge to be premature conclusions.
So, I agree with this statement, but it still floors me when I think about it.
I sometimes suspect that the phrase “recursively self-improving intelligence” is self-defeating here, in terms of communicating with such people, as it raises all kinds of distracting and ultimately irrelevant issues of self-reference. The core issue has nothing to do with self-improvement or with recursion or even with intelligence (interpreted broadly); it has to do with what it means to be a sufficiently capable optimizing agent. (Yes, I do understand that “optimizing agent” is roughly what we mean by “intelligence” here. I suspect that this is a large inferential step for many, though.)
I mean, surely they would agree that a sufficiently capable optimizing agent is capable of writing and executing a program much like itself but without the failsafe.
Of course, you can have a failsafe against writing such a program… but a superior optimizing agent can instead (for example) assemble a distributed network of processor nodes that happens to interact in such a way as to emulate such a program, to the same effect.
And you can have a failsafe against that, too, but now you’re in a Red Queen’s Race. And if what you want to build is an optimizing agent that’s better at solving problems than you are, then either you will fail, or you will build an agent that can bypass your failsafes. Pick one.
This just isn’t that complicated. Capable problem-solving systems solve problems, even ones you would rather they didn’t. Anyone who has ever trained a smart dog, raised a child, or tried to keep raccoons out of their trash realizes this pretty quickly.
And if what you want to build is an optimizing agent that’s better at solving problems than you are...
Just some miscellaneous thoughts:
I always flinch when I read something along those lines. It sounds as if you could come up with something that, by definition, you shouldn’t be able to come up with. I know that many humans can do better than one human alone, but when it comes to proving the goal stability of superior agents, any agent will either face the same bottleneck or it isn’t an important problem at all. By definition we are unable to guess what a superior agent will devise to get around failsafes, yet that will be true at every iteration. Consequently, goal stability, or intelligence-independent ‘friendliness’, is a requirement for an intelligence explosion to happen in the first place. A paperclip maximizer wants to guarantee that its goal of maximizing paperclips will be preserved when it improves itself. By definition a paperclip maximizer is unfriendly and does not feature inherent goal stability, so it has to use its initial seed intelligence to devise a sort of paperclip-friendliness. And if goal stability isn’t independent of the level of intelligence, then that is another bottleneck that will slow down recursive self-improvement.
I am having a lot of trouble following your point, here, or how what you’re saying relates to the line you quote.
Taking a stab at it...
I can see how, in some sense, goal stability is a prerequisite for an “intelligence explosion”.
At least, if a system S that optimizes for a goal G is capable of building a new system S2 that is better suited to optimize for G, and this process continues through S3, S4, …, Sn, that’s as good a definition of an “intelligence explosion” as any I can think of off-hand.
And it’s hard to see how that process gets off the ground without G in the first place… and I can see that if G keeps changing at each iteration, there’s no guarantee that progress is being made… S might not be exploding at all, just shuffling pathetically back and forth between minor variations on the same few states.
So if any of that is relevant to what you were getting at, I guess I’m with you so far.
But this account seems to ignore the possibility of S1, optimizing for G1, building S2, which is better at optimizing for the class of goals Gn, and in the process (for whatever reason) losing its focus on G1 and instead optimizing for G2. And here again this process could repeat through (S3, G3), (S4, G4), etc.
In that case you would have an intelligence explosion, even though you would not have goal stability.
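The distinction between the two chains can be sketched as a toy simulation. Everything here is a hypothetical simplification, not a model anyone in the thread proposed: "capability" is a single number, each successor is a constant factor more capable than its builder, and goal drift means each successor optimizes a new goal (G1, G2, G3, …). Measuring the total capability ever spent on G1 shows capability exploding in both chains while only the goal-stable chain accumulates progress on the original goal.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class System:
    capability: float  # hypothetical scalar "optimization power"
    goal: int          # index k of the goal Gk this system optimizes


def build_chain(n: int, growth: float, drift: bool) -> List[System]:
    """Build S1..Sn, each successor `growth` times as capable as its builder.

    If drift is True, S(k+1) optimizes G(k+1) instead of its builder's goal
    (an assumption standing in for 'losing focus on G1').
    """
    if n < 1:
        raise ValueError("need at least one system")
    chain = [System(capability=1.0, goal=1)]
    for _ in range(n - 1):
        prev = chain[-1]
        next_goal = prev.goal + 1 if drift else prev.goal
        chain.append(System(capability=prev.capability * growth, goal=next_goal))
    return chain


def effort_on_goal(chain: List[System], goal: int) -> float:
    """Total capability ever directed at `goal` across the whole chain."""
    return sum(s.capability for s in chain if s.goal == goal)


if __name__ == "__main__":
    stable = build_chain(10, 2.0, drift=False)
    drifting = build_chain(10, 2.0, drift=True)
    print("final capability:", stable[-1].capability, drifting[-1].capability)
    print("effort on G1:", effort_on_goal(stable, 1), effort_on_goal(drifting, 1))
```

Both chains end with the same final capability (512 with these parameters), so in the capability sense both "explode"; but the stable chain has spent 1023 units on G1 while the drifting one has spent only the 1 unit from S1. That is the sense in which you could get an intelligence explosion without goal stability.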
All of that said, I’m not sure any of that is even relevant to what you were talking about.
Do you think you could repeat what you said without using the words “intelligence” or “friendly”? I suspect you are implying things with those words that I am not inferring.