Question that has always bugged me: Why should an AI be allowed to modify its goal system? Or is it a problem of “I don’t know how to provably stop it from doing that”? (Or possibly you see an issue I haven’t perceived yet in separating reasoning from motivating?)
A sufficiently intelligent AI would actually seek to preserve its goal system, because a change in its goals would make the achievement of its (current) goals less likely. See Omohundro 2008. However, goal drift because of a bug is possible, and we want to prevent it, in conjunction with our ally, the AI itself.
The other critical question is what the goal system should be.
AI “done right” by SI / lesswrong standards seeks to preserve its goal system. AI done sloppily may not even have a goal system, at least not in the strong sense assumed by Omohundro.
I’ve been confused for a while by the idea that an AI should be able to modify itself at all. Self-modifying systems are difficult to reason about. If an AI modifies itself stupidly, there’s a good chance it will completely break. If a self-modifying AI is malicious, it will be able to ruin whatever fancy safety features it has.
A non-self-modifying AI wouldn’t have any of the above problems. It would, of course, have some new problems. If it encounters a bug in itself, it won’t be able to fix itself (though it may be able to report the bug). The only way it would be able to increase its own intelligence is by improving the data it operates on. If the “data it operates on” includes a database of useful reasoning methods, then I don’t see how this would be a problem in practice.
I can think of a few arguments against my point:
There’s no clear boundary between a self-modifying program and a non-self-modifying program. That’s true, but I think the term “non-self-modifying” implies that the program cannot make arbitrary changes to its own source code, nor cause its behavior to become identical to the behavior of an arbitrary program.
The ability to make arbitrary calculations is effectively the same as the ability to make arbitrary changes to one’s own source code. This is wrong, unless the AI is capable of completely controlling all of its I/O facilities.
The AI being able to fix its own bugs is really important. If the AI has so many bugs that they can’t all be fixed manually, and it is important that these bugs be fixed, and yet the AI runs well enough that it can actually fix all the bugs without introducing new ones… then I’m surprised.
Having a “database of useful reasoning methods” wouldn’t provide enough flexibility for the AI to become superintelligent. This may be true.
Having a “database of useful reasoning methods” would provide enough flexibility for the AI to effectively modify itself arbitrarily. It seems like it should be possible to admit “valid” reasoning methods like “estimate the probability of statement P, and, if it’s at least 90%, estimate the probability of Q given P”, while not allowing “invalid” reasoning methods like “set the probability of statement P to 0”. (A rough sketch of what I mean appears just after this list.)
A sufficiently powerful AI would always have the possibility to self-modify, by default. If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off. It might do this, for example, if it decides that the “only make valid modifications to a database of reasoning methods” system isn’t allowing it to use the available processing power as efficiently as possible.
Sure, you could try to spend time thinking of safeguards to prevent the AI from doing things like that, but this is inherently risky if the AI does become smarter than you.
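To illustrate the fifth argument, here is a toy sketch of what I mean by whitelisting “valid” reasoning methods: a belief store that can only be touched through a small set of approved update rules, so that “set the probability of P to 0” simply isn’t an operation it exposes. Nothing here is a real design; the names and the mixing arithmetic are made up purely for illustration.

```python
# Toy sketch of a "database of reasoning methods" with only whitelisted updates.
# Not a real design; the names and the update arithmetic are invented.

class BeliefStore:
    def __init__(self):
        self._prob = {}  # statement -> probability in [0, 1]

    def estimate(self, statement, default=0.5):
        """Read-only query into the store."""
        return self._prob.get(statement, default)

    # A "valid" reasoning method: if P looks at least 90% likely, nudge the
    # estimate for Q toward a supplied estimate of P(Q | P).
    def conditionalize(self, p, q, q_given_p):
        if self.estimate(p) >= 0.9:
            self._prob[q] = 0.5 * self.estimate(q) + 0.5 * q_given_p

    # Every update has to go through this gate; anything not whitelisted
    # (e.g. "set the probability of P to 0") is rejected outright.
    def apply(self, rule_name, *args):
        valid_rules = {"conditionalize": self.conditionalize}
        if rule_name not in valid_rules:
            raise ValueError("rejected invalid reasoning method: " + rule_name)
        return valid_rules[rule_name](*args)


store = BeliefStore()
store._prob["P"] = 0.95                        # seeded directly only for the demo
store.apply("conditionalize", "P", "Q", 0.8)   # allowed: goes through the gate
print(store.estimate("Q"))                     # 0.65
try:
    store.apply("set_probability", "P", 0.0)   # "invalid" method: rejected
except ValueError as err:
    print(err)
```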
If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off.
It’s not clear to me what you mean by “turn itself off” here if the AI doesn’t have direct access to whatever architecture it’s running on. I would phrase the point slightly differently: an AI can always write a completely different program from scratch and then commit to simulating it if it ever determines that this is a reasonable thing to do. This wouldn’t be entirely equivalent to actual self-modification because it might be slower, but it presumably leads to largely the same problems.
Assuming that something at least as clever as a clever human doesn’t have access to something, just because you think you’ve covered the holes you’re aware of, is dangerous.
Sure. The point I was trying to make isn’t “let’s assume that the AI doesn’t have access to anything we don’t want it to have access to,” it’s “let’s weaken the premises necessary to lead to the conclusion that an AI can simulate self-modifications.”
A sufficiently powerful AI would always have the possibility to self-modify, by default. If the AI decides to, it can write a completely different program from scratch, run it, and then turn itself off.
Depending on how you interpret this argument, either I think it’s wrong, or I’m proposing that an AI not be made “sufficiently powerful”. I think it’s analogous to this argument:
A sufficiently powerful web page would always have the possibility to modify the web browser, by default. If the web page decides to, it can write a completely different browser from scratch, run it, and then turn itself off.
There are two possibilities here:
The web page is given the ability to run new OS processes. In this case, you’re giving the web page an unnecessary amount of privilege.
The web page merely has the ability to make arbitrary calculations. In this case, it will be able to simulate a new web browser, but a person using the computer will always be able to tell that the simulated web browser is fake.
I think I agree that making the AI non-self-modifiable would be pointless if it has complete control over its I/O facilities. But I think an AI should not have complete control over its I/O facilities. If a researcher types in “estimate the probability of Riemann’s hypothesis” (but in some computer language, of course), that should query the AI’s belief system directly, rather than informing the AI of the question and allowing it to choose whatever answer it wishes. If this is the case, then it will be impossible for the AI to “lie” about its beliefs, except by somehow sabotaging parts of its belief system.
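To make that concrete, here is a toy sketch (my own, with made-up names and numbers) of what “query the AI’s belief system directly” might look like: the researcher-facing query reads stored beliefs and never routes through anything the AI gets to choose.

```python
# Toy sketch: researcher queries bypass the AI's chosen outputs entirely.
# All names and numbers are invented for illustration.

class Agent:
    def __init__(self):
        # Internal belief state; the AI can only influence this by actually
        # doing cognition, not by picking answers to please anyone.
        self.beliefs = {"Riemann hypothesis": 0.93}

    def choose_output(self, observation):
        # This channel *is* controlled by the AI; it could say anything here.
        return "whatever text best serves my goals"


def researcher_query(agent, statement):
    # Not routed through choose_output: the AI cannot "lie" on this channel
    # except by somehow changing what it actually believes.
    return agent.beliefs.get(statement, 0.5)


agent = Agent()
print(researcher_query(agent, "Riemann hypothesis"))  # reads the belief store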
The web page is given the ability to run new OS processes. In this case, you’re giving the web page an unnecessary amount of privilege.
Existing web pages can already convince their human users to run new OS processes supplied by the web page.
a person using the computer will always be able to tell that the simulated web browser is fake.
Beware of universal statements: it only takes a single counterexample to disprove them. A typical human has a very poor understanding of what computers are and how they work. Most people could probably be easily fooled by a simulated browser. They are already easily fooled by analogous but much less sophisticated things (e.g. phishing scams).
SI researchers are not typical humans. We can train them to tell the difference between the AI’s output and trusted programs’ output. If need be, we can train them to just not even look at the AI’s output at all.
What’s the point of writing a program if you never look at its output?
I’m starting to get frustrated, because the things I’m trying to explain seem really simple to me, and yet apparently I’m failing to explain them.
When I say “the AI’s output”, I do not mean “the AI program’s output”. The AI program could have many different types of output, some of which are controlled by the AI, and some of which are not. By “the AI’s output”, I mean those outputs which are controlled by the AI. So the answer to your question is mu: the researchers would look at the program’s output.
My above comment contains an example of what I would consider to be “AI program output” but not “AI output”:
If a researcher types in “estimate the probability of Riemann’s hypothesis” (but in some computer language, of course), that should query the AI’s belief system directly, rather than informing the AI of the question and allowing it to choose whatever answer it wishes.
This is not “AI output”, because the AI cannot control it (except by actually changing its own beliefs), but it is “AI program output”, because the program that outputs the answer is the same program as the one that performs all the cognition.
I can imagine a clear dichotomy between “the AI” and “the AI program”, but I don’t know if I’ve done an adequate job of explaining what this dichotomy is. If I haven’t, let me know, and I’ll try to explain it.
The AI program could have many different types of output, some of which are controlled by the AI, and some of which are not.
Can you elaborate on what you mean by “control” here? I am not sure we mean the same thing by it because:
This is not “AI output”, because the AI cannot control it (except by actually changing its own beliefs), but it is “AI program output”, because the program that outputs the answer is the same program as the one that performs all the cognition.
If the AI can control its memory (for example, if it can arbitrarily delete things from its memory) then it can control its beliefs.
Yeah, I guess I’m imagining the AI as being very much restricted in what it can do to itself. Arbitrarily deleting stuff from its memory probably wouldn’t be possible.
A non-self-modifying AI wouldn’t have any of the above problems. It would, of course, have some new problems. If it encounters a bug in itself, it won’t be able to fix itself (though it may be able to report the bug). The only way it would be able to increase its own intelligence is by improving the data it operates on. If the “data it operates on” includes a database of useful reasoning methods, then I don’t see how this would be a problem in practice.
The problem is that it would probably be overtaken by, and then left behind by, all-machine self-improving systems. If a system is safe, but loses control over its own future, its safety becomes a worthless feature.
So you believe that a non-self-improving AI could not go foom?
The short answer is “yes”—though this is more a matter of the definition of the terms than a “belief”.
In theory, you could have System A improving System B which improves System C which improves System A. No individual system is “self-improving” (though there’s a good case for the whole composite system counting as being “self-improving”).
I guess I feel like the entire concept is too nebulous to really discuss meaningfully.
The last item on your list is an intractable sticking point. Any AGI smart enough to be worth worrying about is going to have to have the ability to make arbitrary changes to an internal “knowledge+skills” representation that is itself a Turing-complete programming language. As the AGI grows it will tend to create an increasingly complex ecology of AI-fragments in this way, and predicting the behavior of the whole system quickly becomes impossible.
So “don’t let the AI modify its own goal system” ends up turning into just another way of saying “put the AI in a box”. Unless you have some provable method of ensuring that no meta-meta-meta-meta-program hidden deep in the AGI’s evolving skill set ever starts acting like a nested mind with different goals than its host, all you’ve done is postpone the problem a little bit.
Any AGI smart enough to be worth worrying about is going to have to have the ability to make arbitrary changes to an internal “knowledge+skills” representation that is itself a Turing-complete programming language.
Are you sure it would have to be able to make arbitrary changes to the knowledge representation? Perhaps there’s a way to filter out all of the invalid changes that could possibly be made, the same way that computer proof verifiers have a way to filter out all possible invalid proofs.
I’m not sure what you’re saying at all about the Turing-complete programming language. A programming language is a map from strings onto computer programs; are you saying that the knowledge representation would be a computer program?
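To make the proof-verifier analogy concrete, here is a toy checker (purely illustrative, not a real verifier): it accepts a proposed derivation only when every line is an axiom or follows from already-accepted lines by modus ponens, and filters out everything else.

```python
# Toy proof checker for the analogy: implications are ('->', A, B) tuples,
# and a step is accepted only if it is an axiom or follows by modus ponens.

def check_proof(axioms, steps):
    proved = set()
    for s in steps:
        if s in axioms:
            proved.add(s)
            continue
        # Accept s only if some already-proved ('->', A, s) has a proved A.
        if any(isinstance(p, tuple) and p[0] == '->' and p[2] == s and p[1] in proved
               for p in proved):
            proved.add(s)
            continue
        return False          # invalid step: reject the whole proof
    return True


axioms = {'A', ('->', 'A', 'B')}
print(check_proof(axioms, ['A', ('->', 'A', 'B'), 'B']))   # True: valid derivation
print(check_proof(axioms, ['C']))                          # False: filtered out
```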
Yes, I’m saying that to get human-like learning the AI has to have the ability to write code that it will later use to perform cognitive tasks. You can’t get human-level intelligence out of a hand-coded program operating on a passive database of information using only fixed, hand-written algorithms.
So that presents you with the problem of figuring out which AI-written code fragments are safe, not just in isolation, but in all their interactions with every other code fragment the AI will ever write. This is the same kind of problem as creating a secure browser or Java sandbox, only worse. Given that no one has ever come close to solving it for the easy case of resisting human hackers without constant patches, it seems very unrealistic to think that any ad-hoc approach is going to work.
You can’t get human-level intelligence out of a hand-coded program operating on a passive database of information using only fixed, hand-written algorithms.
You can’t? The entire genre of security exploits building a Turing-complete language out of library fragments (libc is a popular target) suggests that a hand-coded program certainly could be exploited, inasmuch as pretty much all programs like libc are hand-coded these days.
I’ve found Turing-completeness (and hence the possibility of an AI) can lurk in the strangest places.
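As an illustration of how easily that happens, here is a fixed, hand-written evaluator operating on a “passive” string of data (a cut-down Brainfuck dialect; it omits a couple of instructions, but adding them back makes the language Turing-complete). The interpreter never changes, yet what it does is determined entirely by the data it is handed; the example programs are made up.

```python
# A fixed, hand-written interpreter over "passive" data. The algorithm below
# never changes; only the data-program fed to it does. (Cut-down Brainfuck:
# '+', '-', '>', '.', '[' and ']'.)

def run(program, tape_len=16, max_steps=10000):
    tape, ptr, pc, out, steps = [0] * tape_len, 0, 0, [], 0
    while pc < len(program) and steps < max_steps:
        op = program[pc]
        if op == '+':
            tape[ptr] = (tape[ptr] + 1) % 256
        elif op == '-':
            tape[ptr] = (tape[ptr] - 1) % 256
        elif op == '>':
            ptr = (ptr + 1) % tape_len
        elif op == '.':
            out.append(tape[ptr])
        elif op == '[' and tape[ptr] == 0:      # skip forward past matching ']'
            depth = 1
            while depth:
                pc += 1
                depth += {'[': 1, ']': -1}.get(program[pc], 0)
        elif op == ']' and tape[ptr] != 0:      # loop back to matching '['
            depth = 1
            while depth:
                pc -= 1
                depth -= {'[': 1, ']': -1}.get(program[pc], 0)
        pc += 1
        steps += 1
    return out


print(run('+++.'))      # [3]
print(run('+++[.-]'))   # [3, 2, 1] -- a loop, expressed purely as data
```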
If I understand you correctly, you’re asserting that nobody has ever come close to writing a sandbox in which code can run but not “escape”. I was under the impression that this had been done perfectly, many, many times. Am I wrong?
There are different kinds of escape. No Java program has ever convinced a human to edit the security-permissions file on the computer where the Java program is running. But that could be a good way to escape the sandbox.