Eliezer Yudkowsky: In other words, none of this is for mature superintelligent Friendly AIs, who can work out on their own how to safeguard themselves.
Right, I understood that this “injunction” business is only supposed to cover the period before the AI has attained maturity.
If I’ve understood your past posts, an FAI is mature only if, whenever we wouldn’t want it to perform an action that it’s contemplating, it (1) can figure that out and (2) will therefore not perform the action. (Lots of your prior posts, for example, dealt with unpacking what the “wouldn’t want” here means.)
You’ve warned against thinking of the injunction-executor as a distinct AI. So the picture I now have is that the “injunctions” are a suite of forbidden-thought tests. The immature AI is constantly running this suite of tests on its own actual thinking. (In particular, we assume that it’s smart and self-aware enough to do this accurately so long as it’s immature.) If one of the tests comes up positive, the AI runs a procedure to shut itself down. So long as the AI is immature, it cannot edit this suite, refrain from running the tests, or interfere with the shutdown procedure that follows a positive test. (Maybe it won’t do these things because the suite itself forbids contemplating them, which gets into some of the recursive issues that you’ve mentioned, but I ignore these for now.)
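To make sure we’re picturing the same mechanism, here’s a toy Python sketch of my reading. Every name in it is my own invention, not anything from your posts; it’s only meant to pin down the picture of a hardcoded, uneditable test suite that checks every step of thinking and triggers a shutdown on a positive match.

```python
# Toy sketch of the injunction picture as I understand it.
# All names here (ForbiddenThoughtTest, ImmatureAI, etc.) are hypothetical,
# invented purely for illustration.

class ForbiddenThoughtTest:
    """One test in the hardcoded suite; fires if the current cognitive
    state matches a forbidden pattern."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate

    def triggered_by(self, cognitive_state):
        return self.predicate(cognitive_state)


class ImmatureAI:
    def __init__(self, injunction_suite):
        # The suite is fixed by the programmers; while immature, the AI
        # cannot edit it, skip running it, or interfere with the shutdown.
        self._injunctions = tuple(injunction_suite)
        self.running = True

    def think(self, cognitive_state):
        # Every step of actual thinking gets checked against the suite.
        for test in self._injunctions:
            if test.triggered_by(cognitive_state):
                self._shut_down(test)
                return
        self._continue_reasoning(cognitive_state)

    def _shut_down(self, test):
        # A positive test unconditionally halts the system.
        print(f"Injunction '{test.name}' triggered; shutting down.")
        self.running = False

    def _continue_reasoning(self, cognitive_state):
        pass  # ordinary goal-directed reasoning would go here
```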
If that’s an accurate understanding of how you want these injunctions to operate, I still have the same difficulty that I mentioned in my last comment. I’ll try to explain my problem again:
If the AI can’t edit the test suite, it must have been hardcoded by the programmers, right? This is where I see the danger. The tests may not accurately capture the programmers’ intentions, since the programmers didn’t have the help of a mature FAI. Which means that the tests are probably vulnerable to all the unknown unknowns that have bedeviled other efforts to devise tests for complicated systems. You say, “The presumption and point of an injunction is that you can describe the error condition more simply than the decision system that produces it.” It may be simpler, but is it simple enough that we can trust ourselves to do it right? Given examples such as the tank-recognition story, I find it hard to muster much confidence.
I’m especially skeptical that we can count on our test-writing ability when we’re talking about a super-intelligent not-yet-friendly AI. You say, “you would expect a positive intention to bypass the injunction, or reasoning about how to bypass the injunction, to also be recognizable within the system.” Okay, but we also expect that the AI will be routinely trying to work around the injunctions. That is, the AI should be reasoning like the following all the time:
“I want to accomplish X. I wonder if I should bring X about by doing Y. Oh, wait a minute, I can foresee that if I did Y, I might find myself thinking one of the forbidden thoughts. I’d better find another way to accomplish X.”
So, the AI will always be devoting its vast super-intelligence to circumventing the injunctions, because “circumventing the injunctions” just means “figure out how to achieve my goals without being shut down by the injunctions.”
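In other words, the foreseen-injunction check is just one more constraint inside the AI’s ordinary planning loop. A toy sketch of what I mean (again, the names and helper functions are my own, purely for illustration):

```python
def choose_plan(candidate_plans, expected_value, predicts_forbidden_thought):
    """Toy planner illustrating the point above; all names are hypothetical.

    predicts_forbidden_thought(plan) stands in for the AI's own forecast of
    whether carrying out the plan would put it into a forbidden cognitive
    state; expected_value(plan) is its ordinary estimate of how well the
    plan serves the current goal."""
    # Routing around the injunctions is just part of ordinary planning:
    # discard any plan foreseen to trigger a shutdown, keep the best survivor.
    viable = [p for p in candidate_plans if not predicts_forbidden_thought(p)]
    return max(viable, key=expected_value, default=None)
```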
Since the AI is super-intelligent, we should anticipate that it will find circumventions that we didn’t foresee. Often this will be a good thing: the AI will be figuring out how to accomplish its goals without doing evil. After all, that’s the nature of a lot of ethical reasoning.
But maybe the AI will find a circumvention that we fervently wouldn’t have wanted, had it occurred to us. By hypothesis, the AI isn’t a mature FAI yet, so we can’t count on it to figure out that we would have forbidden that circumvention. Or the AI might just not care yet.
So, given your eloquent warnings about the danger (I don’t say “impossibility”, since we’re supposed to do those ;) ) of trying to hardcode AIs to be friendly, where do you find the confidence that we mere humans could pull off even hardcoding these injunctions?