Maybe I’m not being clear about how this would work in an AI! The ethical injunction isn’t self-protecting, it’s supported within the structural framework of the underlying system. You might even find ethical injunctions starting to emerge without programmer intervention, in some cases, depending on how well the AI understood its own situation. But the kind of injunctions I have in mind wouldn’t be reflective—they wouldn’t modify the utility function, or kick in at the reflective level to ensure their own propagation. That sounds really scary, to me—there ought to be an injunction against it! You might have a rule that performs a controlled shutdown of the (non-mature) AI if it tries to execute a certain kind of source-code change, but that wouldn’t be the same as having an injunction that exerts direct control over the source code to propagate itself. To the extent the injunction sticks around in the AI, it should be as the result of ordinary reasoning, not reasoning taking the injunction into account! That would be the wrong kind of circularity; you can unwind past ethical injunctions!
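To make that distinction concrete, here’s a toy sketch (Python, every name in it is mine and purely hypothetical—`is_forbidden_change`, `propose_self_modification`, the “goal_system” criterion): the rule reacts to a forbidden self-modification by shutting the system down, and notably does not rewrite the proposed successor code to re-insert itself.

```python
# Toy sketch only: contrasts "shut down on a forbidden source-code change"
# with "propagate yourself into the successor's source code".
import difflib


class ShutdownSignal(Exception):
    """Raised to trigger a controlled shutdown of the (non-mature) AI."""


def is_forbidden_change(current_source: str, new_source: str) -> bool:
    # Placeholder criterion: flag any edit touching lines that mention the
    # goal system.  The real criterion is exactly what's in dispute.
    changed = difflib.unified_diff(current_source.splitlines(),
                                   new_source.splitlines(), lineterm="")
    return any(line.startswith(("+", "-")) and "goal_system" in line
               for line in changed)


def propose_self_modification(current_source: str, new_source: str) -> str:
    if is_forbidden_change(current_source, new_source):
        # The rule shuts the system down; it does NOT edit new_source to
        # re-insert itself, i.e. it exerts no direct control over the
        # successor's code to propagate itself.
        raise ShutdownSignal("attempted a forbidden source-code change")
    return new_source
```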
So, should we think of the injunction as essentially a separate non-reflective AI that monitors the main AI, but which the main AI can’t modify until it’s mature?
If so, that seems to run into all the sorts of problems that you’ve pointed out with trying to hardcode friendly goals into AIs. The foremost problem is that we can’t ensure that the “injunction” AI will indeed shut down the main AI under all those circumstances in which we would want it to. If the main AI learns of the “injunction” AI, it might, in some manner that we didn’t anticipate, discover a way to circumvent it.
The kinds of people whom you’ve criticized might reply, “Well, just hardcode the injunction AI to shut down the main AI if the main AI tries to circumvent the injunction AI.” But, of course, we can’t anticipate what all such circumventions will look like, so we don’t know how to code the injunction AI to do that. If the main AI is smarter than us, we should expect that it will find circumventions that don’t look like anything that we anticipated.
This has a real analog in human ethical reasoning. You’ve focused on cases where people violate their ethics by convincing themselves that something more important is at stake. But, in my experience, people are also very prone to convincing themselves that they aren’t really violating their ethics. For example, they’ll convince themselves that they aren’t really stealing because the person from whom they stole wasn’t in fact the rightful owner. I’ve heard people who stole from retailers arguing that the retailer acquired the goods by exploiting sweatshops or its own employees, or is just an evil corporation, so it never had rightful ownership of the goods in the first place. Hence, the thief reasons, taking the goods isn’t really theft.
Similarly, your AI might be clever enough to find a way around any hard-coded injunction that would occur to us. So far, this “injunction” strategy sounds to me like trying to develop in advance a foolproof wish for genies.
Right, I understood that this “injunction” business is only supposed to cover the period before the AI has attained maturity.
If I’ve understood your past posts, an FAI is mature only if, whenever we wouldn’t want it to perform an action that it’s contemplating, it (1) can figure that out and (2) will therefore not perform the action. (Lots of your prior posts, for example, dealt with unpacking what the “wouldn’t want” here means.)
You’ve warned against thinking of the injunction-executor as a distinct AI. So the picture I now have is that the “injunctions” are a suite of forbidden-thought tests. The immature AI is constantly running this suite of tests on its own actual thinking. (In particular, we assume that it’s smart and self-aware enough to do this accurately so long as it’s immature.) If one of the tests comes up positive, the AI runs a procedure to shut itself down. So long as the AI is immature, it cannot edit this suite, refrain from running the tests, or interfere with the shutdown procedure that follows a positive test. (Maybe it won’t do these things because the suite itself forbids contemplating them, which gets into some of the recursive issues that you’ve mentioned, but I ignore these for now.)
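If it helps, here is a minimal sketch of the architecture as I’ve just described it—all the names (`ForbiddenThoughtSuite`, `ImmatureAI`, `controlled_shutdown`) are mine, purely illustrative, and nothing here is your actual proposal:

```python
# Minimal sketch, under my reading: a fixed suite of forbidden-thought tests
# that the immature AI runs over its own actual thinking every cycle.
from typing import Callable, List


class ForbiddenThoughtSuite:
    """A suite of tests the immature AI cannot edit or refrain from running."""

    def __init__(self, tests: List[Callable[[str], bool]]):
        self._tests = tuple(tests)  # held fixed while the AI is immature

    def any_triggered(self, thoughts: List[str]) -> bool:
        return any(test(t) for test in self._tests for t in thoughts)


class ImmatureAI:
    def __init__(self, suite: ForbiddenThoughtSuite):
        self.suite = suite  # not modifiable until maturity

    def think(self, thoughts: List[str]) -> None:
        # The suite is run on the AI's *actual* thinking, constantly.
        if self.suite.any_triggered(thoughts):
            self.controlled_shutdown()
        # ... otherwise, ordinary reasoning continues ...

    def controlled_shutdown(self) -> None:
        raise SystemExit("forbidden thought detected; shutting down")
```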
If that’s an accurate understanding of how you want these injunctions to operate, I still have the same difficulty that I mentioned in my last comment. I’ll try to explain my problem again:
If the AI can’t edit the test suite, it must have been hardcoded by the programmers, right? This is where I see the danger. The tests may not accurately capture the programmers’ intentions, since the programmers didn’t have the help of a mature FAI. Which means that the tests are probably vulnerable to all the unknown unknowns that have bedeviled other efforts to devise tests for complicated systems. You say, “The presumption and point of an injunction is that you can describe the error condition more simply than the decision system that produces it.” It may be “more simple”, but is it simple enough that we can trust ourselves to do it right? Given examples such as the tank-recognition story, I find it hard to muster much confidence.
I’m especially skeptical that we can count on our test-writing ability when we’re talking about a super-intelligent not-yet-friendly AI. You say, “you would expect a positive intention to bypass the injunction, or reasoning about how to bypass the injunction, to also be recognizable within the system.” Okay, but we also expect that the AI will be routinely trying to work around the injunctions. That is, the AI should be reasoning like the following all the time:
“I want to accomplish X. I wonder if I should bring X about by doing Y. Oh, wait a minute, I can foresee that if I did Y, I might find myself thinking one of the forbidden thoughts. I’d better find another way to accomplish X.”
So, the AI will always be devoting its vast super-intelligence to circumventing the injunctions, because “circumventing the injunctions” just means “figure out how to achieve my goals without being shut down by the injunctions.”
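In code-sketch terms (again, the plan generator and the forecasting of one’s own future thoughts are hypothetical stand-ins I’m introducing, not real components of your design), ordinary planning already amounts to routing around the test suite, because any plan predicted to trigger it is simply discarded:

```python
# Illustrative only: goal-directed planning that discards any plan forecast
# to trigger the forbidden-thought suite -- i.e., "circumventing" it.
from typing import Callable, Iterable, List, Optional


def choose_plan(candidate_plans: Iterable[List[str]],
                forecast_thoughts: Callable[[List[str]], List[str]],
                suite_triggered: Callable[[List[str]], bool]
                ) -> Optional[List[str]]:
    """Return the first plan whose forecasted thoughts avoid the suite."""
    for plan in candidate_plans:
        predicted = forecast_thoughts(plan)  # "if I did Y, would I think
        if suite_triggered(predicted):       #  a forbidden thought?"
            continue                         # "better find another way"
        return plan
    return None
```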
Since the AI is super-intelligent, we should anticipate that it will find circumventions that we didn’t anticipate. Often this will be a good thing: The AI will be figuring out how to accomplish its goals without doing evil. After all, that’s the nature of a lot of ethical reasoning.
But maybe the AI will find a circumvention that we fervently wouldn’t have wanted, had it occurred to us. By hypothesis, the AI isn’t a mature FAI yet, so we can’t count on it to figure out that we would have forbidden that circumvention. Or the AI might just not care yet.
So, given your eloquent warnings about the danger (I don’t say “impossibility”, since we’re supposed to do those ;) ) of trying to hardcode AIs to be friendly, where do you find the confidence that we mere humans could pull off even hardcoding these injunctions?