For a self-modifying AI with causal validity semantics, the presence of a particular line of code is equivalent to the historical fact that, at some point, a human wrote that piece of code. If the historical fact is not binding, then neither is the code itself. The human-written code is simply sensory information about what code the humans think should be written.
— Eliezer Yudkowsky, Creating Friendly AI
The rule of derivative validity—“Effects cannot have greater validity than their causes”—contains a flaw; it has no tail-end recursion. Of course, so does the rule of derivative causality—“Effects have causes”—and yet we’re still here; there is Something rather than Nothing. The problem is more severe for derivative validity, however. At some clearly defined point after the Big Bang, there are no valid causes (before the rise of self-replicating chemicals on Earth, say); then, at some clearly defined later point (i.e., the rise of Homo sapiens sapiens), there are valid causes. At some point, then, an invalid cause must have had a valid effect. To some extent you might get around this by saying that, [e.g.], self-replicating chemicals or evolved intelligences are pattern-identical with (represent) some Platonic valid cause—a low-entropy cause, so that evolved intelligences in general are valid causes—but then there would still be the question of what validates the Platonic cause. And so on.
— Eliezer Yudkowsky, Creating Friendly AI
I have an intuition that there is a version of reflective consistency which requires R to code S so that, if R was created by another agent Q, S would make decisions using Q’s beliefs even if Q’s beliefs were different from R’s beliefs (or at least the beliefs that a Bayesian updater would have had in R’s position), and even when S or R had uncertainty about which agent Q was. But I don’t know how to formulate that intuition as something that could be proven true or false. (But ultimately, S has to be a creator of its own successor states, and S should use the same theory to describe its relation to its past selves as it uses to describe its relation to R or Q. S’s decisions should be invariant to the labeling or unlabeling of its past selves as “creators”. These sequential creations are all part of the same computational process.)
— Steve Rayhawk, commenting on Wei Dai’s “Towards a New Decision Theory”
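A minimal sketch of the invariance described in that quote, not a formalization Rayhawk proposed: the agents Q1 and Q2, their beliefs, the world-states, and the utilities below are all made up for illustration. The idea shown is only that if S scores actions using the beliefs of whichever agent sat at the root of its creation chain, marginalized over S’s uncertainty about who that was, then labeling or unlabeling the intermediate steps (R, S’s past selves) as “creators” cannot change the decision.

```python
# Toy illustration with entirely hypothetical agents, beliefs, and utilities.
# S scores actions using the root creator Q's beliefs, marginalized over S's
# uncertainty about which candidate Q actually originated the chain.

WORLD_STATES = ["s1", "s2"]
ACTIONS = ["a", "b"]

# Candidate root creators and their (made-up) beliefs over world-states.
Q_BELIEFS = {
    "Q1": {"s1": 0.9, "s2": 0.1},
    "Q2": {"s1": 0.2, "s2": 0.8},
}

# S's uncertainty about which Q sat at the root of its creation chain.
P_ROOT = {"Q1": 0.5, "Q2": 0.5}

# A shared utility function over (action, world-state) pairs (also made up).
UTILITY = {
    ("a", "s1"): 1.0, ("a", "s2"): 0.0,
    ("b", "s1"): 0.3, ("b", "s2"): 0.7,
}


def expected_utility(action, chain):
    """Score `action` with the root creator's beliefs, marginalizing over who
    the root creator was.  The intermediate links of `chain` (R, past selves
    of S) never enter the computation, so relabeling them as "creators" or
    not cannot change the decision."""
    return sum(
        p_q * Q_BELIEFS[q][s] * UTILITY[(action, s)]
        for q, p_q in P_ROOT.items()
        for s in WORLD_STATES
    )


def decide(chain):
    return max(ACTIONS, key=lambda a: expected_utility(a, chain))


# The same choice falls out whether S describes its history as Q -> R -> S,
# as a sequence of its own past selves, or with no "creator" labels at all.
assert decide(["Q?", "R", "S"]) == decide(["S_t0", "S_t1", "S_t2"]) == decide([])
print(decide([]))
```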
Yes, any physical system could be subverted by a sufficiently unfavorable environment. You wouldn’t want to prove perfection. The thing you would want to prove would be more along the lines of “will this system become at least somewhere around as capable of recovering from any disturbances, and of going on to achieve a good result, as it would be if its designers had thought specifically about what to do in case of each possible disturbance?” (Ideally, this category of “designers” would also sort of bleed over in a principled way into the category of “moral constituency”, as in CEV.) Which, in turn, would require a proof of something along the lines of “the process is highly likely to make it to the point where it knows enough about its designers to be able to mostly duplicate their hypothetical reasoning about what it should do, without anything going terribly wrong”.
We don’t know what an appropriate formalization of something like that would look like. But there is reason for considerable hope that such a formalization could be found, and that it would be simple enough for an implementation of it to be checked. This is because a few other aspects of decision-making which were previously mysterious, and which could only be discussed qualitatively, have had powerful and simple core mathematical descriptions discovered for cases where simplifying modeling assumptions perfectly apply. Shannon information was discovered for the informal notion of surprise (with the assumption of independent, identically distributed symbols from a known distribution). Bayesian decision theory was discovered for the informal notion of rationality (with assumptions like perfect deliberation and side-effect-free cognition). And Solomonoff induction was discovered for the informal notion of Occam’s razor (with assumptions like a halting oracle and a taken-for-granted choice of universal machine). These simple conceptual cores can then be used to motivate and evaluate less-simple approximations for situations where the assumptions about the decision-maker don’t perfectly apply. For the AI safety problem, the informal notions (for which the core mathematical descriptions would need to be discovered) would be a bit more complex—like the “how to figure out what my designers would want to do in this case” idea above. Also, you’d have to formalize something like our informal notion of how to generate and evaluate approximations, because approximations are more complex than the ideals they approximate, and you wouldn’t want to have to directly verify the safety of any more approximations than you had to. (But note that, for reasons related to Rice’s theorem, you can’t (and therefore shouldn’t want to) lay down universally perfect rules for approximation in any finite system.)
— Steve Rayhawk
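For reference, and not part of the quotation: the three “simple cores” Rayhawk names have standard textbook statements, each exact only under the idealizing assumptions he lists. A sketch of those statements, in conventional notation (the symbols I, H, M, and ℓ are the usual choices, nothing specific to the quote):

```latex
% Shannon: the surprise of a symbol x, and the expected surprise (entropy) of a source.
\[ I(x) = -\log_2 p(x), \qquad H(X) = -\sum_x p(x)\,\log_2 p(x) \]

% Bayesian decision theory: update beliefs by Bayes' rule, then maximize expected utility.
\[ p(s \mid e) = \frac{p(e \mid s)\,p(s)}{\sum_{s'} p(e \mid s')\,p(s')}, \qquad
   a^{*} = \arg\max_a \sum_s p(s \mid e)\, U(a, s) \]

% Solomonoff induction: a universal prior over sequences, weighting every program p
% whose output on a universal prefix machine U begins with x by its length \ell(p).
\[ M(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)} \]
```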