Clarification of the AI Reflection Problem

Consider an agent A, aware of its own embedding in some lawful universe, able to reason about itself and use that reasoning to inform action. By interacting with the world, A is able to modify itself or construct new agents, and using these abilities effectively is likely to be an important component of AGI. Our current understanding appears to be inadequate for guiding such an agent’s behavior, for (at least) the following reason:

If A does not believe “A’s beliefs reflect reality,” then A will lose interest in creating further copies of itself, improving its own reasoning, or performing natural self-modifications. Indeed, if A’s beliefs don’t reflect reality then creating more copies of A or spending more time thinking may do more harm than good. But if A does believe “A’s beliefs reflect reality,” then A runs immediately into Gödelian problems: for example, does A become convinced of the sentence Q = “A does not believe Q”? We need to find a way for A to have some confidence in its own behavior without running into these fundamental difficulties with reflection.
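For concreteness, Q can be obtained from the diagonal lemma; writing Bel_A for some formalization of "A believes" (the particular formalization is left open here, so this is only an illustrative rendering), Q is a fixed point of the form:

```latex
% Q asserts its own disbelief by A: a fixed point supplied by the diagonal lemma.
Q \;\leftrightarrow\; \neg\,\mathrm{Bel}_A\big(\ulcorner Q \urcorner\big)
```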

This problem has been discussed occasionally at Less Wrong, but I would like to clarify and lay out some examples before trying to start in on a resolution.

Gödel Machines

The Gödel machine is a formalism described by Schmidhuber for principled self-modification. A Gödel machine is designed to solve some particular object level problem in its allotted time. I will describe one Gödel machine implementation.

The initial machine A has an arbitrary object level problem solver. Before running the object level problem solver, however, A spends half of its time enumerating pairs of strings (A’, P); for each one, if A’ is a valid description of an agent and P is a proof that A’ does better on the object level task than A, then A transforms into A’.
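To make the structure concrete, here is a rough Python sketch of this loop (a sketch of the variant described above, not Schmidhuber's actual formalism). The helpers is_valid_agent, checks_proof_of_improvement, and solve_object_level are hypothetical stand-ins, and the sketch glosses over the fact that a real rewrite would hand control to A's new code rather than continuing the old loop.

```python
import itertools
import time


def enumerate_string_pairs(alphabet="01"):
    """Enumerate all pairs of strings over `alphabet`, in order of total length."""
    for n in itertools.count(0):
        for k in range(n + 1):
            for s1 in itertools.product(alphabet, repeat=k):
                for s2 in itertools.product(alphabet, repeat=n - k):
                    yield "".join(s1), "".join(s2)


def godel_machine(agent, total_time, is_valid_agent,
                  checks_proof_of_improvement, solve_object_level):
    """Rough sketch of the Godel machine variant described above.

    Hypothetical helpers (not real library calls):
      is_valid_agent(a)                      -- does string a describe an agent?
      checks_proof_of_improvement(p, a2, a1) -- is p a proof that a2 does better
                                                than a1 on the object-level task?
      solve_object_level(agent, budget)      -- run the agent's object-level solver.
    """
    deadline = time.time() + total_time / 2.0

    # Phase 1: spend half the time searching for provably better rewrites.
    for candidate, proof in enumerate_string_pairs():
        if time.time() >= deadline:
            break
        if is_valid_agent(candidate) and checks_proof_of_improvement(proof, candidate, agent):
            agent = candidate  # a provably beneficial rewrite was found
            # (In a real Godel machine, control would pass to the new agent here.)

    # Phase 2: fall back to the (possibly rewritten) object-level problem solver.
    return solve_object_level(agent, total_time / 2.0)
```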

Now suppose that A’s initial search for self-modifications is inefficient: a new candidate agent A’ has a more efficient proof checker, and so is able to examine more proofs before defaulting to the original object level problem solver. Intuitively, A’ is an improvement over A, and this is (a very mild version of) the sort of self-modification which we would like our agents to perform. For example, a human is easily able to recognize that it can achieve higher utility by modifying itself to think faster. But we will see that under most conditions A cannot prove that A’ is an improvement, because A is not able to assume its own consistency (as we do in our justification that A’ is an improvement).

Suppose A’ considers some new self-modification (A″, P) not considered by A. We would like A to be able to prove that if A’ executes this self-modification—if P is a proof that A″ is an improvement over A’—then A″ is really an improvement over A’ (and therefore over A). We say that a pair (A″, P) is safe if it has this property, i.e., if either P is not a proof that A″ is an improvement over A’, or if A″ is really an improvement over A’.
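In symbols, with Proves and Better as informal stand-ins for the relevant formalized predicates (these names are mine, not standard notation), safety is just an implication:

```latex
% (A'', P) is safe iff: if P proves that A'' improves on A', then A'' really does.
\mathrm{Safe}(A'', P) \;\equiv\;
  \Big( \mathrm{Proves}\big(P,\ \mathrm{Better}(A'', A')\big)
        \;\rightarrow\; \mathrm{Better}(A'', A') \Big)
```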

A can prove that particular pairs are safe, but suppose A’ considers many candidate modifications (A″, P1), (A″, P2), …, (A″, Pk). It is now much harder for A to show that all of these self-modifications are safe without relying directly on the consistency of A’s axiom system (which A can never assume, by the incompleteness theorem). If the set of considered proofs is sufficiently rich, then by a bounded version of Löb’s theorem A can only prove that all of these self-modifications are safe if it can prove directly that A″ is an improvement over A’. But in that case A might as well dispense with theorem proving altogether and just commit to transforming into A″ after exhausting all other self-modifications.
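For reference, Löb's theorem says that a sufficiently strong, recursively axiomatized theory T with provability predicate □_T can prove the soundness schema □_T φ → φ only for sentences φ it can already prove outright; this is what blocks A from proving the blanket safety claim without, in effect, proving the improvements directly:

```latex
% Löb's theorem: for any sentence \varphi,
%   if T \vdash (\Box_T \varphi \rightarrow \varphi), then T \vdash \varphi.
T \vdash \big(\Box_T \varphi \rightarrow \varphi\big)
  \;\Longrightarrow\; T \vdash \varphi
```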

(Originally this post was centered around a formal version of the above argument, but although I believe the argument works it turns out to be simultaneously very difficult and very unenlightening. Here is the main difficulty: in order to justify converting into A’, A only needs to prove that A’ won’t execute any destructive modifications if A’ doesn’t make some other modification before reaching any of the new proposals, and this additional assumption is very difficult to work with. In general, it is very hard to prove anything at all about a Gödel machine’s behavior, which is unfortunate because a Gödel machine must do exactly this in order to self-modify.)

This informal argument does not show that a Gödel machine will not execute any self-improvements, but it does show that if it does execute any non-trivial self-improvements then it must appeal to some justification other than the intuitive one. Other modifications, for example any modification which is so radical as to change the set of self-modifications considered in a more complicated way or to change the architecture fundamentally, seem to be even harder to justify than the one described here. Gödel machines may be able to self-modify, but this argument shows that this conclusion is not supported by intuitive arguments about the existence of beneficial self-modifications.

Creating New Agents

We can also try a non-reflective technique for designing an “optimal problem solver,” in the spirit of Hutter search. Again, suppose we have a particular object level problem and some allotted time.

Consider the agent A which behaves as follows: in the first half of the allotted time, it enumerates pairs (B, P), and if B is a valid description of an algorithm and P is a proof that B has utility at least U on the object level problem (using at most half of the allotted time), then A stores the pair (B, U). After half of the allotted time, the pair (B, U) with maximal U is selected, and then the algorithm B is run using the remaining time.
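Here is a similarly rough Python sketch of this two-phase agent; is_valid_algorithm, proof_of_utility, and run_algorithm are again hypothetical stand-ins, and the pair enumeration can be the same one used in the earlier sketch.

```python
import time


def best_proven_agent(total_time, enumerate_string_pairs,
                      is_valid_algorithm, proof_of_utility, run_algorithm):
    """Rough sketch of the non-reflective agent A described above.

    Hypothetical helpers (not real library calls):
      is_valid_algorithm(b)          -- does string b describe an algorithm?
      proof_of_utility(p, b, budget) -- if p proves that b achieves utility at
                                        least U within `budget` time, return U;
                                        otherwise return None.
      run_algorithm(b, budget)       -- run b on the object-level problem.
    """
    deadline = time.time() + total_time / 2.0
    best_utility, best_algorithm = None, None

    # Phase 1: record the best provable utility bound found in half the time.
    for candidate, proof in enumerate_string_pairs():
        if time.time() >= deadline:
            break
        if is_valid_algorithm(candidate):
            utility = proof_of_utility(proof, candidate, total_time / 2.0)
            if utility is not None and (best_utility is None or utility > best_utility):
                best_utility, best_algorithm = utility, candidate

    # Phase 2: run the algorithm with the largest proven utility bound.
    if best_algorithm is None:
        return None  # no provably good algorithm was found in time
    return run_algorithm(best_algorithm, total_time / 2.0)
```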

This solution seems like it may avoid some of the trouble encountered by the Gödel machine, in exchange for being somewhat less ambitious in its goals. But we will see that the same reflection problem arises.

If the allotted time is very large, and if self-modification is an important component of general intelligence, then we expect A to select a new algorithm B which is itself capable of (provably useful) self-modification. In principle A could simply run long enough to become a very powerful non-self-modifying agent; but it may be that finding a very powerful agent B directly is much more difficult than proceeding through a sequence of successively more powerful agents, each intelligent enough to design the next but not to leap directly to B.

But suppose A creates a new algorithm B which uses the same axioms as A to prove theorems, and uses these theorems to guide its own self-modification. Again, by Löb’s theorem we see that A cannot guarantee that B’s self-modifications are really beneficial. This means that A is unable to rule out the possibility that B executes a very bad self-modification (or creates a new agent with very bad behavior); but without ruling out the possibility that B performs very badly, A cannot run B instead of some more conservative algorithm which does not do any self-modification.

This does not rule out the possibility that A creates a new algorithm B’ which pursues self-modification using a similar strategy but with a more restricted set of axioms, one which can be proven consistent from within A. And in turn, if B’ takes the same form as A, then any algorithm C which B’ creates or modifies into will use a still smaller set of axioms, by the same argument. Eliezer describes the proof-theoretic strength of such an agent as a ‘telomere’: with every self-modification it becomes weaker, and for Gödelian reasons it cannot be increased without sacrificing provable correctness.

(The distinction between the Gödel machine and creating new agents is inessential; for example, we could imagine a Gödel machine attempting to self-modify by improving the efficiency of its proof searcher and then using the additional time to consider some rewrites which are justified by a more restricted axiom system.)

Possible Solutions

There are some natural solutions which have been considered.

Long Telomeres

Recognizing that an agent must decrease its proof-theoretic strength when performing a self-modification, we may just try to construct an initial agent with substantial proof-theoretic strength. For example, if we believe that PA probably has sufficient logical expressiveness to facilitate useful superintelligence, we may create an agent with axioms PA + “PA is consistent” + “PA + ‘PA is consistent’ is consistent” and so on. We may iterate this process a finite number of times to allow for a few self-modifications, or perhaps add an entire countable chain so that the original AI can decide how much proof-theoretic strength its first descendant needs, or perhaps add a few countable chains so that the agent can self-modify a handful of times before needing to make that decision, and so on.
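In symbols, the iterated theories form a tower, one consistency "link" per anticipated self-modification (Con(T) is the arithmetized consistency statement for T):

```latex
T_0 = \mathrm{PA}, \qquad
T_{n+1} = T_n + \mathrm{Con}(T_n), \qquad
T_\omega = \bigcup_{n < \omega} T_n
% Starting the agent at T_\omega (or with several such chains) corresponds to
% letting its descendants choose how much proof-theoretic strength to keep.
```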

The primary objection to this solution is philosophical: humans appear to be able to reason about self-modification much more robustly (or at least with better results) than an AI using any of the logical frameworks discussed so far, and it is quite ambiguous whether or not a system with ‘long telomeres’ can ever justify becoming a system which is actually able to reason about self-modification correctly (or, if we have doubts about humans’ capacities, whether such a system even exists). Minimally, if we want to understand the form an AI is likely to take after self-modification, we need to understand how such a system would eventually escape from problems that currently seem inescapable; without such understanding the resulting behavior is hard to get a handle on (although we might still be satisfied, if that behavior is backed up by formal justification produced by the AI).

Lesser objections focus on the additional complexity required to maintain such a long telomere. Each additional consistency axiom added to a system essentially requires recursively specifying the entire agent as it exists so far, and the complexity of the mathematical manipulations required seems to grow rapidly. This objection seems fairly unimportant, and so if the philosophical objections were dealt with I would consider building agents with very strong proof systems a feasible path forward.

Beliefs rather than Proofs

Humans seem to be able to avoid Gödelian problems with reflection, but they may do this by having confidence in statements which they cannot prove, and by manipulating uncertainty about logical facts rather than dealing in absolute assertions of truth. The set of “statements I strongly believe” seems to differ from the set of “statements I can prove” in several important ways; for example, if I am confident of a statement at one point in time I am not committed to remaining confident indefinitely.

I suspect that a satisfactory theory of reflective beliefs would go far towards resolving the AI reflection problem, but this is hardly more than a restatement of the problem. Translating from proofs to beliefs does not immediately resolve the problems with self-reference, it just replaces them with subtly different issues. For example, it is easy to see that an agent should not believe that its own beliefs are well-calibrated on all questions, and so we must attempt to formalize a version of the self-consistency hypothesis which is weak enough to be defensible but still strong enough to escape the shortcomings described above. I hope to make several posts on this topic in the near future.
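One way to see the calibration point, offered here as an illustration rather than anything argued above: writing P_A for a formalized "probability assigned by A", the diagonal lemma yields a probabilistic analogue of the liar, and an agent that was certain its own probability assignments were accurate on this sentence would run into the same kind of trouble as before:

```latex
% A probabilistic liar: G says that A assigns it probability below 1/2.
G \;\leftrightarrow\; \Big( P_A\big(\ulcorner G \urcorner\big) < \tfrac{1}{2} \Big)
```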

Ignorance

Understanding this issue may not be necessary for building safe AGI. Indeed, self-modification may ultimately play a minimal role in intelligence, or we may settle for executing self-modifications using weaker justification. However, if we accept the usual arguments about the importance of FAI, then we should not be satisfied with this solution.

The importance of self-modification is an open question which has received much discussion here and elsewhere. It is worth adding that, to the extent that we are concerned with influencing probable outcomes for humanity, the highest leverage scenarios seem to be those in which self-modification tends to result in positive feedback loops and takeoff (if we assign such scenarios significant weight). That is, in such scenarios we should be particularly cautious about building self-modifying systems, but there is also a much greater imperative to understand how to design safe and stable AI.

Standard arguments surrounding FAI (particularly, the importance of early AI goal systems and the fragility of humane value) suggest that agents should have high degrees of confidence in a change before executing it. If an agent’s beliefs are not correctly related to reality, the resulting behavior may be as dangerous as if the agent’s values were modified; for example, incorrect beliefs about logical structure could cause the agent to fail to preserve its own values in subsequent rewrites, and incorrect beliefs about the relationship between value and reality could be similarly damaging.