I played around a bit with quining solutions, and came up with the following, which solves this toy problem fairly well:
def AI_Q2(n)(p):
    Look at the first n^^^n theorems of PA(n).
    if one of them says "for each q, there exists an m such that
            p(q) != 'self-destruct' and (p(q) = 'double down' implies AI_Q2(m)(q) = 'double down')":
        double down
    else:
        take winnings
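In Python-ish form, the same thing looks roughly like this. Here `pa_n_proves` and `safety_sentence` are stand-ins for the theorem enumeration and for building the quoted (quined) formula, so this is only the shape of the algorithm, not something you could actually run:

def up_arrow(a, arrows, b):
    # Knuth's a ^...^ b with the given number of arrows; purely illustrative,
    # since n^^^n is astronomically large for any n >= 3.
    if arrows == 1:
        return a ** b
    result = a
    for _ in range(b - 1):
        result = up_arrow(a, arrows - 1, result)
    return result

def AI_Q2(n, pa_n_proves, safety_sentence):
    def player(p):
        limit = up_arrow(n, 3, n)        # look at the first n^^^n theorems of PA(n)
        sentence = safety_sentence(p)    # the quoted formula above, with a quined AI_Q2
        if pa_n_proves(n, limit, sentence):
            return 'double down'
        return 'take winnings'
    return player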
AI_Q2(3) should double down on AI_Q2(4) as well as AI_Q2(4^^^^4), and also on variants that are provably equivalent to itself, like speed- or size-optimized versions of itself. I sent this to Benja last night and he responded with (in part) “You’ve succeeded in making me uncertain whether quining approaches could actually be directly useful in solving the real problem (though it still doesn’t seem likely).”
I agree it doesn’t seem likely the real problem can be solved with quining approaches, but I’m posting this solution here in case anyone can extend the idea further. At the very least it should be interesting to understand why quining approaches don’t work on the real problem. What relevant aspect of the real problem isn’t being captured by this toy problem?
The main relevant aspect of the real problem not captured by quining is the continuity of self-replacement with environmental action. All the quining approaches Marcello and I considered involved treating the decisions to “modify self” and “do something in environment” differently. This means you can’t use actions to construct an environmental copy of yourself out of transistors, because the environmental copy wouldn’t get the privileged-special-case treatment. More generally, it seems to imply an arbitrary Cartesian boundary with different levels of mathematical trust for physical processes inside and outside the boundary. And it implies that you can’t just frame desirable actions in terms of their predicted consequences.
Edit: This doesn’t mean quining approaches aren’t worth exploring. They might get us closer to a solution than contemplating, say, approaches that don’t quine.
Let me back up from my other response. It just occurred to me that UDT1.1 (with a proof system instead of “math intuition”) already constitutes a quining solution to AI reflection.
Consider an AI facing the choice of either creating a copy of itself, which will then go out into the world, or doing nothing. Unfortunately, due to Löbian problems it can’t prove that a copy of itself won’t do something worse than nothing. But UDT1.1 can be thought of as optimizing over an input/output mapping that is implemented by all of its copies. For each possible mapping, it proves a utility value starting from the assumption that it implements that map (which implies that its copies and provably equivalent variants also implement that map). So it doesn’t need to prove (from scratch) that its copy won’t do something worse than nothing.
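To make that concrete, the loop I have in mind looks roughly like this, with `enumerate_io_maps` and `prove_utility_given` standing in for the actual proof search (so this is a sketch of the idea, not a real implementation):

def udt1_1(world, inputs, outputs, enumerate_io_maps, prove_utility_given):
    # `prove_utility_given(world, mapping)` searches, within some proof-length
    # bound, for a theorem of the form "if S implements `mapping`, utility = u",
    # and returns u (or None if no proof is found).
    best_map, best_u = None, float('-inf')
    for mapping in enumerate_io_maps(inputs, outputs):
        u = prove_utility_given(world, mapping)
        if u is not None and u > best_u:
            best_map, best_u = mapping, u
    # Every copy and provably equivalent variant runs this same computation, so
    # the assumption "I implement best_map" automatically covers the copies too.
    return best_map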
(Requested reply.)
I think there’d be a wide variety of systems where, so long as the “parent” agent knows the exact strategy that its child will deploy in all relevant situations at “compile time”, the parent will trust the child. The point of the Löb problem is that it arises when we want the parent to trust the child generally, without knowing exactly what the child will do. For the parent to precompute the child’s exact actions implies that the child can’t be smarter than the parent, so it’s not the kind of situation we would encounter when, e.g., Agent A wants to build Agent B, which has more RAM and faster CPUs than Agent A while still sharing Agent A’s goals. This, of course, is the kind of “agents building agents” scenario that I am most interested in.
That’s not the case.
Consider a resource-constrained variant of the original game:
Each program receives as input the round number n and the next program, encrypted by repeated xoring with the output of a monotonic computable function f(n).
Let T_f(n) be the runtime of the fastest algorithm that computes f. Note that T_f(n) is monotonically increasing.
At round n, the current program has a time limit T(n) = C_0 + C_1 * T_f(n). Quirrell never submits programs that exceed the time limit.
In this variant of the game, you have to submit the first program, which has to obey a time limit T(0).
The initial program will not be able to compute the relevant strategy of any of its successors (except at most finitely many of them).
And yet, my quining solution still works (just add a decryption step), and I think Wei Dai’s solution also works.
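If the encryption is read as XORing the next program against the output of f(n), cycled to the right length (my reading of the scheme above), the added decryption step is tiny:

from itertools import cycle, islice

def decrypt_next_program(ciphertext, n, f):
    # Undo the XOR encryption by XORing again with the output of f(n).
    # Computing f(n) once costs about T_f(n), which is why a budget of
    # C_0 + C_1 * T_f(n) leaves plenty of room for this step.
    keystream = islice(cycle(f(n)), len(ciphertext))
    return bytes(c ^ k for c, k in zip(ciphertext, keystream))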
It seems to me that the variant with time limits has a simple quining solution:
1) Get the time limit as input.
2) Spend almost all available time trying to prove in some formal theory that the next program is equivalent to this one.
3) Double down if a proof is found, otherwise take winnings.
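Concretely, something like this, with `try_prove_equivalent` standing in for the bounded proof search in whatever formal theory we pick:

import time

def quining_player(time_limit, next_program, my_source, try_prove_equivalent):
    # Leave a small safety margin so we never exceed the limit.
    deadline = time.monotonic() + 0.95 * time_limit
    # `try_prove_equivalent` is assumed to search for a proof that `next_program`
    # is equivalent to `my_source`, giving up (and returning False) once the
    # deadline passes.
    if try_prove_equivalent(next_program, my_source, deadline):
        return 'double down'
    return 'take winnings'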
That’s similar to your idea, right? I’m not sure if it addresses Eliezer’s objection, because I no longer understand his objection...
Yes.
During the April 2013 workshop I rephrased this as the principle “The actions and sensor values of the offspring should not appear outside of quantifiers”. Justification: if we have to reason case-by-case about all possible actions, all possible sensor values, and all possible states of the world, then “child’s sensor state space” x “child’s action state space” x “world state space” must be no larger than “the number of cases we can consider”, which in general implies a logarithmically smaller child. I call this the Vingean Principle.
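Spelled out with illustrative notation (a sketch of the counting argument, not the workshop’s exact formulation): if the parent can explicitly check at most $N$ cases, then case-by-case reasoning requires
$$|S_{\text{child}}| \cdot |A_{\text{child}}| \cdot |W| \le N,$$
and since $|S_{\text{child}}|$ and $|A_{\text{child}}|$ grow exponentially in the child’s description length $b$, this forces $b = O(\log N)$, i.e. a child whose description is logarithmically smaller than the number of cases the parent can check.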
Right, so we want an AI design that does the right thing “naturally”, without a special case for “modify self”. It seems to me that UDT may already be such a design, if we had a solution to logical uncertainty (what I called “math intuition module”) to plug into it. When I wrote about logical uncertainty being important for AGI, I had in mind the problem of not being able to prove P!=NP and hence not being able to prove whether EU(search for polynomial solution to 3-SAT) < EU(not search for polynomial solution to 3-SAT). But if we had an AI that can handle that kind of decision problem, why wouldn’t it also be able to design and build another AI that uses a proof system that it can’t prove to be sound? Is there a reason to think that “AI reflection” isn’t just a special case of “logical uncertainty”?
(Of course UDT would probably build successors/copies of itself that also use “math intuition modules” rather than “proof systems”, but we can imagine forcing a UDT-AI to choose between a selection of other AIs that are based on different proof systems, in which case it would choose based on an analysis of the trade-off between power and risk of inconsistency/unsoundness, much as a human would make the choice.)
I can’t visualize how saying that a result is “uncertain” could make the Löbian issue go away—do you have a concrete visualization for this, and if so, can you sketch it even if very briefly?
Possibly relevant: No Turing machine can assign finite probability to the halting sequence.
(Regardless of whether UDT1.1 is a correct quining solution to AI Reflection, we ultimately still need something that is not so “brute force”, so let’s continue this thread.)
I guess I’m saying that there’s not necessarily a Löb’s theorem or equivalent for a system that (does something like) assigns probabilities to mathematical statements instead of proving/disproving them from a fixed set of axioms. Do you think there is such an equivalent to Löb’s theorem, and if so what might it say?
I’m not seeing why this is relevant. Can you explain?
In the Q&A to the open problems talk, he said that if logical uncertainty is the solution to the problem, then he and Marcello didn’t figure out how, and they did look in that specific place.
I think that it would be useful to try to make some progress on how to reason probabilistically about logical statements before trying to use it as the solution to other problems, because my picture of how it could work is so unclear that I feel unable to draw any conclusions about what it can or cannot do with any kind of certainty. For example, I do not know a condition on a probability distribution over mathematical statements that would prevent the spurious proof problem—when I read about playing chicken with the universe, it sounded very familiar, but I can’t now see a way to apply it in the case of a probability distribution. (Then again, if we do expect to find a solution to the Löbian problems in that area, perhaps we should be working on logical uncertainty now, not on Löbian problems...)
If it weren’t for concerns about AI risk, I would be advocating working on logical uncertainty instead of Löbian problems, because it seems to me that an AI using “math intuition” may not face a Löbian problem to begin with, or if it does, the problem and/or solution may be different enough from that of an AI using a proof system that any work we do now will not be of much help.
(From the perspective of reducing AI risk, maybe it’s good to focus people’s attention on Löbian problems, unless it leads them to start thinking that they ought to work on logical uncertainty due to reasoning like the above...)
Hm; are you saying you think FAI can probably be implemented without solving the logical uncertainty problem? My current visualization is that both the logical uncertainty problem and the safe rewrite problem will need to be solved—among others—and the reason I’ve been thinking about the rewrite problem is that using proof-based techniques, I at least had a grasp of how to think about it. (And my intuition has so far been that logical uncertainty will probably have diagonalization problems in a different guise when we actually have a concrete enough proposal to test this, so it seemed useful to think about the rewrite problem in the better-understood context, when trying to find in-principle solutions to the problems.)
No, I was saying the opposite, and also hinting that working on and publishing results about logical uncertainty may be bad for AI risk because it helps AGI, not just FAI (whereas the AI reflection problem seems to be more specific to FAI). There’s also a discussion about this issue in the decision theory mailing list archives.
Can someone explain how an “always return 0.5” bot is not a counterexample?
He’s talking about a probability distribution over the set of all bitstrings. You can’t assign p=0.5 (or any other constant real number) to each of them: that doesn’t sum to 1.0.
Do you gain anything by parameterizing on PA(n)? It seems like you could get a sound algorithm by replacing PA(n) in your definition by T, for any trustworthy system of axioms T. My first instinct was to pick T=ZFC, while you pick a dynamically changing subset of PA(omega), i.e. PA + a countably infinite sequence of consistency assertions.
Certainly PA(omega) is trustworthy, but there is nothing particularly canonical about it (you could equally well pick PA(omega)+CON(PA(omega))), and I don’t think there is any loss of generality in picking the axioms up-front.
I think you’re right. Parameterizing on PA(n) doesn’t really seem to do anything except perhaps give an illusion of self-improvement when there’s really a hard limit of PA(omega) that the AI can’t go beyond. I guess I chose it mostly because the OP was doing something similar.
If you just have PA(omega) and, say, parameterize on the proof limit instead, will your program double down on versions of itself that use weaker formal theories, not mentioned in your program?
Yes. If you call it with the argument AI_Q2_T (i.e. the same algorithm but with T instead of PA(omega)), it will accept if it can find a proof that, for all q,
AI_Q2_T(q) != 'self-destruct'
(which is easy) and
AI_Q2_T(q) == 'double down' ==> AI_Q2(q) == 'double down'
The latter condition expands to
PA(omega) |- someFormula ==> T |- someFormula
which may be quite easy to show, depending on T.
Basically, the quining approach avoids trouble from the second incompleteness theorem by using relative consistency instead of consistency.
It seems to me that the two programs don’t have the same “someFormula”, because each program’s formula uses a quined description of that program, but not the other one. Or maybe I’m wrong. Can you expand on what you mean by “someFormula”?
Oops.
I should be explicit that I don’t assign strong confidence to either of the following intuitions, but FWIW:
One point where, intuitively, this may be more likely to fail than my trick is when you try to replace yourself with something that’s expected to be better than you at maximizing its expected utility with bounded computational resources. If you want something better than yourself, it seems rather limiting to require that it won’t do anything you wouldn’t do. (That’s an argument against quining approaches being the right thing, not for my approach being that.) On the other hand, an intuition in the other direction would be that if your original AI has a parameter n such that, for n → infinity, it converges to an expected utility maximizer, then your approach might still lead to an increasing sequence of better and better EU maximizers.
Very nice! It looks like your solution also doubles down on simple quining solutions, by Löb’s theorem.
ETA: this makes me suspect that the problem might have a Löbian solution, just like the PD had a quining one and a Löbian one. Can’t figure it out yet, though.
OK, here’s my current candidate Löbian solution. It might be wrong and I don’t have any proof for it yet.
Let’s call a program “safe” if Quirrell can’t make it self-destruct in any number of steps. For example, a quining solution is safe, and that fact is easily provable in PA.
Let N be a large number. Our program a(p) will enumerate all PA proofs of length up to N, looking for a proof that the statement “a is safe” implies “p is safe”. If such a proof is found, the program doubles down; otherwise it takes winnings.
The above program will double down on quining programs, and also on logical equivalents of itself. I don’t yet know whether it will double down on versions of itself with higher values of N, or on alternate implementations of the same algorithm (which isn’t the same as logical equivalents of itself, due to quining).
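In rough Python-ish form, with `proves_implication` standing in for the bounded PA proof search, `my_source` for a quined description of the program, and programs treated as source strings (all of these are assumptions of the sketch):

N = 10 ** 6  # some large proof-length bound

def a(p, my_source, proves_implication):
    # `proves_implication(premise, conclusion, max_len)` enumerates PA proofs of
    # length up to max_len, looking for one of "premise implies conclusion".
    if proves_implication('%s is safe' % my_source, '%s is safe' % p, max_len=N):
        return 'double down'
    return 'take winnings'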
ETA: ok, sorry, scratch this for now. I can’t even figure out how to prove that the program is safe, and maybe it isn’t.