Problem: Lucy is using L in there, applied to her own formal system! That cannot work!
Lucy can assume a priori that it only takes “good” actions and use this to prove properties about Lucy_2, without having to prove the consistency of its own formal system.
I might be missing something, but it seems to me that this thing that you people call the “Löbian obstacle” isn’t really about self-modification (it occurs even if Lucy = Lucy_2), but about internal consistency: it’s the old question of “how can I know that I’m not insane?” Well, you can’t know, and there fundamentally is no way around it: Gödel’s incompleteness theorem, Löb’s theorem, Rice’s theorem, etc. are all formalized restatements of this fact.
So, why can’t Lucy just assume that it is sane? It seems to work well for humans.
Lucy reasons using formal mathematical logic, and a formal mathematical logic cannot “assume it is sane” because of Löb’s theorem. As long as Lucy isn’t self-modifying, all is fine and dandy. She just proves theorems and acts accordingly, it’s not like she has a built-in compulsion to prove her own sanity.

However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2. Now, since Lucy2 is at least as smart as Lucy and preferably even smarter, Lucy has a hard time predicting the specifics of how Lucy2 will behave. What Lucy wants is to reason generally, i.e. prove something along the lines of “Lucy2 will only do good things” rather than analyze what Lucy2 will do on a case-by-case basis (Yudkowsky & Herreshoff call this “the Vingean principle”, after Vernor Vinge). Unfortunately, Lucy is barred from proving such a general theorem.

Now, we can try to hack Lucy’s system of reasoning to accept self-modifications on weaker grounds than proving they are “good” (e.g. allow proving their goodness in a stronger formal system). But that would involve treating self-modification as something ontologically different from all other phenomena in the universe, which doesn’t seem to be a good thing. We would like Lucy to be naturalistic, i.e. to reason about herself and her descendants in the same way she reasons about everything else in the universe. Otherwise, she will be handicapped.

For example, imagine that instead of self-modifying, she considers breaking into another, much more powerful computer and programming that computer to be Lucy2. Such a scenario should be treated just like a self-modification, but since it doesn’t fall under the definition of self-modification hard-coded by the programmers, Lucy is back to using her “general purpose” reasoning mode rather than her special self-modification mode. The same goes for encountering another agent: we can try to introduce “agent” as a special ontological category, but defining it would be hard.
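To make the obstacle explicit, here is the standard statement in LaTeX notation (just a sketch of the usual argument; Prov_T and Good are schematic names I am using for illustration):

```latex
% Löb's theorem: for any sentence P,
%   if T proves "Prov_T(P) -> P", then T proves P.
\text{If } T \vdash \mathrm{Prov}_T(\ulcorner P \urcorner) \rightarrow P,
\ \text{then } T \vdash P.

% The general trust schema Lucy would need about Lucy2's proofs:
\forall a:\ \mathrm{Prov}_T(\ulcorner \mathrm{Good}(a) \urcorner) \rightarrow \mathrm{Good}(a)

% By Löb's theorem, T proves an instance of this schema only if it already
% proves Good(a) outright; instantiating it with a sentence T refutes would
% make T inconsistent. So the schema is unprovable in full generality, which
% is exactly what blocks "Lucy2 will only do good things".
```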
It seems to work well for humans.
I have a suspicion that human brains use something like the evidence logic I’m proposing. It feels like the concept of “proof” is much less native to human thinking than the concept of “argument for/against”.
She just proves theorems and acts accordingly, it’s not like she has a built-in compulsion to prove her own sanity.
I suppose she could reason that if she can’t prove that her formal system is consistent, then she can’t trust that the actions she can prove to be “good” are actually “good”. Or, even if she trusts her present self, she can’t trust her future selves, even if they are identical (well, they can never be completely identical if they accumulate memories).
Your argument works even for Lucy = Lucy_2.
However, once she begins to self-modify (or encounters another agent of similar complexity), she has to start proving theorems about Lucy2.
It seems to me that the problem you are trying to solve is how to coordinate with other agents. This has been investigated in the context of the program-equilibrium prisoner’s dilemma, and the only satisfactory solution I know is to form cliques based on computable under-approximations of functional equivalence between the program source codes (no, I don’t consider the FairBot/PrudentBot proposal a satisfactory solution, for reasons I’ve explained here). Generalize this to arbitrary problems and you get UDT or something like it.
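To illustrate what I mean by a computable under-approximation of functional equivalence, here is a minimal Python sketch (illustrative names only); the crudest such under-approximation is literal source-code equality, i.e. the classic “clique” construction:

```python
# Minimal "clique" sketch for the program-equilibrium prisoner's dilemma.
# The crudest computable under-approximation of functional equivalence is
# literal source-code equality: cooperate exactly with programs whose source
# matches our own, defect against everything else.

def read_own_source() -> str:
    # A program-equilibrium bot is assumed to have access to its own source.
    with open(__file__) as f:
        return f.read()

def clique_bot(opponent_source: str) -> str:
    """Return 'C' (cooperate) or 'D' (defect) given the opponent's source code."""
    if opponent_source == read_own_source():  # "same source" implies "same function"
        return "C"
    return "D"

# Two exact copies cooperate with each other and cannot be exploited, but any
# semantically equivalent rewrite falls outside the clique -- hence the interest
# in richer, still-computable equivalence checks.
```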
I suppose she could reason that if she can’t prove that her formal system is consistent, then she can’t trust that the actions she can prove to be “good” are actually “good”. Or, even if she trusts her present self, she can’t trust her future selves...
Lucy works by trying to prove a given action is “good” and doing it if she’s successful. She doesn’t need to prove that “there is no proof A is not good” in order to do A. However, it’s true she can’t trust her future selves. Indeed, it can be a problem even without self-modification. For example, if someone drops an anvil on Lucy, Lucy cannot reason that “I should dodge the anvil because if I get destroyed I won’t do good things in the future”. She can still have a hard-coded aversion to being destroyed, but this seems like a hack.
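In toy Python, the decision rule I have in mind looks roughly like this (a sketch only; “provability” is stubbed out as membership in a finite set of theorems, standing in for Lucy’s bounded proof search):

```python
# Toy sketch of a proof-based agent: perform an action iff "Good(action)" is
# provable. "Provable" is stubbed as membership in a finite set of theorems,
# standing in for bounded proof search in Lucy's formal system.

def choose_action(actions, theorems):
    for a in actions:
        # Lucy only needs a positive proof of Good(a); she never has to prove
        # "there is no proof that a is not good" before acting.
        if f"Good({a})" in theorems:
            return a
    return None  # nothing provably good: Lucy does nothing

# The anvil problem: dodging is chosen only if some theorem says so directly.
# "If I'm destroyed I won't do good things later" is not usable, because it
# would require Lucy to trust her own future proofs.
print(choose_action(["dodge", "stand_still"], {"Good(make_tea)"}))  # -> None
```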
...Generalize this to arbitrary problems and you get UDT or something like it.
UDT assumes a system of assigning probabilities to mathematical sentences given a (possibly inconsistent) theory. For UDT to admit naturalistic trust, it needs that system to overcome the Löbian obstacle, like the system I propose here. Indeed, the system I propose is intended as a component of UDT or UIM.
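The interface I am assuming looks something like this (my own sketch of the type signature, purely for illustration):

```python
# Sketch of the assumed interface: a "logical probability" assignment maps a
# theory (a set of sentences, possibly inconsistent) and a sentence to a
# degree of belief in [0, 1]. Purely illustrative.

from typing import Callable, FrozenSet

Sentence = str
Theory = FrozenSet[Sentence]
LogicalProbability = Callable[[Theory, Sentence], float]

def trivial_assignment(theory: Theory, sentence: Sentence) -> float:
    """Placeholder: fully believe explicit axioms, stay at 1/2 otherwise."""
    return 1.0 if sentence in theory else 0.5
```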
“I should dodge the anvil because if I get destroyed I won’t do good things in the future”. She can still have a hard-coded aversion to being destroyed, but this seems like a hack.
I think it’s worse than that. If your argument is correct, the type of AI you are describing can’t plan because it can’t trust its future selves to follow through with the plan, even if doing so wouldn’t require commitment. It would be limited to immediate utility maximization.
It seems that your argument proves too much. In fact, I think that Lucy can still prove useful properties about its successors even without proving the consistency of their, and therefore its own, proof system.
UDT assumes a system of assigning probabilities to mathematical sentences given a (possibly inconsistent) theory. For UDT to admit naturalistic trust, it needs that system to overcome the Löbian obstacle, like the system I propose here. Indeed, the system I propose is intended as a component of UDT or UIM.
That’s not how UDT works in my understanding, though I’ll admit I’m not an expert on the subject. Do you have a reference?
I think it’s worse than that. If your argument is correct, the type of AI you are describing can’t plan because it can’t trust its future selves to follow through with the plan, even if doing so wouldn’t require commitment.
We can avoid this problem if Lucy performs an action once it is the first in a provably “good” sequence of actions. This would allow her to dodge the anvil if it interferes with her immediate plans, but not on the general grounds of “a universe with Lucy is a better universe since Lucy is doing good things”.
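In the same toy notation as before (provability again stubbed as a finite set of theorems), the modified rule would be roughly:

```python
# Toy sketch of the modified rule: Lucy performs an action iff it is the first
# step of some sequence of actions that is provably "good" as a whole.

from itertools import product

def choose_action(actions, theorems, max_len=3):
    for n in range(1, max_len + 1):
        for seq in product(actions, repeat=n):
            if f"Good({', '.join(seq)})" in theorems:
                return seq[0]  # commit only to the first step of the sequence
    return None

# Dodging is chosen because it starts a provably good *sequence*, not because
# of a general "the universe is better with Lucy in it" argument.
theorems = {"Good(dodge, make_tea)"}
print(choose_action(["dodge", "stand_still", "make_tea"], theorems))  # -> "dodge"
```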
That’s not how UDT works in my understanding, though I’ll admit I’m not an expert on the subject. Do you have a reference?
I don’t have a reference which discusses UDT and the Löbian obstacle together. You can find a description of what is, AFAIK, the “latest and greatest” UDT here. UDT considers proofs in a formal system. If this system suffers from the Löbian obstacle, this will lead to the kind of problems I discuss here. In fact, I haven’t stated it explicitly, but I think of Lucy as a UDT agent: she considers possible actions as logical counterfactuals and computes expected utility based on that.
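To make that concrete, here is a toy rendering of the proof-based-UDT step I have in mind (my own sketch, not the formalism from the linked write-up; provability is again stubbed as a finite set of theorems):

```python
# Toy rendering of a proof-based UDT step: for each action, look for theorems
# of the form "Agent() = a -> Utility() = u" (provable logical counterfactuals
# about the agent's own output), then pick the action with the best provable
# consequence.

import re

def udt_choice(actions, theorems):
    best_action, best_utility = None, float("-inf")
    for a in actions:
        for t in theorems:
            m = re.fullmatch(rf"Agent\(\) = {re.escape(a)} -> Utility\(\) = (-?\d+)", t)
            if m and int(m.group(1)) > best_utility:
                best_action, best_utility = a, int(m.group(1))
    return best_action

theorems = {"Agent() = dodge -> Utility() = 10",
            "Agent() = stand_still -> Utility() = -100"}
print(udt_choice(["dodge", "stand_still"], theorems))  # -> dodge
```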