Good to hear, and thanks for the reassurance :-) And yeah, I know all too well the problem of having too little time to write something polished, and I certainly prefer having the discussion in fairly raw form to not having it at all.
One possibility is that MIRI’s arguments actually do look that terrible to you
What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.
I’m not really sure what you mean by a “concrete instantiation”. I can think of concrete toy models, of AIs using logical reasoning which know an exact description of their environment as a logical formula, which can’t reason in the way I believe is what we want to achieve, because of the Löbian obstacle. I can’t write down a self-rewriting AGI living in the real world that runs into the Löbian obstacle, but that’s because I can’t write down any AGI that lives in the real world.
My reason for thinking that the Löbian obstacle may be relevant is that, as mentioned in the interview, I think that a real-world seed FAI will probably use (something very much like) formal proofs to achieve the high level of confidence it needs in most of its self-modifications. I feel that formally specified toy models + this informal picture of a real-world FAI are as close to thinking about concrete instantiations as I can get at this point.
I may be wrong about this, but it seems to me that when you think about concrete instantiations, you look towards solutions that reason about the precise behavior of the program they’re trying to verify: reasoning like “this variable gets decremented in each iteration of this loop, and when it reaches zero we exit the loop, so we won’t loop forever”. But heuristically, while it seems possible to reason about the program you’re creating in this way, our task is different. We need to ensure that we’re creating a program which creates a program which creates a program which goes out to learn about the world and look for the most efficient way to use transistors it finds in the external environment to achieve its goals, and we want to verify that those transistors won’t decide to blow up the world. It seems clear to me that this is going to require reasoning of the type “the program I’m creating is going to reason correctly about the program it is creating”, which is the kind of reasoning that runs into the Löbian obstacle, rather than the kind of reasoning applied by today’s automated verification techniques.
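To make the "precise behavior" style of reasoning concrete, here is a minimal Python sketch of the loop argument quoted above. The function name `countdown_with_variant` and the runtime assertion are illustrative assumptions, not anything from the thread or from a particular verification tool; a real verifier would prove the variant condition statically rather than check it while running.

```python
def countdown_with_variant(n: int) -> int:
    """Count down from n to zero, checking a termination argument at runtime.

    Hypothetical illustration: the 'variant' is the counter itself, a
    non-negative integer that strictly decreases on every iteration, so
    the loop cannot run forever.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    counter = n
    iterations = 0
    while counter != 0:
        previous = counter
        counter -= 1
        iterations += 1
        # Variant check: still non-negative, and strictly smaller than before.
        assert 0 <= counter < previous
    return iterations
```

This is reasoning about one fixed, fully known program. It says nothing about whether a program that this program later writes will itself reason correctly, which is the gap the paragraph above is pointing at.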
Writing this, I’m not too confident that this will be helpful in getting the idea across. I hope the face-to-face with Paul will help, perhaps also with translating your intuitions into a language that better matches the way I think about things.
I think that the point above would be really helpful to clarify, though. This seems to be a recurring theme in my reactions to your comments on MIRI’s arguments. For example, there was that LW conversation you had with Eliezer where you pointed out that it’s possible to verify properties probabilistically in more interesting ways than running a lot of independent trials, and I go, yeah, but how is that going to help with verifying whether the far-future descendant of an AI we build now, when it has entire solar systems of computronium to run on, is going to avoid running simulations which by accident contain suffering sentient beings? It seems that achieving confidence that this far-future descendant will behave in a sensible way, without unduly restricting the details of how it is going to work, is going to need fairly abstract reasoning, and the sort of tools you point to don’t seem to be capable of this or to extend in some obvious way to dealing with this.
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe
I’m not sure I understand what you’re saying here, but I’m not convinced that this is the sort of reasoning I’d use.
I’m fairly sure that the reason your brain goes “it would be safe if we only allow self-modifications when there’s a proof that they’re safe” is that you believe that if there’s a proof that a self-modification is safe, then it is safe—I think this is probably a communication problem between us rather than you actually wanting to use different reasoning. But again, hopefully the face-to-face with Paul can help with that.
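For readers following along, the principle "if there's a proof that a self-modification is safe, then it is safe" has the shape of a soundness (reflection) schema, and Löb's theorem is what limits a formal system's ability to prove that schema about itself. A standard statement is sketched below; the symbols $T$ and $\Box_T$ are notation chosen here for illustration and do not appear in the thread.

```latex
% Löb's theorem, for a theory T extending Peano Arithmetic,
% where \Box_T P abbreviates "P is provable in T".
\[
  \text{If } T \vdash \Box_T P \rightarrow P, \text{ then } T \vdash P.
\]
% Consequence: if T proved \Box_T P \rightarrow P for every sentence P,
% then taking P = \bot would give T \vdash \bot. So a consistent T
% cannot prove its own soundness schema.
```

Roughly, an agent reasoning in $T$ that wants to trust a successor which also reasons in $T$ seems to need exactly the implication $\Box_T P \rightarrow P$ for the relevant safety statements $P$, and this is the step the theorem blocks.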
I don’t think that “whole brain emulations can safely self-modify” is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: “The argument in favor of studying Löb’s theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. [...]”
Thanks for the reply! Thing is, I don’t think that ordinary human reasoning should run into that obstacle, and the “ordinary” is just to exclude humans reasoning by writing out formal proofs in a fixed proof system and having these proofs checked by a computer. But I don’t think that ordinary human reasoning can achieve the level of confidence an FAI needs to achieve in its self-rewrites, and the only way I currently know of for an FAI to plausibly reach that confidence is through logical reasoning. I thought that “whole brain emulations can safely self-modify” might describe our disagreement because that would explain why you think that human reasoning not being subject to Löb’s theorem would be relevant.
My next best guess is that you think that even though human reasoning can’t safely self-modify, its existence suggests that it’s likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb’s theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
I can imagine that that might be the case, but I don’t think it’s terribly likely. I can more easily imagine that there would be something completely different from both human reasoning and logical reasoning, or something quite similar to normal logical reasoning but not subject to Löb’s theorem. But if so, how will we find it? Unless essentially every kind of reasoning except human reasoning can easily be made safe, it doesn’t seem likely that AGI research will hit on a safe solution automatically. MIRI’s current research seems to me like a relatively promising way of trying to search for a solution that’s close to logical reasoning.
When I say “failure to understand the surrounding literature”, I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.
Ok, I think I probably don’t understand this yet, and making a post about it sounds like a good plan!
Sorry for ducking most of the technical points; as I said, I hope that talking to Paul will resolve most of them.
I don’t have time to reply to all of this right now, but since you explicitly requested a reply to:
My next best guess is that you think that even though human reasoning can’t safely self-modify, its existence suggests that it’s likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb’s theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
The answer is yes, I think this is essentially right although I would probably want to add some hedges to my version of the statement (and of course the usual hedge that our intuitions probably conflict at multiple points but that this is probably the major one and I’m happy to focus in on it).
No problem, and hope so as well.
Thanks!