Glad to see you’re doing well, Benja :)
Here’s a concrete way you could try to get stable self-modification:
Suppose for concreteness that we have a C program, call it X, and that within the C program there is an array called “world_state” of length M and a floating point number called “utility”. A simple instantiation of X would look something like:
    while (true) {
        action = chooseAction(world_state);
        world_state = propagateWorldState(world_state, action);
        utility = calculateUtility(world_state);
    }
We would like to consider modifications to X where we replace chooseAction with some new method chooseAction2 to get a program X2. Suppose we want to ensure some condition such as: from the current world state, if we use X2 instead of X, then after some finite period of time the sequence of utilities we get from using chooseAction2 will always be larger than the corresponding sequence if we had used chooseAction. Abusing notation a bit, this is the same as verifying the statement:
“there exists N such that for all n > N, utility2(n) > utility(n)”
[although note that utility2 and utility have fairly complicated descriptions if you actually try to write them out]. Now I agree that reasoning about this for arbitrary choices of chooseAction and chooseAction2 will be quite difficult (probably undecidable although I haven’t proved that). But the key point is that I get to choose chooseAction2, and there are many decision procedures that can prove such a statement in special cases. For instance, I could partition the space of world states into finitely many pieces, write down a transition function that over-approximates the possible transitions (for instance, by having a transition from Piece1 to Piece2 if any element of Piece1 can transition to any element of Piece2). Then I only need to reason about finite automata and those are trivially decidable.
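To make the shape of that kind of decision procedure concrete, here is a minimal sketch in C of such a check, written against the program X above. Everything in it is an illustrative assumption rather than anything proposed in this thread: the partition size K, the burn-in bound standing in for N, the over-approximated transition tables, and the per-piece utility bounds would all have to be supplied by some outside analysis, and the check is only a sufficient condition for the quoted statement.

    #include <stdbool.h>

    /* Illustrative sketch only: abstract the world into K pieces and
     * over-approximate the transitions of X and X2 on those pieces.
     * If, after BURN_IN steps, every utility the abstraction allows
     * under chooseAction2 exceeds every utility it allows under
     * chooseAction, then "there exists N such that for all n > N,
     * utility2(n) > utility(n)" holds (a sufficient condition only). */

    #define K        8   /* number of pieces in the partition of world states */
    #define BURN_IN 16   /* candidate value for N                             */

    /* trans[i][j] is true if ANY concrete state in piece i can transition
     * to ANY concrete state in piece j (an over-approximation). */
    static bool trans1[K][K];   /* transitions under chooseAction  */
    static bool trans2[K][K];   /* transitions under chooseAction2 */

    static double util_lo[K];   /* lower bound on utility over piece i */
    static double util_hi[K];   /* upper bound on utility over piece i */

    /* Image of a set of abstract states under one transition step. */
    void step(bool t[K][K], const bool in[K], bool out[K]) {
        for (int j = 0; j < K; j++) out[j] = false;
        for (int i = 0; i < K; i++)
            if (in[i])
                for (int j = 0; j < K; j++)
                    if (t[i][j]) out[j] = true;
    }

    /* Over-approximation of the pieces the system can occupy at any
     * time n >= BURN_IN when started from piece s0. */
    void late_states(bool t[K][K], int s0, bool late[K]) {
        bool cur[K] = {false}, nxt[K];
        cur[s0] = true;
        for (int n = 0; n < BURN_IN; n++) {          /* frontier at time BURN_IN */
            step(t, cur, nxt);
            for (int i = 0; i < K; i++) cur[i] = nxt[i];
        }
        for (int i = 0; i < K; i++) late[i] = cur[i];
        for (int n = 0; n < K; n++) {                /* saturate to a closure */
            step(t, late, nxt);
            for (int i = 0; i < K; i++) late[i] |= nxt[i];
        }
    }

    /* Sufficient check: the worst late utility under chooseAction2 beats
     * the best late utility under chooseAction. */
    bool dominates_eventually(int s0) {
        bool late1[K], late2[K];
        late_states(trans1, s0, late1);
        late_states(trans2, s0, late2);

        double best1 = -1e300, worst2 = 1e300;
        for (int i = 0; i < K; i++) {
            if (late1[i] && util_hi[i] > best1) best1 = util_hi[i];
            if (late2[i] && util_lo[i] < worst2) worst2 = util_lo[i];
        }
        return worst2 > best1;
    }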
You could argue that this proof system is fairly weak, but again, the AI gets to tailor its choices of chooseAction2 to be easy to reason about. You could also argue that an AI that could only create instances of chooseAction2 that could be reasoned about as above would be severely hampered, but I think this overlooks the fact that such techniques have been extraordinarily successful at reasoning about fairly complex systems.
In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. Now I agree that there exist choices of chooseAction2 that are safe but not provably safe and this strategy disallows all such modifications. But that doesn’t seem so restrictive to me.
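For concreteness, a bare-bones sketch of that proof-carrying rule might look like the following. The Proof struct, the function names, and the stub checker are all invented for illustration; the stub accepts nothing and merely stands in for a real checker for whatever fixed proof system one settles on.

    #include <stddef.h>

    typedef struct World World;                 /* opaque stand-in for the world model  */
    typedef int Action;
    typedef Action (*Chooser)(const World *);   /* type of chooseAction / chooseAction2 */

    typedef struct {
        const char *derivation;   /* encoded formal proof that the candidate is safe */
        size_t      length;
    } Proof;

    /* Stand-in proof checker: a real one would verify that `derivation`
     * proves the fixed safety statement about `candidate` in the chosen
     * proof system.  This placeholder accepts nothing. */
    static int check_proof(const Proof *p, Chooser candidate) {
        (void)p; (void)candidate;
        return 0;
    }

    static Chooser current_chooser;   /* initially the original chooseAction */

    /* The only path by which the running program ever swaps in a new
     * chooser: no proof, no modification. */
    int adopt_if_proved(Chooser candidate, const Proof *proof) {
        if (!check_proof(proof, candidate))
            return 0;                  /* unproved modifications are refused outright */
        current_chooser = candidate;   /* checker accepted the proof, so install it */
        return 1;
    }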
Finally, I agree that such a naive proof strategy as “doing the following trivial self-modification is safe because the modified me will only do things that it proves won’t destroy the world, thus it won’t destroy the world” does not work. I’m not proposing that. The proof system clearly has to do some actual work. But your argument seems to me to be saying “this extremely naive strategy doesn’t work, let’s develop an entire subfield of logic to try to make it work” whereas I advocate “this extremely naive strategy doesn’t work, let’s use a non-naive approach instead, and look at the relevant literature on how to solve similar problems”.
I also agree that the idea of “logical uncertainty” is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty. But it seems like a mistake to me to tackle such problems without at least one, and preferably both, of: an understanding of the surrounding literature; experience with concrete instantiations of the problems at hand.
Sorry for being curmudgeonly there—I did afterwards wish that I had tempered that. The thing is that when you write something like
“I also agree that the idea of ‘logical uncertainty’ is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty.”
that sounds to me like you’re painting MIRI as working on these topics just because it’s fun, and supporting its work by arguments that are obviously naive to someone who knows the field, and that you’re supporting this by arguments that miss the point of what MIRI is trying to say. That’s why I found the example of program analysis so annoying—people who think that the halting problem means that program analysis is impossible really are misinformed (actually it’s Rice’s theorem that’s relevant, but someone with this misconception wouldn’t be aware of that), both about the state of the field and about why these theorems say what they say. E.g., yes, of course your condition is undecidable as long as there is any choice f(s) of chooseAction2(s) that satisfies it; proof: let chooseAction2(s) be the program that checks whether chooseAction2(s) satisfies your criterion, and if yes return chooseAction(s), otherwise f(s). That’s how these proofs always go, and of course that doesn’t mean that there are no programs that are able to verify the condition for an interesting subclass of chooseAction2’s; the obvious interesting example is searching for a proof of the condition in ZFC, and the obvious boring example is that there is a giant look-up table which decides the condition for all choices of chooseAction2(s) of length less than L.
One possibility is that MIRI’s arguments actually do look that terrible to you, but that this is because MIRI hasn’t managed to make them clearly enough. I’m thinking this may be the case because you write:
“In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. [...]”
First, that’s precisely the “obvious” strategy that’s the starting point for MIRI’s work.
Second, yes, Eliezer’s arguing that this isn’t good enough, but the reason isn’t that there are some safe modifications which aren’t provably safe. The work around the Löbian obstacle has nothing to do with trying to work around undecidability. (I will admit that for a short period at the April workshop I thought this might be a good research direction, because I had my intuitions shaken by the existence of Paul’s system and got overly optimistic, but Paul quickly convinced me that this was an unfounded hope, and in any case the main work around the Löbian obstacle was never really related to this.) MIRI’s argument definitely isn’t that “the above algorithm can’t decide for all chooseAction2 whether they’re safe, therefore it probably can’t decide it for the kind of chooseAction2 we’re interested in, therefore it’s unacceptable”. If that’s how you’ve understood the argument, then I see why you would think that the program analysis example is relevant. (The argument is indeed that it seems to be unable to decide safety for the chooseAction2 we’re interested in, but not because it’s unable to decide this for any generic chooseAction2.)
Third, you seem to imply that your proposal will only take safe actions. You haven’t given an argument for why we should think so, but the implication seems clear: You’re using a chooseAction that is obviously safe as long as it doesn’t rewrite itself, and it will only accept a proposed modification if it comes with a proof that it is safe, so if it does choose to rewrite itself then its successor will in fact be safe as well. Now I think this is fine reasoning, but you don’t seem to agree:
“Finally, I agree that such a naive proof strategy as ‘doing the following trivial self-modification is safe because the modified me will only do things that it proves won’t destroy the world, thus it won’t destroy the world’ does not work. I’m not proposing that. The proof system clearly has to do some actual work.”
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe, but you don’t think the AI should be able to use the same sort of reasoning. Eliezer’s argument is that this is in fact reasoning you want to use when building a self-improving AI: Yes, you can reason in more detail about how the AI you are building works, but this AI_1 will build an AI_2 and so on, and when proving that it’s ok to build AI_1 you don’t want to reason in detail about how AI_1,000,000 is going to work (which is built using design principles you don’t understand, by AI_999,999 which is much smarter than you); rather, you want to use general principles to reason that, because of the way AI_1,000,000 came to be, it has to be safe (because AI_999,999 only builds safe AIs, because it was built by AI_999,998 which only builds safe AIs...). But it’s not only you who needs to reason like that, because you don’t know and aren’t smart enough to comprehend AI_1,000,000’s exact design; AI_1, which also isn’t that smart, will need to be able to use the same sort of reasoning. Hence, the interest in the Löbian obstacle.
There are caveats to add to this and parts of your comment I haven’t replied to, but I’m running into the same problem as you with your original comment in this thread, having already spent too much time on this. I’d be happy to elaborate if useful. For my part, I’d be interested in your reply to the other part of my comment: do you think I have localized our disagreement correctly?
Oh, one last point that I shouldn’t skip over: I assume the point about MIRI lacking “an understanding of the surrounding literature” refers to the thing about being tripped up at the July workshop by not knowing Gaifman’s work on logical uncertainty well enough. If so, I agree that that was an avoidable fail, but I don’t think it’s indicative of always ignoring the relevant literature or something like that. I’ll also admit that I still haven’t familiarized myself further with Gaifman’s work, but that’s because I’m not currently focusing on logical uncertainty, and I intend to do so in the future.
Don’t worry, I wasn’t offended :)
I don’t think that MIRI is working on these topics just because they are fun, and I apologize for implying that. I should note here that I respect the work that you and Paul have done, and as I said at the beginning I was somewhat hesitant to start this discussion at all, because I was worried that it would have a negative impact on either your or Paul’s reputation (regardless of whether my criticisms ended up being justified) or on our relationship. But in the end I decided that it was better to raise my objections in fairly raw form and deal with any damage later.
“One possibility is that MIRI’s arguments actually do look that terrible to you [...]”
What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.
“You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe, but you don’t think the AI should be able to use the same sort of reasoning. [...]”
I’m not sure I understand what you’re saying here, but I’m not convinced that this is the sort of reasoning I’d use. It seems like Paul’s argument is similar to yours, though, and I’m going to talk to him in person in a few days, so perhaps the most efficient thing will be for me to talk to him and then report back.
“Do you think I have localized our disagreement correctly?”
I don’t think that “whole brain emulations can safely self-modify” is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: “The argument in favor of studying Lob’s theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. Standard epistemic hygiene calls for trying to produce concrete examples to motivate this line of work. I have not seen this done by MIRI, and all of the examples I can think of, both from having done AI and verification work myself, and from looking at what my colleagues do in program analysis, points in the squarely opposite direction.”
When I say “failure to understand the surrounding literature”, I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.
Sorry for ducking most of the technical points, as I said, I hope that talking to Paul will resolve most of them.
Good to hear, and thanks for the reassurance :-) And yeah, I know all too well the problem of having too little time to write something polished, and I certainly prefer having the discussion in fairly raw form to not having it at all.
“What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.”
I’m not really sure what you mean by a “concrete instantiation”. I can think of concrete toy models, of AIs using logical reasoning which know an exact description of their environment as a logical formula, which can’t reason in the way I believe is what we want to achieve, because of the Löbian obstacle. I can’t write down a self-rewriting AGI living in the real world that runs into the Löbian obstacle, but that’s because I can’t write down any AGI that lives in the real world.
My reason for thinking that the Löbian obstacle may be relevant is that, as mentioned in the interview, I think that a real-world seed FAI will probably use (something very much like) formal proofs to achieve the high level of confidence it needs in most of its self-modifications. I feel that formally specified toy models + this informal picture of a real-world FAI are as close to thinking about concrete instantiations as I can get at this point.
I may be wrong about this, but it seems to me that when you think about concrete instantiations, you look towards solutions that reason about the precise behavior of the program they’re trying to verify—reasoning like “this variable gets decremented in each iteration of this loop, and when it reaches zero we exit the loop, so we won’t loop forever”. But heuristically, while it seems possible to reason about the program you’re creating in this way, our task is to ensure that we’re creating a program which creates a program which creates a program which goes out to learn about the world and look for the most efficient way to use transistors it finds in the external environment to achieve its goals, and we want to verify that those transistors won’t decide to blow up the world; it seems clear to me that this is going to require reasoning of the type “the program I’m creating is going to reason correctly about the program it is creating”, which is the kind of reasoning that runs into the Löbian obstacle, rather than the kind of reasoning applied by today’s automated verification techniques.
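(For readers unfamiliar with that style, here is a toy C illustration of the loop-variant reasoning being described; the function and the comments are mine, not anything from the discussion.)

    /* Toy illustration of the loop-variant reasoning described above:
     * `remaining` is non-negative and strictly decreases on every
     * iteration, and the loop exits when it reaches zero, so the loop
     * terminates after at most n iterations. */
    int sum_first_n(int n) {
        int total = 0;
        int remaining = n;      /* the decreasing variant */
        while (remaining > 0) {
            total += remaining;
            remaining -= 1;     /* decremented each iteration */
        }
        return total;           /* n + (n-1) + ... + 1 for n >= 0 */
    }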
Writing this, I’m not too confident that this will be helpful in getting the idea across. Hope the face-to-face with Paul will help, perhaps also with translating your intuitions into a language that better matches the way I think about things.
I think that the point above would be really helpful to clarify, though. This seems to be a recurring theme in my reactions to your comments on MIRI’s arguments—e.g. there was that LW conversation you had with Eliezer where you pointed out that it’s possible to verify properties probabilistically in more interesting ways than running a lot of independent trials, and I go, yeah, but how is that going to help with verifying whether the far-future descendant of an AI we build now, when it has entire solar systems of computronium to run on, is going to avoid running simulations which by accident contain suffering sentient beings? It seems that to achieve confidence that this far-future descendant will behave in a sensible way, without unduly restricting the details of how it is going to work, is going to need fairly abstract reasoning, and the sort of tools you point to don’t seem to be capable of this or to extend in some obvious way to dealing with this.
“I’m not sure I understand what you’re saying here, but I’m not convinced that this is the sort of reasoning I’d use.”
I’m fairly sure that the reason your brain goes “it would be safe if we only allow self-modifications when there’s a proof that they’re safe” is that you believe that if there’s a proof that a self-modification is safe, then it is safe—I think this is probably a communication problem between us rather than you actually wanting to use different reasoning. But again, hopefully the face-to-face with Paul can help with that.
“I don’t think that ‘whole brain emulations can safely self-modify’ is a good description of our disagreements. [...] The argument in favor of studying Lob’s theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. [...]”
Thanks for the reply! Thing is, I don’t think that ordinary human reasoning should run into that obstacle, and the “ordinary” is just to exclude humans reasoning by writing out formal proofs in a fixed proof system and having these proofs checked by a computer. But I don’t think that ordinary human reasoning can achieve the level of confidence an FAI needs to achieve in its self-rewrites, and the only way I currently know how an FAI could plausibly reach that confidence is through logical reasoning. I thought that “whole brain emulations can safely self-modify” might describe our disagreement because that would explain why you think that human reasoning not being subject to Löb’s theorem would be relevant.
My next best guess is that you think that even though human reasoning can’t safely self-modify, its existence suggests that it’s likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb’s theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
I can imagine that that might be the case, but I don’t think it’s terribly likely. I can more easily imagine that there would be something completely different from both human reasoning and logical reasoning, or something quite similar to normal logical reasoning but not subject to Löb’s theorem. But if so, how will we find it? Unless essentially every kind of reasoning except human reasoning can easily be made safe, it doesn’t seem likely that AGI research will hit on a safe solution automatically. MIRI’s current research seems to me like a relatively promising way of trying to search for a solution that’s close to logical reasoning.
“When I say ‘failure to understand the surrounding literature’, I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. [...]”
Ok, I think I probably don’t understand this yet, and making a post about it sounds like a good plan!
“Sorry for ducking most of the technical points, as I said, I hope that talking to Paul will resolve most of them.”
No problem, and hope so as well.
I don’t have time to reply to all of this right now, but since you explicitly requested a reply to:
“My next best guess is that you think that even though human reasoning can’t safely self-modify, its existence suggests that it’s likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb’s theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?”
The answer is yes, I think this is essentially right although I would probably want to add some hedges to my version of the statement (and of course the usual hedge that our intuitions probably conflict at multiple points but that this is probably the major one and I’m happy to focus in on it).
Thanks!
Thought experiment:
We have an AI which controls a robotic arm inside a box. The box also contains various cooking tools and ingredients to make cake, and a big red button that kills a puppy.
We prefer cake to no cake, and we prefer our cake to be delicious and moist, but above all we prefer the companion cu-… puppy to stay alive.
Therefore, we implement in the AI a “puppy safety module” (PSM) which vetoes any course of action proposed by the planning module if it determines that there is any non-negligible probability of the red button being depressed.
When the PSM can’t make an accurate prediction, it always errs on the safe side: things like throwing tools at the box walls, where they could bounce unpredictably and hit the red button, don’t get approved.
Athena, the Greek goddess of wisdom, justice and math, came down from Mt. Olympus and debugged the code for the PSM. She was also so kind as to check all the compilers, OS and hardware for bugs and glitches.
Do you agree that with this AI the puppy is most likely safe?
Now, let’s assume that the cakes made by this AI aren’t very good. We don’t really know how to program it to make better cakes; we try our best reinforcement learning algorithms, but it doesn’t improve much. Then somebody suggests we try allowing the AI to modify its own code:
In addition to sending motor commands to the arm, the planning module can also edit a copy of the AI’s code and eventually replace the whole program with this successor. Of course, these actions must still be approved by the PSM, but they are otherwise unrestricted: in principle they can modify the PSM, add a bypass, or get rid of it entirely.
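For concreteness, a minimal sketch of this setup is below. Every name and the trivial stub bodies are invented for illustration; in particular, the stub PSM simply vetoes every proposed rewrite because it has no way to analyze a successor program, which corresponds to one of the outcomes asked about just below.

    #include <stdio.h>

    typedef enum { MOTOR_COMMAND, CODE_REWRITE } ActionKind;

    typedef struct {
        ActionKind  kind;
        int         motor;        /* stand-in for a joint command          */
        const char *new_program;  /* stand-in for proposed successor code  */
    } Action;

    /* Planning module stub: alternates between moving the arm and
     * proposing a rewrite of the whole program. */
    static Action plan_next_action(int step) {
        Action a;
        if (step % 2 == 0) {
            a.kind = MOTOR_COMMAND; a.motor = step; a.new_program = NULL;
        } else {
            a.kind = CODE_REWRITE; a.motor = 0; a.new_program = "/* successor */";
        }
        return a;
    }

    /* Puppy safety module stub: errs on the safe side by vetoing every
     * rewrite, since it has no way to analyze the successor. */
    static int psm_approves(const Action *a) {
        return a->kind == MOTOR_COMMAND;
    }

    int main(void) {
        for (int step = 0; step < 6; step++) {
            Action a = plan_next_action(step);
            if (!psm_approves(&a)) {
                printf("step %d: PSM veto\n", step);
                continue;
            }
            if (a.kind == MOTOR_COMMAND)
                printf("step %d: move arm (%d)\n", step, a.motor);
            else
                printf("step %d: install successor program\n", step);
        }
        return 0;
    }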
Do you think that this AI is unsafe for the puppy?
Do you think it is safe but unable to improve itself in effective ways?
Since the PSM was designed without self-modification in mind, “safe but unable to improve itself in effective ways”.
(Not sure how this thought experiment helps the discussion along.)
Can you please explain why?
Suppose that in the recesses of the code there is an instantiation of the bubble sort algorithm. The planner proposes to replace it with, say, merge sort. Do you think that the PSM would generally disapprove of such a change?
Do you think it would approve it, but still be unable to approve the modifications that would be needed for significant improvement?
MIRI claims to have had an important insight on AI design (the so-called “Löbian obstacle”) that experts in relevant fields (AI, model checking, automated theorem proving, etc.) didn’t have. MIRI attempted to communicate this insight, but so far the experts have mostly ignored MIRI’s claims or denied that they are likely to be important and relevant.
What is the most likely explanation for that? It seems that we can narrow it to two hypotheses:
A) MIRI’s insight is really relevant and important to AI design, but communication with the experts failed because of some problem on MIRI’s side, or on the experts’ side (e.g. stubbornness, stupidity) or both (e.g. excessively different backgrounds).
B) MIRI is mistaken about the value of their insight (possible psychological causes may include confirmation bias, Dunning–Kruger effect, groupthink, overconfident personalities, etc.).
I would say that, barring evidence to the contrary, hypothesis B is the most likely explanation.
I don’t believe these options are exhaustive. “Relevant and valuable” are subjective and time-varying. The published work might be interesting and useful down the line, but not help the problems that AI researchers are working on right now.
It usually takes a few years for the research community to assimilate strange new ideas—sometimes much more than a few years. This isn’t due to a failure on anybody’s part—it’s due to the fact that scientists pick problems where they have a reasonable prospect of success within a few years.
I would give MIRI at least a decade or two before evaluating whether their work had any mainstream traction.
MIRI’s stated goals are similar to those of mainstream AI research, and MIRI’s approach in particular includes as subgoals the goals of research fields such as model checking and automated theorem proving.
Do you claim that MIRI is one or two decades ahead of mainstream researchers?
If the answer is no, then how does MIRI (or MIRI donors) evaluate now whether these lines of work are valuable towards advancing their stated goals?
Research has both ultimate goals (“machines that think”) and short-term goals (“machines that can parse spoken English”). My impression is that the MIRI agenda is relevant to the ultimate goal of AI research, but has only limited overlap with the things people are really working on in the short term. I haven’t seen MIRI work that looked directly relevant to existing work on theorem proving or model checking. (I don’t know much about automated theorem proving, but do know a bit about model checking.)
It’s not a matter of “ahead”. Any research area is typically a bunch of separate tracks that proceed independently and eventually merge together or have interconnections. It might be several decades before the MIRI/self-modifying-AI track merges with the main line of AI or CS research. That’s not necessarily a sign anything is wrong. It took decades of improvement before formal verification or theorem proving became part of the computer science toolkit. I would consider MIRI a success if it follows a similar trajectory.
I can’t imagine any really credible assurance that “this basic research is definitely useful,” for any basic research. The ultimate goal “safe self modifying AI” is too remote to have any idea if we’re on the right track. But if MIRI, motivated by that goal, does neat stuff, I think it’s a safe bet that (A) the people doing the work are clueful, and (B) their work was at least potentially useful in dealing with AI risks. And potentially useful is the best assurance anybody can ever give.
I’m a computer systems guy, not a theorist or AI researcher, but my opinion of MIRI has gradually shifted from “slightly crankish” to “there are some interesting questions here and MIRI might be doing useful work on them that nobody else is currently doing.” My impression is a number of mainstream computer scientists have similar views.
Eliezer recently gave a talk at MIT. If the audience threw food at the stage, I would consider that evidence for MIRI being crankish. If knowledgeable CS types showed up and were receptive or interested, I would consider that a strong vote of confidence. Anybody able to comment?
It’s definitely not a goal of mainstream AI, and not even a goal of most AGI researchers, to create self-modifying AI that provably preserves its goals. MIRI’s work on this topic doesn’t seem relevant to what mainstream AI researchers want to achieve.
Zooming out from MIRI’s technical work to MIRI’s general mission, it’s certainly true that MIRI’s failure to convince the AI world of the importance of preventing unFriendly AI is Bayesian evidence against MIRI’s perspective being correct. Personally, I don’t find this evidence strong enough to make me think that preventing unFriendly AI isn’t worth working on.
Also, two more reasons why MIRI isn’t that likely to produce research AI researchers will see as a direct boon to their field: One, stuff that’s close to something people are already trying to do is more often already worked on; the stuff that people aren’t working on seems more important for MIRI to work on. And two, AGI researchers are particularly interested in results that get us closer to AGI, and MIRI is trying to work on topics that can be published about without bringing the world closer to AGI.
I wouldn’t say MIRI has tried very hard yet to communicate about the Lobian obstacle to people in the relevant fields. E.g. we haven’t published about the Lobian obstacle in a journal or conference proceedings.
Part of the reason for that is that we don’t expect experts in these fields to be very interested in it. Work on the Lobian obstacle is aimed at a better understanding of strongly self-modifying systems, which won’t exist for at least 15 years, and probably longer than that.
I agree the AI community won’t be very interested. But it might be worth sending it to some theoretical computer science venue—STOC, say—instead. If nothing else, it would give useful information about how receptive academics are to the topic.
I look forward to a clear, detailed explanation of MIRI’s thinking on this subject. In particular this counter-intuitive result:
“The argument is indeed that it seems to be unable to decide safety for the chooseAction2 we’re interested in, but not because it’s unable to decide this for any generic chooseAction2.”
deserves some technical elaboration.
Mark, have you read Eliezer’s article about the Löbian obstacle, and what was your reaction to it?
I’m in the early stages of writing up my own work on the Löbian obstacle for publication, which will need to include its own (more condensed, rather than expanded) exposition of the Löbian obstacle; but I liked Eliezer’s article, so it would be helpful to know why you didn’t think it argued the point well enough.
I have, although formal logic is not my field so please excuse me if I have misunderstood it.
Eliezer does not demonstrate that overcoming the Löbian obstacle is necessary in the construction of tiling agents; he rather assumes it. No form of program verification is actually required if you do not use the structure of a logical agent. Consider, for example, the GOLEM architecture [1], which is a form of tiling agent that proceeds by direct experimentation (simulation). It does not require an ability to prove logical facts about the soundness and behavior of its offspring, just an ability to run them in simulation. Of course logical program analysis helps in focusing in on the situations which give rise to differing behavior between the two programs, but there are no Gödelian difficulties there (even if there were, you could fall back on probabilistic sampling of environments, searching for setups which trigger different results).
The MIRI argument, as I understand it, is: “a program which tries to predict the result of modifying itself runs into a Löbian obstacle; we need to overcome the Löbian obstacle to create self-modifying programs with steadfast goal systems.” (I hope I am not constructing a strawman in simplifying it as such.) The problem comes from the implicit assumption that the self-modifying agent will use methods of formal logic to reason about the future actions of its modified self. This need not be the case! There are other methods which work well in practice, converge on stable solutions under the right circumstances, and have been well explored in theory and in practice.
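As a rough sketch of what such a simulation-based acceptance test might look like (every name here is hypothetical, the stand-in simulator is meaningless and only keeps the sketch self-contained, and the acceptance rule is just one possible choice):

    #include <stdlib.h>

    #define TRIALS 10000

    typedef struct { unsigned seed; } Environment;   /* stand-in for one simulated setup */

    /* Stand-in simulator: returns an arbitrary but deterministic score for
     * a program in an environment, just so the sketch is self-contained.
     * A real system would actually run the candidate program here. */
    static double run_in_simulation(const char *program, Environment env) {
        unsigned h = env.seed;
        for (const char *p = program; *p; p++) h = h * 31u + (unsigned char)*p;
        return (double)(h % 100u);
    }

    /* Accept a proposed successor only if no sampled environment makes it
     * score worse than the current program on the goal measure. */
    int candidate_is_acceptable(const char *current, const char *candidate) {
        for (int i = 0; i < TRIALS; i++) {
            Environment env = { (unsigned)rand() };   /* sample a setup */
            double u_cur = run_in_simulation(current, env);
            double u_new = run_in_simulation(candidate, env);
            if (u_new < u_cur)
                return 0;   /* found a setup where the successor does worse */
        }
        return 1;           /* no divergence found; adopt with empirical, not logical, confidence */
    }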
I’m reminded of the apocryphal story of two space-age engineers that meet after the fall of the Soviet Union. The American, who oversaw a $1.5 million programme to develop the “Astronaut Pen” which would write in hard vacuum and microgravity environments, was curious to know how his Russian counterpart solved the same problem. “Simple,” he replied, “we used a pencil.”
You could expend significant time, energy, and donations to solve the problem of Löbian obstacles in the context of tiling agents for self-modifying AI. Or you could use an existing off-the-shelf solution that solves the problem in a different way.
[1] http://goertzel.org/GOLEM.pdf
Doesn’t the non-apocryphal version of that story have some relevance?
http://en.wikipedia.org/wiki/Space_Pen
http://www.snopes.com/business/genius/spacepen.asp
Using a space pencil could cause your spaceship to light on fire. Sometimes it pays to be careful.