Why frame this problem as about tiling/self-modification instead of planning/self-prediction? If you do the latter though, the problem looks more like an AGI (or AI capability) problem than an FAI (or AI safety) problem, which makes me wonder if it’s really a good idea to publicize the problem and invite more people to work on it publicly.
Regarding section 4.3 on probabilistic reflection, I didn’t get a good sense from the paper of how Christiano et al.’s formalism relates to the concrete problem of AI self-modification or self-prediction. For example, what are the functions P and p supposed to translate to in terms of an AI and its descendant or future self?
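(For reference, my reading of the schema in question, which may not match the paper’s exact notation, is roughly: for every sentence φ and rationals a < b,

\[
a < P(\varphi) < b \;\Longrightarrow\; P\big(\, a < p(\ulcorner \varphi \urcorner) < b \,\big) = 1,
\]

where P is the “outer” probability assignment used to state the theorem and p is the “inner” probability symbol that sentences of the language can mention. It’s the operational meaning of these two that I’m asking about.)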
One argument in favor of this being relevant specifically to FAI is that evolution kludged us up, so there is no strong reason to think that AGI projects with an incomplete understanding of the problem space won’t eventually kludge up an AGI that is able to solve these problems itself and successfully navigate an intelligence explosion—and then paperclip the universe, since the incomplete understanding of the human researchers creating the seed AI wouldn’t suffice for giving the seed AI stable goals. I.e., solving this in some way looks probably necessary for reaching AI safety at all, but only possibly helpful for AI capability.
I’m not entirely unworried about that concern, but I’m less worried about it than about making AGI more interesting by doing interesting in-principle work on it, and I currently feel that even the latter danger is outweighed by the danger of not tackling the object-level problems early enough to actually make progress before it’s too late.
One argument in favor of this being relevant specifically to FAI is that evolution kludged us up, so there is no strong reason to think that AGI projects with an incomplete understanding of the problem space won’t eventually kludge up an AGI that is able to solve these problems itself and successfully navigate an intelligence explosion—and then paperclip the universe, since the incomplete understanding of the human researchers creating the seed AI wouldn’t suffice for giving the seed AI stable goals.
This sentence has kind of a confusing structure and I’m having trouble understanding the logic of your argument. Could you rewrite it? Also, part of my thinking, which I’m not sure if you’ve addressed, is that an AGI that fails the Löbian obstacle isn’t just unable to stably self-modify; it’s unable to do even the simplest kind of planning, because it can’t predict that its future selves won’t do something crazy. A “successful” (ETA: meaning one that FOOMs) AGI project has to solve this planning/self-prediction problem somehow. Why wouldn’t that solution also apply to the self-modification problem?
Sorry for being confusing, and thanks for giving me a chance to try again! (I did write that comment too quickly due to lack of time.)
So, my point is, I think that there is very little reason to think that evolution somehow had to solve the Löbstacle in order to produce humans. We run into the Löbstacle when we try to use the standard foundations of mathematics (first-order logic + PA or ZFC) in the obvious way to make a self-modifying agent that will continue to follow a given goal after having gone through a very large number of self-modifications. We don’t currently have any framework not subject to this problem, and we need one if we want to build a Friendly seed AI. Evolution didn’t have to solve this problem. It’s true that evolution did have to solve the planning/self-prediction problem, but it didn’t have to solve it with extremely high reliability. I see very little reason to think that if we understood how evolution solved the problem it solved, we would then be really close to having a satisfactory Löbstacle-free decision theory to use in a Friendly seed AI—and thus, conversely, I see little reason to think that an AGI project must solve the Löbstacle in order to solve the planning/prediction problem as well as evolution did.
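(To spell out where the obstacle comes from, here is the standard sketch, using Prov_T for T’s provability predicate. An agent reasoning in a theory T that wants to license a successor which also reasons in T needs something like the soundness schema “if T proves an action is safe, then it is safe.” But Löb’s theorem says that for any sentence φ,

\[
T \vdash \mathrm{Prov}_T(\ulcorner \varphi \urcorner) \rightarrow \varphi \quad\Longrightarrow\quad T \vdash \varphi,
\]

so T can prove an instance of that schema only for sentences it already proves outright, and the naive argument “my successor only takes actions it proves safe; its proofs are sound; hence its actions are safe” is blocked.)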
I can more easily conceive of the possibility (but I think it rather unlikely, too) that solving the Löbstacle is fundamentally necessary to build an agent that can go through millions of rewrites without running out of steam: perhaps without solving the Löbstacle, each rewrite step will have an independent probability of making the machine wirehead (for example), so an AGI doing no better than evolution will almost certainly wirehead during an intelligence explosion. But in this scenario, since evolution built us, an AGI project might build an AI that solves the planning/self-prediction problem as well as we do, and that AI might then go and solve the Löbstacle and go through a billion self-modifications and take over the world. (The human operators might intervene and un-wirehead it every 50,000 rewrites or so until it’s figured out a solution to the Löbstacle, for example.) So even in this scenario, the Löbstacle doesn’t seem a barrier to AI capability to me; but it is a barrier to FAI, because if it’s the AI that eventually solves the Löbstacle, the superintelligence down the line will have whatever values the AI had at the time it solved the problem. This was what I meant by saying that the AGI would “successfully navigate an intelligence explosion—and then paperclip the universe”.
(On the other hand, while I only think of the above as an outside possibility, I think there’s more than an outside possibility that a clean reflective decision theory could be helpful for an AGI project, even if I don’t think it’s a necessary prerequisite. So I’m not entirely unsympathetic to your concerns.)
Does the above help to clarify the argument I had in mind?
So, my point is, I think that there is very little reason to think that evolution somehow had to solve the Löbstacle in order to produce humans.
So you think that humans do not have a built-in solution to the Löbstacle, and you must also think we are capable of building an FAI that does have a built-in solution to the Löbstacle. That means an intelligence without a solution to the Löbstacle can produce another intelligence that shares its values and does have a solution to the Löbstacle. But then why is it necessary for us to solve this problem? (You said earlier “solving this in some way looks probably necessary for reaching AI safety at all, but only possibly helpful for AI capability.”) Why can’t we instead build an FAI without solving this problem, and depend on the FAI to solve the problem while it’s designing the next-generation FAI?
Also earlier you said
I currently feel that even the latter danger is outweighed by the danger of not tackling the object-level problems early enough to actually make progress before it’s too late.
I’ve been arguing with Eliezer and Paul about this recently, and thought that I should get the details of your views too. Have you been following the discussions under my most recent post?
Sorry for the long-delayed reply, Wei!

So you think that humans do not have a built-in solution to the Löbstacle, and you must also think we are capable of building an FAI that does have a built-in solution to the Löbstacle. That means an intelligence without a solution to the Löbstacle can produce another intelligence that shares its values and does have a solution to the Löbstacle.
Yup.
But then why is it necessary for us to solve this problem? [...] Why can’t we instead built an FAI without solving this problem, and depend on the FAI to solve the problem while it’s designing the next generation FAI?
I have two main reasons in mind. First, suppose you grant that (a) this is a problem that would require years of serial human research to solve, and (b) it looks much easier to build this into an AI designed from scratch than to bolt it onto an existing AI design that was created without taking these considerations into account. If you still think that (c) it would be a good plan to have the first-generation FAI solve this problem when building the next-generation FAI, then you need to assume that the FAI will be much better at AGI design than its human designers before it executes its first self-rewrite: by assumption, the human team would still need years to solve the problem at that point, and the plan wouldn’t be particularly helpful if the first-generation FAI also needed a similar amount of time or longer. But it seems unlikely to me that we first need to build ultraintelligent machines à la I. J. Good, far surpassing humans, before we can get an intelligence explosion: it seems to me that most of the probability mass should be on the required level of AGI research ability being at or below the level of the human research team working on the AGI. I admit that one possible strategy could be to continue having humans improve the initial FAI until it is superintelligent and then ask it to write a successor from scratch, solving the Löbstacle in the process, but it doesn’t seem particularly likely that this would be cheaper than solving the problem beforehand.
Second, if we followed this plan, when building the initial FAI we would be unable to use mathematical logic (or other tools sufficiently similar to be subject to the same issues) in a straightforward way when having it reason about its potential successor. This cuts off a large part of the design space that I’d naturally be looking at. Yes, if we can do it, then it’s possible in principle to get an FAI to do it, but mimicking human reasoning doesn’t seem likely to me to be the easiest way to build a safe AGI.
Have you been following the discussions under my most recent post?
I agree with you that relying on an FAI team to solve a large number of philosophical problems correctly seems dangerous, although I’m sympathetic to Eliezer’s criticism of your outside-view arguments—I essentially agree with your conclusions, but I think I use more inside-view reasoning to arrive at them (I would need to think longer to tease this apart). I agree with Paul that something like CEV for philosophy, in addition to values, should probably be part of an FAI design. I agree with you that progress in metaphilosophy would be very valuable, but I do not have any concrete leads to follow. Still, I think that having good solutions to some of these problems is not unlikely to be helpful for FAI design (and more helpful to FAI than to uFAI), so some amount of work allocated to these philosophical problems looks like a good thing to me; and I also think that working on these problems does on average reduce the probability of making a bad mistake, even if we manage to have the FAI do philosophy itself and have it checked by “coherent extrapolated philosophy”.
You quoted my earlier comment that I think making object-level progress is important enough to be a net positive despite making AGI research more interesting, but I don’t really feel that your post or the discussion below it contains much in the way of arguments about that—could you elaborate on the connection?
(I endorse essentially all of Benja’s reply above.)
Thanks, that’s very helpful. (I meant to write a longer reply but haven’t gotten around to it yet. Didn’t want you to feel ignored in the meantime.)