A simplified version of this proposal that applies more generally is to implement a WBE-based FAI project using AGI, before normal WBE becomes available. This way, you only need to figure out how to build a “yield control to this-here program as it would be after N cycles” AGI, and the rest of the FAI project design can be left to the initial WBE-ed team. This would possibly have a side effect of initially destroying the rest of the human world, since the AGI won’t be guided by our values before the N cycles of internal simulation complete (it would care about simulating the internal environment, not about saving external human lives; it might turn out to be FAI-complete to make it safe throughout). But if that internal environment can be guaranteed to be given control afterwards (once a FAI project inside it is complete), then this is eventually a plausible winning plan.
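A minimal toy sketch (in Python, with entirely hypothetical names; nothing here is a workable design) of the goal structure described above: the outer AGI’s only terminal goal is to determine what the initial program would output after N cycles and then hand control to that output.

    # Toy stand-in for the simulated environment (the WBE-ed FAI team and
    # their tools); a real version would be an enormous world-model, which
    # is exactly the part we don't know how to write.
    class InitialProgram:
        def __init__(self):
            self.cycles_run = 0

        def step(self):
            # One "cycle" of simulated research.
            self.cycles_run += 1

        def output(self):
            # Eventually the simulated team produces a successor program
            # (represented here by a plain string) that should get control.
            return f"successor-program-after-{self.cycles_run}-cycles"

    def yield_control_goal(program: InitialProgram, n_cycles: int) -> str:
        """The outer AGI's whole goal in this sketch: run (or predict) the
        initial program for n_cycles, then yield control to its output."""
        for _ in range(n_cycles):
            program.step()
        return program.output()

    print(yield_control_goal(InitialProgram(), n_cycles=1_000_000))

The parts the sketch elides are exactly the ones flagged above: making the outer AGI care only about this output, and keeping the external world intact while the N cycles run.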
The “yield control” AGI seems problematic to define, much less to imagine existing. Do you think it is plausible?
This is certainly the line of thought that led me here, however.
The general “goal” of this system is to make sure the world is controlled by the decisions of the program produced by the initial program, so the simulation of the initial program and yielding of control to its output are subgoals of that. I don’t see how to make that work, but I don’t see why it can’t be done either. The default problems of AGI instrumental drives get siphoned off into the possible initial destructiveness (but don’t persist later), and the problem of figuring out human values gets delegated to the humans inside the initial program (which is the part that seems potentially much harder to solve pre-WBE than whatever broadly goal-independent decision theory is needed to make this plan well-defined).
This seems to me like the third plausible winning plan, the other two being (1) figuring out and implementing pre-WBE FAI and (2) making sure the WBE shift (which shouldn’t be hardware-limited for the first-runner advantage to be sufficient) is dominated by a FAI project. Unless this somehow turns out to be FAI-complete (which is probable, given that almost any plan is), it seems strictly easier than pre-WBE FAI, although it comes at the significant cost of possibly initially destroying the current world, a problem the other two plans don’t (by default) have.
“Destroy the world” doesn’t seem to be a big problem to me. Paul’s (proposed) AGI can be viewed as not directly caring about our world, but only about a world/computation defined by H and T (let’s call that HT World). If it can figure out that its preferences for HT World can be best satisfied by it performing actions that (as a side effect) cause it to take over our world, then it seems likely that it can also figure out that it should take over our world in a non-destructive way. I’m more worried about whether (given realistic amounts of initial computing power) it would manage to do anything at all.
I’m not talking about Paul’s proposal in particular, but about eventually-Friendly AIs in general. Their defining feature is that they have a correct Friendly goal given by a complicated definition that leaves a lot of logical uncertainty about the goal until it’s eventually made more explicit. So we might explore the neighborhood of normal FAIs, increasing the initial logical uncertainty about their goal, so that they become more and more prone to an initial pursuit of generic instrumental gains at the expense of what they eventually realize to be their values.
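A crude numeric toy (all numbers made up, purely to illustrate the tendency described above): while the goal definition is mostly unresolved, the estimated value of acting on the current guess of the values is low, so generic instrumental gains dominate; as the definition gets resolved, object-level pursuit of the values takes over.

    def value_of_acting_on_goal_guess(fraction_resolved: float) -> float:
        # Acting on a goal you barely understand is worth little (and may look
        # like "initial destructiveness" from the outside).
        return fraction_resolved

    def value_of_instrumental_gains(_fraction_resolved: float) -> float:
        # Resources and compute help almost regardless of what the goal turns
        # out to be, so their estimated value is roughly flat.
        return 0.6

    for resolved in (0.0, 0.25, 0.5, 0.75, 1.0):
        options = {
            "act on current guess of values": value_of_acting_on_goal_guess(resolved),
            "pursue generic instrumental gains": value_of_instrumental_gains(resolved),
        }
        print(f"goal {resolved:.0%} resolved -> {max(options, key=options.get)}")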
Oh, please reinterpret my comment as replying to this comment of yours. (That one is specifically talking about Paul’s proposal, right?)
Well, yes, but I interpreted the problem of an impossibly complicated value definition (which does seem to be a problem with Paul’s specific proposal, even if we assume that it theoretically converges to a FAI) as the eFAI* never coming out of its destructive phase, and hence possibly just eating the universe without producing anything of value. So “destroy the world” is in a sense the sole manifestation of the problem with a hypothetical implementation of that proposal...
[* eFAI = eventually-Friendly AI, let’s coin this term]
Pre-WBE FAI can initially destroy the world too, if its utility function specification is as complex as CEV for example.
Right, but it’s not clear that this is a natural flaw for other possible FAI designs, in the way that it seems to be for this one. Here, we start the AGI without an understanding of human values; only the output of the initial program, which will become available some time in the future, is expected to have that understanding, so there is nothing to morally guide the AGI in the meantime. By “solving FAI” I meant that we do get some technical understanding of human values by the time the thing is launched, which might be enough to avoid the carnage.
(This whole line of reasoning creates a motivation for thinking about Oracle AI boxing. Here we have AGIs that become FAIs eventually, but might be initially UFAI-level dangerous.)
The general “goal” of this system is to make sure the world is controlled by the decisions of the program produced by the initial program, so the simulation of the initial program and yielding of control to its output are subgoals of that.

My proposal seems like the default way to try and implement that. But I definitely agree that it’s reasonable to think about this aspect of the problem more.
I think it’s useful to separate the problem of pointing the external AGI at the output of a specific program from the problem of arranging the structure of the initial program so that it produces a desirable output. The structure of the initial program shouldn’t be overengineered, since its role is to perform basic philosophical research that we don’t understand how to do, so the focus there should mainly be on safeguards that promote desirable research dynamics (and prevent UFAI risks inside the initial program).
On the other hand, the way in which the AGI uses the output of the initial program (i.e. the notion of preference) has to be understood from the start; this is the decision theory part (you can’t stop the AGI; it will forever optimize according to the output of the initial program, so it should be possible to give its optimization target a definition that expresses human values, even though we might not know at that point what kind of notion human values are an instance of). I don’t think it’s reasonable to force a “real-valued utility function” interpretation or the like on this; it should be much more flexible (for example, it’s not even clear what “worlds” should be optimized).
The approach I was taking was: the initial program does whatever it likes (including in particular simulation of the original AI), and then outputs a number, which the AI aims to control. This (hopefully) allows the initial program to control the AI’s behavior in detail, by encouraging it to (say) thoroughly replace itself with a different sort of agent. Also, note that the ems can watch the AI reasoning about the ems, when deciding how to administer reward.
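A toy illustration of that reward channel (hypothetical names; only the shape of the idea, not the actual proposal): the AI’s utility is whatever number the initial program eventually outputs, and the initial program gets to inspect what the AI did before choosing that number.

    def initial_program_output(ai_action: str) -> float:
        """Stand-in for the simulated ems: they look at the AI's behavior
        (here reduced to a single chosen action) and output a number."""
        return 1.0 if ai_action == "replace_self_with_agent_specified_by_ems" else 0.0

    def ai_choose_action(candidate_actions):
        """The AI aims to control that number: it picks the action it expects
        to make the initial program's output largest. In this toy, its
        'expectation' is just running the stand-in program directly."""
        return max(candidate_actions, key=initial_program_output)

    actions = ["grab_more_hardware", "do_nothing",
               "replace_self_with_agent_specified_by_ems"]
    print(ai_choose_action(actions))  # prints the action the ems reward

Outside a toy, the AI could not literally run the initial program and would only reason about its output under logical uncertainty, which is where the difficulties discussed above come in.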
I agree that the internal structure and the mechanism for pointing at the output should be thought about largely separately. (Although there are some interactions.)
the initial program does whatever it likes (including in particular simulation of the original AI), and then outputs a number, which the AI aims to control.

I don’t think this can be right. I expect it’s impossible to create a self-contained abstract map of the world (or of human value, of which the world is an aspect); the process of making observations has to be part of the solution.
(But even if we are talking about a “number”, what kind of number is that? Why would something simple like a real number be sufficient to express relevant counterfactual utility values that are to be compared? I don’t know enough to make such assumptions.)
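One toy reading of “more flexible than a real number” (purely illustrative, with made-up names): the initial program’s output could itself be a comparison procedure over descriptions of outcomes, which need not reduce to a single real-valued score and might, for instance, leave some pairs incomparable.

    from typing import Optional

    def preferred(outcome_a: str, outcome_b: str) -> Optional[str]:
        """Stand-in 'output of the initial program': a pairwise comparison
        over outcome descriptions. Returning None marks the pair as
        incomparable, which no single real-valued score can express."""
        ranking = {"flourishing_future": 2, "status_quo": 1}
        if outcome_a in ranking and outcome_b in ranking:
            return outcome_a if ranking[outcome_a] >= ranking[outcome_b] else outcome_b
        return None

    print(preferred("flourishing_future", "status_quo"))  # flourishing_future
    print(preferred("flourishing_future", "paperclips"))  # None (incomparable)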