The general “goal” of this system is to make sure that the world ends up controlled by the decisions of the program produced by the initial program; simulating the initial program and yielding control to its output are subgoals of that.
My proposal seems like the default way to try to implement that. But I definitely agree that it’s reasonable to think about this aspect of the problem more.
I think it’s useful to separate the problem of pointing the external AGI at the output of a specific program from the problem of arranging the structure of the initial program so that it produces a desirable output. The structure of the initial program shouldn’t be overengineered, since its role is to perform basic philosophical research that we don’t understand how to do; the focus there should mainly be on safeguards that promote desirable research dynamics (and prevent UFAI risks inside the initial program).
On the other hand, the way in which the AGI uses the output of the initial program (i.e. the notion of preference) has to be understood from the start; this is the decision theory part. (You can’t stop the AGI; it will forever optimize according to the output of the initial program, so it should be possible to give its optimization target a definition that expresses human values, even though we might not know at that point what kind of thing human values are an instance of.) I don’t think it’s reasonable to force a “real-valued utility function” interpretation or the like on this; it should be much more flexible (for example, it’s not even clear what “worlds” should be optimized).
The approach I was taking was: the initial program does whatever it likes (including in particular simulation of the original AI), and then outputs a number, which the AI aims to control. This (hopefully) allows the initial program to control the AI’s behavior in detail, by encouraging it to (say) thoroughly replace itself with a different sort of agent. Also, note that the ems can watch the AI reasoning about the ems, when deciding how to administer reward.
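To make the structure being described concrete, here is a minimal sketch (illustrative Python only; `InitialProgram`, `choose_action`, and `predicted_output` are hypothetical names, not part of any actual proposal): a fixed initial program may simulate the AI and eventually outputs a number, and the AI’s only aim is to make that number come out as large as possible.

```python
# Minimal sketch, under the assumption that the initial program's verdict is a
# single number and the AI optimizes it. All names here are hypothetical.

from typing import Callable

# Stand-in type for the initial program: given a description of the AI (e.g.
# its source code), it deliberates for as long as it likes - possibly by
# simulating the AI - and finally returns a number.
InitialProgram = Callable[[str], float]


def choose_action(ai_source: str,
                  actions: list[str],
                  predicted_output: Callable[[str, str], float]) -> str:
    """Pick the action whose predicted effect on the initial program's
    eventual output is largest.

    `predicted_output(ai_source, action)` stands in for the AI's own
    (uncertain) model of what number the initial program would output if the
    AI, with this source code, took this action; the AI can only influence
    the output through that channel.
    """
    return max(actions, key=lambda a: predicted_output(ai_source, a))
```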
I agree that the internal structure and the mechanism for pointing at the output should be thought about largely separately. (Although there are some interactions.)
“the initial program does whatever it likes (including in particular simulation of the original AI), and then outputs a number, which the AI aims to control.”
I don’t think this can be right. I expect it’s impossible to create a self-contained abstract map of the world (or of human value, of which the world is an aspect); the process of making observations has to be part of the solution.
(But even if we are talking about a “number”, what kind of number is that? Why would something simple like a real number be sufficient to express relevant counterfactual utility values that are to be compared? I don’t know enough to make such assumptions.)
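To illustrate the type-level worry (hypothetical Python types, purely for contrast): committing the initial program to a single real-valued output is a much narrower interface than, say, returning a comparator over descriptions of outcomes, which leaves open what a “world” is and whether the preference even forms a total order.

```python
# Hedged illustration only; these types are hypothetical, not a proposal.

from typing import Callable

# Narrow commitment: the verdict is one real-valued score.
ScalarVerdict = float

# Broader commitment: the verdict is a procedure that says whether one
# described outcome is preferred to another; it need not reduce to a number,
# and need not define a total order.
WorldDescription = str
PreferredTo = Callable[[WorldDescription, WorldDescription], bool]
```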