(Note that the hypothetical process probably doesn’t even output a goal specification; it just outputs a number, which the AI tries to control.)
That’s certainly a nice answer to the question “what’s the domain of the probability distribution and the utility function?” You just say that the utility function is a parameterless definition of a single number. But that seems to lead to the danger of the AI using acausal control to make the hypothetical human output a definition that’s easy to maximize. Do you think that’s unlikely to happen?
ETA: on further thought, this seems to be a pretty strong argument against the whole “indirect normativity” idea, regardless of what kind of object the hypothetical human is supposed to output.
The outer AGI doesn’t control the initial program if the initial program doesn’t listen to the outer AGI. It’s a kind of reverse AI box problem: the program that the AGI runs shouldn’t let the AGI in. This certainly argues that the initial program should take no input, and output its result blindly. That it shouldn’t run the outer AGI internally is then the same kind of AI safety consideration as that it shouldn’t run any other UFAI internally, so it doesn’t seem like an additional problem.
Of course, once you are powerful enough you let the AGI in (or you define a utility function which invokes the AI, which really makes no difference), because this is how you control it.
I don’t understand your comment. [Edit: I probably do now.] You output something that the outer AGI uses to optimize the world as you intend; you don’t “let the AGI in”. You are living in its goal definition, and your decisions determine the AGI’s values.
Are you perhaps referring to the idea that the AGI’s actions control its goal state? But you are not its goal state; you are a principle that determines its goal state, just as the AGI is. You show the AGI where to find its goal state, and the AGI starts working on optimizing it.
What’s the difference between the simulated humans outputting a utility function U’ which the outer AGI will then try to maximize, and the simulated humans just running U’ and the outer AGI trying to maximize the value returned by the whole simulation (and hence U’)? In the case of the latter, you’re “letting the AGI in” by including its definition (explicitly or implicitly via something like the universal prior) in the definition of U’.
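To make the question concrete, here is a minimal toy sketch of the two arrangements being compared. All names (`u_prime`, `simulation_outputs_definition`, `simulation_runs_definition`) and the constant 0.75 are hypothetical illustrations, not anything proposed in the thread, and the sketch assumes U’ happens to be an executable, parameterless computation, which the discussion does not require.

```python
from typing import Callable


def u_prime() -> float:
    # Hypothetical stand-in for the definition U' the simulated humans settle on.
    return 0.75


def simulation_outputs_definition() -> Callable[[], float]:
    # Arrangement 1: the simulation hands back the definition of U',
    # and the outer AGI evaluates it.
    return u_prime


def simulation_runs_definition() -> float:
    # Arrangement 2: the simulation evaluates U' itself and returns
    # only the resulting number.
    return u_prime()


value_via_definition = simulation_outputs_definition()()
value_via_simulation = simulation_runs_definition()
print(value_via_definition == value_via_simulation)  # True
```

In this toy setting the outer AGI ends up targeting the same number either way, which is the sense in which the two arrangements are hard to tell apart.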
OK, I see what Paul probably meant. Let’s say “utility value”, not “utility function”, since that’s what we mean. I don’t think we should be talking about “running utility value”, because utility might be something given by an abstract definition, not the state of execution of any program. As I discussed in the grandparent, the distinction I’m making is between the outer AGI controlling the utility value (which it does) and the outer AGI controlling the simulated researchers that prepare the definition of the utility value (which it shouldn’t be allowed to do, for AI safety reasons). There is a map/territory distinction between the definition of the utility value prepared by the initial program and the utility value itself, which the outer AGI optimizes.
(Also, “utility function” might be confusing, especially for outsiders, who are used to “utility function” meaning a mapping from world states to utility values, whereas Paul is using it to mean a parameterless computation that returns a utility value.)
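The two senses of “utility function” can be contrasted as a difference in type signature. The following is only a toy sketch: `WorldState`, `conventional_utility`, `parameterless_utility`, and the numbers in it are hypothetical, and a real parameterless definition would not simply hard-code a world state.

```python
from typing import Dict

# Hypothetical toy representation of a world state.
WorldState = Dict[str, float]


def conventional_utility(state: WorldState) -> float:
    # Conventional sense: a mapping from world states to utility values.
    return state.get("good", 0.0) - state.get("bad", 0.0)


def parameterless_utility() -> float:
    # Sense used in this thread: a computation with no parameters
    # that returns a single utility value.
    fixed_state: WorldState = {"good": 3.0, "bad": 1.0}
    return conventional_utility(fixed_state)


print(conventional_utility({"good": 3.0, "bad": 1.0}))  # 2.0
print(parameterless_utility())  # 2.0
```

The first function must be told which world state to evaluate; the second takes no arguments at all, so anything it depends on has to be fixed inside its own definition.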
I don’t think we should be talking about “running utility value”, because utility might be something given by an abstract definition, not the state of execution of any program.
I think Paul is thinking that the utility definition that the simulated humans come up with is not necessarily a definition of our actual values, but just something that causes the outer AGI to self-modify into an FAI, and for that purpose it might be enough to define it using a programming language.
As I discussed in the grandparent, the distinction I’m making is between the outer AGI controlling the utility value (which it does) and the outer AGI controlling the simulated researchers that prepare the definition of the utility value (which it shouldn’t be allowed to do, for AI safety reasons).
I think Paul’s intuition here is that the simulated humans (or enhanced humans and/or FAIs they build inside the simulation) may find it useful to “blur the lines”. In other words, the distinction you draw is not a fundamental one but just a safety heuristic that the simulated researchers may decide to discard or modify once they become “powerful enough”. For example, they may decide to partially simulate the outer AGI, or otherwise try to reason about what it might do given the various definitions of U’ that the simulation might ultimately decide upon, once they understand enough theory to see how to do this in a safe way.
Good point. Thanks.