(Note that the hypothetical process probably doesn’t even output a goal specification, it just outputs a number, which the AI tries to control.)
One potential worry is that the human subject must be over some minimal threshold of intelligence for this scheme to work. A village fool would fail. How do I convince myself that the threshold is below the “reasonably intelligent human” level?
The hope is something like: “We can reason about the outputs of this process, so an AI as smart as us can reason about the outputs of this process (perhaps by realizing it can watch or ask us to learn about the behavior of the definition).” The bar the AI has to meet is to realize, basically, what is going on in the definition. This assumes, of course, not only that the process actually works, but that it works for the reasons we believe it works.
I have doubts about this, and it seems generally important to think about whether particular models of UDT could make these inferences. The sketchy part seems to be the AI looking out at the world and drawing mathematical inferences from the assumption that its environment is a draw from a universal distribution. There are two ways you could imagine this going wrong:
1. Should / will the AI assume its environment is universally distributed? It seems like we’ll need something like this, so that the AI even has an understanding of how to build infrastructure, but when U is complicated it isn’t clear.
1b. Can the AI even reason about where its utility function came from, in order to draw inferences about its mathematical properties? (This seems likely, given 1.)
2. If the AI actually assumes its environment is universally distributed, then it is living in a simulation, and maybe the simulator has some perverse motives, so the AI should draw very different conclusions from its environment.
I expect UDT, properly formulated, avoids both issues, and if it doesn’t, we are going to have to take these things on anyway. But it needs some thought.
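For readers unfamiliar with the term: “universally distributed” here presumably refers to something like a universal (Solomonoff-style) prior, under which the probability of observing an environment history x is, schematically:

```latex
% Universal (Solomonoff) prior over finite strings x, relative to a fixed
% universal prefix machine U. Here p ranges over programs, |p| is the
% length of p in bits, and U(p) = x* means U's output on p starts with x.
M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|}
```

This is only a gloss on the standard definition; the discussion above is about whether the AI should and will adopt something of this shape, not about this particular formula.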
Should/will the AI assume its environment is universally distributed?
Maybe the AI could also ask the hypothetical human upload for the right universal distribution to use? Such an AI probably wouldn’t work out of the box, but there could be some sort of bootstrapping scheme...
Maybe the AI could also ask the hypothetical human upload for the right universal distribution to use?
Certainly, but it’s not even clear what that distribution should be over, and whether considering probability distributions at all is the right thing to do. The initial AGI needs some operating principles, but these principles should allow fundamental correction, as the research on the inside proceeds (for which they have to be good enough initially).
(Note that the hypothetical process probably doesn’t even output a goal specification, it just outputs a number, which the AI tries to control.)
That’s certainly a nice answer to the question “what’s the domain of the probability distribution and the utility function?” You just say that the utility function is a parameterless definition of a single number. But that seems to lead to the danger of the AI using acausal control to make the hypothetical human output a definition that’s easy to maximize. Do you think that’s unlikely to happen?
ETA: on further thought, this seems to be a pretty strong argument against the whole “indirect normativity” idea, regardless of what kind of object the hypothetical human is supposed to output.
The outer AGI doesn’t control the initial program if the initial program doesn’t listen to the outer AGI. It’s a kind of reverse AI box problem: the program that the AGI runs shouldn’t let the AGI in. This certainly argues that the initial program should take no input, and output its result blindly. That it shouldn’t run the outer AGI internally is then the same kind of AI safety consideration as that it shouldn’t run any other UFAI internally, so it doesn’t seem like an additional problem.
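The constraint described above (the initial program takes no input and outputs its result blindly) can be sketched as a toy type distinction. None of these names or numbers are from the discussion; they are illustrative placeholders only:

```python
# Toy sketch of the "reverse AI box" constraint: the goal-defining program
# must not listen to the outer AGI. Hypothetical names, for illustration.

# Unsafe shape: the goal-defining program reads a channel the outer AGI
# can influence, so anything computed from that channel is under the
# AGI's control.
def unsafe_goal_program(agi_message: bytes) -> float:
    # The returned "utility value" is steerable via agi_message.
    return float(len(agi_message))

# Safe shape, per the comment above: take no input, output blindly.
def safe_goal_program() -> float:
    # Placeholder for "run the sealed deliberation process to completion
    # and return the single number it defines".
    return 1.0
```

The point of the signature difference is just that `safe_goal_program` has no parameter through which the outer AGI could inject information.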
Of course, once you are powerful enough you let the AGI in (or you define a utility function which invokes the AI, which is really no different), because this is how you control it.
I don’t understand your comment. [Edit: I probably do now.] You output something that the outer AGI uses to optimize the world as you intend; you don’t “let the AGI in”. You are living in its goal definition, and your decisions determine the AGI’s values.
Are you perhaps referring to the idea that the AGI’s actions control its goal state? But you are not its goal state; you are a principle that determines its goal state, just as the AGI is. You show the AGI where to find its goal state, and the AGI starts working on optimizing it.
What’s the difference between the simulated humans outputting a utility function U’ which the outer AGI will then try to maximize, and the simulated humans just running U’, with the outer AGI trying to maximize the value returned by the whole simulation (and hence U’)? In the latter case, you’re “letting the AGI in” by including its definition (explicitly, or implicitly via something like the universal prior) in the definition of U’.
OK, I see what Paul probably meant. Let’s say “utility value”, not “utility function”, since that’s what we mean. I don’t think we should be talking about “running utility value”, because utility might be something given by an abstract definition, not the state of execution of any program. As I discussed in the grandparent, the distinction I’m making is between the outer AGI controlling the utility value (which it does) and the outer AGI controlling the simulated researchers that prepare the definition of the utility value (which it shouldn’t be allowed to do, for AI safety reasons). There is a map/territory distinction between the definition of the utility value prepared by the initial program and the utility value itself, optimized by the outer AGI.
(Also, “utility function” might be confusing especially for outsiders who are used to “utility function” meaning a mapping from world states to utility values, whereas Paul is using it to mean a parameterless computation that returns a utility value.)
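The terminological distinction can be made concrete with a toy sketch. Everything here is illustrative (the names, keys, and numbers are invented, not part of anyone’s proposal):

```python
# Sense 1: the conventional "utility function" -- a mapping from world
# states to utility values.
def utility_mapping(world_state: dict) -> float:
    # Hypothetical scoring rule over a toy world state.
    return float(world_state.get("flourishing", 0))

# Sense 2: Paul's usage -- a parameterless computation defining a single
# number, which the outer AGI then tries to make large.
def utility_value() -> float:
    # Placeholder for "whatever number the hypothetical process outputs".
    return 42.0
```

In sense 2 there is no argument at all: the AGI isn’t evaluating a function on candidate worlds, it is trying to control the value of one fixed, parameterless definition.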
I don’t think we should be talking about “running utility value”, because utility might be something given by an abstract definition, not the state of execution of any program.
I think Paul is thinking that the utility definition that the simulated humans come up with is not necessarily a definition of our actual values, but just something that causes the outer AGI to self-modify into an FAI, and for that purpose it might be enough to define it using a programming language.
As I discussed in the grandparent, the distinction I’m making is between the outer AGI controlling the utility value (which it does) and the outer AGI controlling the simulated researchers that prepare the definition of the utility value (which it shouldn’t be allowed to do, for AI safety reasons).
I think Paul’s intuition here is that the simulated humans (or enhanced humans and/or FAIs they build inside the simulation) may find it useful to “blur the lines”. In other words, the distinction you draw is not a fundamental one but just a safety heuristic that the simulated researchers may decide to discard or modify once they become “powerful enough”. For example they may decide to partially simulate the outer AGI or otherwise try to reason about what it might do given various definitions of U’ the simulation might ultimately decide upon, once they understand enough theory to see how to do this in a safe way.
Good point. Thanks.