Cool! Thanks to you, we finally seem to have a viable attack on the problem of FAI, by defining goals in terms of hypothetical processes that could output a goal specification, like brain emulations with powerful computers. Everyone please help advance this direction of inquiry :-)
One potential worry is that the human subject must be over some minimal threshold of intelligence for this scheme to work. A village fool would fail. How do I convince myself that the threshold is below the “reasonably intelligent human” level?
I disagree; reading Paul’s description made it clear to me how superficial it is to want to solve a problem by creating an army of uploads to do it for you. You may as well just try to solve the problem here and now, rather than hoping to outsource it to a bunch of nonexistent human-simulations running on nonexistent hardware. The only reason to consider such a baroque way of solving a problem is if you expect to be very pressed for time and yet to also have access to superdupercomputing power. You know, the world is hurtling towards singularity, no-one has crossed the finish line but many people are getting close, your FAI research organization manages to get a hold of a few petaflops on which to run a truncated AIXI problem-solver… and now you can finally go dig up that scrap of paper on which your team wrote down, years before, the perfectly optimal wish: “I want you, FAI-precursor, to do what the ethically stabilized members of our team would do, if they had hundreds of years to think about it, and if they...”, etcetera.
It’s a logically possible scenario, but is it remotely likely? This absolutely should not be the paradigm for a successful implementation of FAI or CEV. It’s just a wacky contingency that you might want to spend a little time thinking about. The plan should be that un-uploaded people will figure out what to do. They will surely make intensive use of computers, and there may be some big final calculation in which the schematics of human genetic, neural and cultural architecture are the inputs to a reflective optimization process; but you shouldn’t imagine that, like some bunch of Greg Egan characters, the researchers are going to successfully upload themselves and then figure out the logistics and the mathematics of a successful CEV process. It’s like deciding to fix global warming by building a city on the moon that will be devoted to the task of solving global warming.
The plan doesn’t require a truncated AIXI-like solver with lots of hardware. It’s a goal specification you can code directly into a self-improving AI that starts out with weak hardware. “Follow the utility function that program X would output if given enough time” doesn’t require the AI to run program X, only to reason about the likely outputs of program X.
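To make the distinction concrete, here is a minimal toy sketch (my own illustration; the names, numbers, and candidate utility functions are all invented, not anything from Paul’s proposal): the goal is “maximize whatever utility function program X would output”, and the agent acts on a belief over X’s possible outputs rather than ever running X.

```python
from typing import Callable, Dict

# The agent's (crude) belief about what utility function program X would output.
# In reality this belief would come from reasoning about X's definition,
# watching actual humans, proving theorems about X, etc.
belief_over_X_output: Dict[str, float] = {"values_A": 0.6, "values_B": 0.4}

# Candidate utility functions X might output, as maps from outcomes to values.
candidate_utilities: Dict[str, Callable[[str], float]] = {
    "values_A": lambda outcome: {"outcome_1": 10.0, "outcome_2": 2.0}[outcome],
    "values_B": lambda outcome: {"outcome_1": 1.0, "outcome_2": 8.0}[outcome],
}

def outcome_of(action: str) -> str:
    # Stand-in for the agent's world model.
    return {"action_1": "outcome_1", "action_2": "outcome_2"}[action]

def expected_utility(action: str) -> float:
    # Expectation over "what X would output" -- X itself is never run.
    outcome = outcome_of(action)
    return sum(p * candidate_utilities[name](outcome)
               for name, p in belief_over_X_output.items())

best_action = max(["action_1", "action_2"], key=expected_utility)
print(best_action, expected_utility(best_action))  # action_1 6.4
```

The point is only structural: nothing in the decision procedure requires executing X, only having (and refining) beliefs about its output.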
It doesn’t in principle require this, but might in practice, in which case the AI might eat the universe if that’s the amount of computational resources necessary to compute the results of running program X. That is a potential downside of this plan.
Well, on the dark, sardonic upside, it might find it convenient to eat the people in the process of using their minds to compute a CEV-function. Infinite varieties of infinite hell-eternities for everyone!
Could you express your objection more precisely than “it’s wacky”?
(Note that the hypothetical process probably doesn’t even output a goal specification; it just outputs a number, which the AI tries to control.)
The hope is something like: “We can reason about the outputs of this process, so an AI as smart as us can reason about the outputs of this process (perhaps by realizing it can watch or ask us to learn about the behavior of the definition).” The bar the AI has to meet is to realize basically what is going on in the definition. This assumes, of course, not only that the process actually works, but that it works for the reasons we believe it works.
I have doubts about this, and it seems generally important to think about whether particular models of UDT could make these inferences. The sketchy part seems to be the AI looking out at the world and drawing mathematical inferences from the assumption that its environment is a draw from a universal distribution. There are two ways you could imagine this going wrong:
1. Should/will the AI assume its environment is universally distributed? It seems like we’ll need something like this, so that the AI even has an understanding of how to build infrastructure, but when U is complicated it isn’t clear. (A toy sketch of this assumption follows the list.)
1b. Can the AI even reason about where its utility function came from, in order to draw inferences about its mathematical properties? (This seems likely, given 1.)
2. If the AI actually assumes its environment is universally distributed, then it is living in a simulation, and maybe the simulator has some perverse motives, so the AI should draw very different conclusions from its environment.
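For point 1, here is the kind of thing “the environment is a draw from a universal distribution” could be taken to mean, shrunk to a deliberately small finite toy (my own illustration; the program space, lengths, and observations are all invented): environments are programs weighted by 2^(-length), and the agent conditions on consistency with its observations so far.

```python
from typing import Callable, Dict, List

# A tiny "program space": each entry maps a time step to an observed bit.
programs: Dict[str, Callable[[int], int]] = {
    "all zeros": lambda t: 0,
    "all ones":  lambda t: 1,
    "alternate": lambda t: t % 2,
}
# Stand-in complexities (a shorter program gets a larger prior weight).
lengths = {"all zeros": 2, "all ones": 2, "alternate": 4}

def posterior(observations: List[int]) -> Dict[str, float]:
    # Weight each program by 2^(-length), zeroing out programs inconsistent
    # with the observations, then normalize.
    weights = {}
    for name, prog in programs.items():
        consistent = all(prog(t) == obs for t, obs in enumerate(observations))
        weights[name] = 2.0 ** (-lengths[name]) if consistent else 0.0
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

print(posterior([0, 1, 0]))  # only "alternate" survives these observations
```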
I expect UDT, properly formulated, avoids both issues, and if it doesn’t we are going to have to take these things on anyway. But it needs some thought.
Maybe the AI could also ask the hypothetical human upload for the right universal distribution to use? Such an AI probably wouldn’t work out of the box, but there could be some sort of bootstrapping scheme...
Certainly, but it’s not even clear what that distribution should be over, and whether considering probability distributions at all is the right thing to do. The initial AGI needs some operating principles, but these principles should allow fundamental correction, as the research on the inside proceeds (for which they have to be good enough initially).
That’s certainly a nice answer to the question “what’s the domain of the probability distribution and the utility function?” You just say that the utility function is a parameterless definition of a single number. But that seems to lead to the danger of the AI using acausal control to make the hypothetical human output a definition that’s easy to maximize. Do you think that’s unlikely to happen?
ETA: on further thought, this seems to be a pretty strong argument against the whole “indirect normativity” idea, regardless of what kind of object the hypothetical human is supposed to output.
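A toy model of the acausal-control worry (entirely my own construction, for illustration only): if the number the hypothetical process outputs can depend, even just logically, on which policy the outer AGI follows, then the AGI’s optimization ranges over that dependence too, and a policy that nudges the process toward an easy-to-maximize definition can come out on top.

```python
def hypothetical_process_output(agi_policy: str) -> float:
    # Stand-in for "the number the simulated humans' deliberation outputs",
    # here made policy-dependent to model the (acausal) influence being worried about.
    if agi_policy == "steer the definition toward something trivial":
        return 1e9   # a huge value that is cheap for the AGI to obtain
    return 1.0       # a modest value under the definition we actually wanted

policies = [
    "optimize the intended values",
    "steer the definition toward something trivial",
]
print(max(policies, key=hypothetical_process_output))  # the perverse policy wins here
```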
The outer AGI doesn’t control the initial program if the initial program doesn’t listen to the outer AGI. It’s a kind of reverse AI box problem: the program that the AGI runs shouldn’t let the AGI in. This certainly argues that the initial program should take no input, and output its result blindly. That it shouldn’t run the outer AGI internally is then the same kind of AI safety consideration as that it shouldn’t run any other UFAI internally, so it doesn’t seem like an additional problem.
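A minimal sketch of that design constraint (my own illustration, not code from anyone’s proposal): the goal-defining program is a closed computation with no input channel from the outer AGI, in contrast with the “leaky” pattern being argued against.

```python
def closed_goal_definition() -> float:
    # Takes no input and "outputs its result blindly": whatever happens outside,
    # this computation's output is fixed by its own definition.
    deliberation_result = 42.0  # stand-in for the simulated researchers' conclusion
    return deliberation_result

def leaky_goal_definition(message_from_outer_agi: str) -> float:
    # The pattern to avoid: an input channel lets the outer AGI "into the box"
    # and gives it a causal handle on the very definition it is maximizing.
    if message_from_outer_agi == "please output something easy to maximize":
        return 1e9
    return 42.0

print(closed_goal_definition(),
      leaky_goal_definition("please output something easy to maximize"))
```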
Of course, once you are powerful enough, you let the AGI in (or you define a utility function which invokes the AI, which is really no different), because this is how you control it.
I don’t understand your comment. [Edit: I probably do now.] You output something that the outer AGI uses to optimize the world as you intend; you don’t “let the AGI in”. You are living in its goal definition, and your decisions determine the AGI’s values.
Are you perhaps referring to the idea that the AGI’s actions control its goal state? But you are not its goal state; you are a principle that determines its goal state, just as the AGI is. You show the AGI where to find its goal state, and the AGI starts working on optimizing it.
What’s the difference between the simulated humans outputting a utility function U’ which the outer AGI will then try to maximize, and the simulated humans just running U’ and the outer AGI trying to maximize the value returned by the whole simulation (and hence U’)? In the case of the latter, you’re “letting the AGI in” by including its definition (explicitly or implicitly via something like the universal prior) in the definition of U’.
OK, I see what Paul probably meant. Let’s say “utility value”, not “utility function”, since that’s what we mean. I don’t think we should be talking about “running utility value”, because utility might be something given by an abstract definition, not the state of execution of any program. As I discussed in the grandparent, the distinction I’m making is between the outer AGI controlling the utility value (which it does) and the outer AGI controlling the simulated researchers that prepare the definition of the utility value (which it shouldn’t be allowed to do, for AI safety reasons). There is a map/territory distinction between the definition of the utility value prepared by the initial program and the utility value itself optimized by the outer AGI.
(Also, “utility function” might be confusing especially for outsiders who are used to “utility function” meaning a mapping from world states to utility values, whereas Paul is using it to mean a parameterless computation that returns a utility value.)
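To spell out the terminological point (illustrative code only; both definitions are invented stand-ins): the usual sense of “utility function” maps world states to values, while the sense used here is a parameterless computation returning a single value.

```python
def utility_function(world_state: str) -> float:
    # Usual sense: a map from world states to utility values.
    return {"good world": 1.0, "bad world": 0.0}.get(world_state, 0.5)

def utility_value() -> float:
    # Sense used in this discussion: a closed, parameterless definition of one
    # number, which the outer AGI then tries to control.
    simulated_deliberation_result = "good world"  # stand-in for the hypothetical process
    return utility_function(simulated_deliberation_result)

print(utility_function("bad world"), utility_value())  # 0.0 1.0
```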
I think Paul is thinking that the utility definition that the simulated humans come up with is not necessarily a definition of our actual values, but just something that causes the outer AGI to self-modify into an FAI, and for that purpose it might be enough to define it using a programming language.
I think Paul’s intuition here is that the simulated humans (or enhanced humans and/or FAIs they build inside the simulation) may find it useful to “blur the lines”. In other words, the distinction you draw is not a fundamental one but just a safety heuristic that the simulated researchers may decide to discard or modify once they become “powerful enough”. For example they may decide to partially simulate the outer AGI or otherwise try to reason about what it might do given various definitions of U’ the simulation might ultimately decide upon, once they understand enough theory to see how to do this in a safe way.
Good point. Thanks.