AFAIK, designing an AI that cares about features of the environment that it’s not directly measuring is another open problem
Actually, this isn’t entirely an open problem. If the environment is known or mostly known, we can easily define a model of the environment and define a utility function in terms of that model. The problem is that when we expect an AI to build a model of the environment from scratch, we don’t have the model ahead of time to use in the definition of our utility function. We do know what the AI’s measurements will look like, since we define what inputs it gets, so we can define a utility function in terms of those. That is where we run into the problem of having no way to make it care about things that it is not directly measuring.
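To make the contrast concrete, here is a minimal sketch (my own toy example, not anything from the original discussion) of the difference between a utility function defined over a known environment model and one defined only over the AI’s measurements. All of the names (WorldState, utility_over_model, utility_over_observations) are invented for illustration.

```python
# Toy contrast: utility over a known world model vs. utility over raw inputs.
# Every name here is invented for this sketch.
from dataclasses import dataclass
from typing import List

@dataclass
class WorldState:
    """A known environment model: we can refer to the paperclip directly."""
    paperclip_exists: bool
    sensor_reading: float  # what the AI's camera reports

def utility_over_model(state: WorldState) -> float:
    # Easy case: the environment model is known ahead of time,
    # so the utility function can mention the paperclip itself.
    return 1.0 if state.paperclip_exists else 0.0

def utility_over_observations(readings: List[float]) -> float:
    # Hard case: we only know what the inputs look like, so the utility
    # function can only reward patterns in the measurements. An AI
    # maximizing this cares about the readings, not about whatever
    # paperclip might be causing them.
    return sum(1.0 for r in readings if r > 0.5) / max(len(readings), 1)

# The second utility is satisfied just as well by tampering with the
# sensor as by actually keeping a paperclip around.
print(utility_over_model(WorldState(paperclip_exists=True, sensor_reading=0.9)))
print(utility_over_observations([0.9, 0.8, 0.95]))
```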
Is this a bug or a feature?
It may be a lot easier to design a reduced impact AI if you start off with reduced scope. Have it care about the region it’s tasked with, and the boundaries of that region, and then don’t have it worry about the rest. (This is my reading of Stuart_Armstrong’s idea; the Master AI’s job is to write the utility function and boundary conditions for the Disciple AI, which will actually be given actuators and sensors.)
“Don’t worry about the rest” isn’t something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to replace it with more computing power that it can use to better optimize the region that it does care about.
The problem is that when we expect an AI to build a model of the environment from scratch
Is this a wise approach? What does “scratch” mean?
“Don’t worry about the rest” isn’t something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to replace it with more computing power that it can use to better optimize the region that it does care about.
That’s what the boundary conditions are for. A fully formalized version of “don’t trust as valid any computations run outside of your region” seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.
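As a rough illustration only (this is not a formalization of the proposal, and every name in it is invented), here is a toy region-scoped utility in which anything outside the designated region, including the results of computations run there, simply does not count:

```python
# Toy sketch of a region-scoped utility with a crude "boundary condition":
# results of computations performed outside the region are treated as
# untrusted and contribute nothing. All names are invented for this example.
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Coord = Tuple[int, int]

@dataclass
class Computation:
    location: Coord   # where the computation was physically run
    value: float      # how much the agent would otherwise value its result

def region_utility(region: Set[Coord],
                   cell_values: Dict[Coord, float],
                   computations: List[Computation]) -> float:
    # Value from world state: only cells inside the region count.
    state_term = sum(v for c, v in cell_values.items() if c in region)
    # Boundary condition: discard any computation run outside the region.
    trusted_term = sum(comp.value for comp in computations
                       if comp.location in region)
    return state_term + trusted_term

region = {(0, 0), (0, 1), (1, 0), (1, 1)}
cells = {(0, 0): 2.0, (5, 5): 100.0}        # value at (5, 5) is simply invisible
comps = [Computation((0, 1), 3.0), Computation((9, 9), 50.0)]
print(region_utility(region, cells, comps))  # 5.0: outside contributions ignored
```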
Is this a wise approach? What does “scratch” mean?
What I had in mind while writing this was Solomonoff induction. If the AI’s model of the universe could be any computable program, it is hard to detect even a paperclip in that model (and impossible in full generality, due to Rice’s theorem). On LW, the phrase ‘ontological crisis’ is used to refer to the problem of translating a utility function described in terms of one model of the universe into something that can be used in a different, presumably more accurate, model of the universe. The transition from classical physics to quantum mechanics is an illustrative example: why should or shouldn’t our decisions under many worlds be approximately the same as they would be in a classical universe?
As for whether this is a good idea, it seems much harder, if it is possible at all, to build an AI that doesn’t need to navigate such transitions than to build one that can do so.
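Here is a minimal sketch of the translation problem, assuming a deliberately cartoonish pair of ontologies (a definite position versus a distribution over positions); the names and the choice of taking an expectation are my own, not a standard solution:

```python
# Toy "ontological crisis": a utility written for an old (classical) model
# has to be re-expressed in a new model. Translating by taking the
# expectation is one natural choice, but nothing forces it.
from typing import Callable, Dict

def classical_utility(position: int) -> float:
    # Utility defined in the old ontology: "the particle is at cell 3."
    return 1.0 if position == 3 else 0.0

def translate_to_new_model(u_old: Callable[[int], float]
                           ) -> Callable[[Dict[int, float]], float]:
    # One possible translation: expected old-ontology utility under the
    # new model's distribution over old-ontology states.
    def u_new(distribution: Dict[int, float]) -> float:
        return sum(p * u_old(x) for x, p in distribution.items())
    return u_new

u_new = translate_to_new_model(classical_utility)
# In the new model, "the particle" is spread over several positions.
print(u_new({2: 0.1, 3: 0.8, 4: 0.1}))  # 0.8
```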
That’s what the boundary conditions are for. A fully formalized version of “don’t trust as valid any computations run outside of your region” seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.
This still seems very dangerous. If there is a boundary beyond which it has no incentive to preserve anything, I think that at least some things outside of that boundary get destroyed by default. Concretely, what if the AI creates self-replicating nanobots and has some system within its region to prevent them from replicating uncontrollably, but there is no such protection in place in the rest of the universe?