How do you plan to formalize negentropy spent on this goal?
I’m going to guess this is an easier problem than conquering the universe.
If you measure the total negentropy in the universe
Could you? The universe is pretty big.
The approach I would try would depend on the modes the agent has to manipulate reality. If it’s only got one mode, then it seems like you could figure out how negentropy spending cashes out for that mode. But a full agent will have a lot of modes and a way to move negentropy between them, and so stitching together per-mode cost modules may not work the way we want it to.
It does seem like this has issues similar to those of formalizing identity. We want to charge Clippy when it thinks and moves, but not when others think and move; but if Clippy can’t tell the difference between itself and others, then that’ll be really hard to do. (Clippy will probably try to shirk and get others to do its work, but that may be efficient behavior, and it should learn that shirking isn’t effective if it isn’t efficient.)
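To make that concrete, here’s a toy sketch (entirely mine, with made-up numbers and a hypothetical per-agent accounting of negentropy, which is of course the hard part) of the kind of objective I’m imagining:

```python
# Toy sketch, not a real proposal: satisfice on paperclips, charge Clippy for
# the negentropy it spends itself, and ignore what other agents spend. The
# per-agent accounting is assumed to exist, which is exactly the identity
# problem described above.

TARGET = 1_000_000    # "enough" paperclips for a satisficer
PENALTY = 1e-6        # utility lost per unit of negentropy Clippy spends

def satisficer_utility(outcome: dict, agent_id: str) -> float:
    clips = min(outcome["paperclips"], TARGET)               # no credit past the target
    spent = outcome["negentropy_spent"].get(agent_id, 0.0)   # charge only this agent
    return clips - PENALTY * spent

# Clippy making the clips itself vs. shirking and getting Tommy to make them:
made_itself = {"paperclips": 1_000_000,
               "negentropy_spent": {"clippy": 5e9, "tommy": 0.0}}
delegated   = {"paperclips": 1_000_000,
               "negentropy_spent": {"clippy": 1e3, "tommy": 5e9}}
print(satisficer_utility(made_itself, "clippy"))   # lower: Clippy paid the cost itself
print(satisficer_utility(delegated, "clippy"))     # higher: delegation looks nearly free
```

Everything interesting is hidden in how the spending gets attributed to “clippy” rather than to anyone else.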
I’m going to guess this is an easier problem than conquering the universe.
Sure, I’m not asserting anything about how hard it would be to make an AI smart enough to conquer the universe, only about whether it would want to do so.
Could you? The universe is pretty big.
OK, actually measuring it would be tricky. AFAIK, designing an AI that cares about features of the environment that it’s not directly measuring is another open problem, but that’s not specific to satisficers, so I’ll skip it here.
The approach I would try would depend on the modes the agent has to manipulate reality.
Any action whatsoever by the AI will have effects on every particle in its future lightcone. Such effects may be chaotic enough that mere humans can’t optimize them, but that doesn’t make them small.
Is that the kind of thing you meant by a “mode”? If so, how does it help?
We want to charge Clippy when it thinks and moves, but not when others think and move; but if Clippy can’t tell the difference between itself and others, then that’ll be really hard to do.
Right, but we also don’t want to let Clippy off the hook just because there are other agents in the causal chain between it and the paperclips, if Clippy influenced their decisions or desires.
Clippy will probably try to shirk and get others to do its work, but that may be efficient behavior, and it should learn that shirking isn’t effective if it isn’t efficient.
I can’t tell whether you’re asserting that “the efficiency of getting others to do its work” is a factual question that a sufficiently smart AI will automatically answer correctly, or agreeing with me that it’s mostly a values question about what you put in the denominator when defining efficiency?
Would the AI be able to come to a conclusion within those constraints, or might it be snagged by the problem of including the negentropy cost of computing its negentropy cost?
AFAIK, designing an AI that cares about features of the environment that it’s not directly measuring is another open problem
Is this a bug or a feature?
It may be a lot easier to design a reduced impact AI if you start off with reduced scope. Have it care about the region it’s tasked with, and the boundaries of that region, and then don’t have it worry about the rest. (This is my reading of Stuart_Armstrong’s idea; the Master AI’s job is to write the utility function and boundary conditions for the Disciple AI, which will actually be given actuators and sensors.)
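As a rough illustration (mine, not Stuart_Armstrong’s formalism; the region, penalty weight, and cell labels are all invented), “care about the region and its boundary, ignore the rest” could look something like this:

```python
# Rough sketch only: the Disciple's utility reads cells inside its assigned
# region and on that region's boundary, and is blind to everything else.

REGION   = {(x, y) for x in range(10) for y in range(10)}               # task region
BOUNDARY = {(x, y) for (x, y) in REGION if x in (0, 9) or y in (0, 9)}  # its edge

def disciple_utility(world: dict) -> float:
    clips    = sum(1 for cell in REGION if world.get(cell) == "paperclip")
    breaches = sum(1 for cell in BOUNDARY if world.get(cell) == "disturbed")
    return clips - 100.0 * breaches   # heavy penalty for disturbing the boundary

print(disciple_utility({(3, 3): "paperclip", (0, 0): "disturbed"}))  # 1 - 100 = -99.0
```

Cells outside REGION never enter the sum at all, which is where the objection below comes in.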
Right, but we also don’t want to let Clippy off the hook just because there are other agents in the causal chain between it and the paperclips, if Clippy influenced their decisions or desires.
If we let Clippy off the hook for the actions of others, I suspect Clippy will care a lot less about controlling others, and will see them primarily as potential allies (I can get them to do work for cheap if I’m nice!) rather than potential liabilities (if I don’t flood Tommy’s room with deadly neurotoxin, he might spend a lot of his negentropy!). Clippy can also be much simpler: he doesn’t need to model everyone else and determine whether or not they’re involved in the paperclip-manufacturing causal chain.
I can’t tell whether you’re asserting that “the efficiency of getting others to do its work” is a factual question that a sufficiently smart AI will automatically answer correctly
I think it’s a factual question that a sufficiently clever AI will learn the correct answer to from experience, but I also agree with you that the denominator matters. I included it mostly to anticipate the question of how Clippy should interpret the existence and actions of other agents.
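To put a number on why the denominator matters, here’s a toy calculation (all figures invented):

```python
# The same outcome looks efficient or wasteful depending on whose negentropy
# goes in the denominator. Numbers are made up purely for illustration.

clips_made = 1_000_000
spent = {"clippy": 1e3, "tommy": 5e9}   # Clippy talked Tommy into doing the work

own_efficiency   = clips_made / spent["clippy"]      # ~1000 clips per unit: looks great
total_efficiency = clips_made / sum(spent.values())  # ~0.0002 clips per unit: looks awful
print(own_efficiency, total_efficiency)
```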
AFAIK, designing an AI that cares about features of the environment that it’s not directly measuring is another open problem
Actually, this isn’t entirely an open problem. If the environment is known or mostly known, we can write down a model of the environment and define a utility function in terms of that model. The problem is that when we expect an AI to build a model of the environment from scratch, we don’t have the model ahead of time to use in the definition of our utility function. We do know what the AI’s measurements will look like, since we define what inputs it gets, so we can define a utility function in terms of those. But then we have no way of making it care about things that it is not directly measuring.
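A toy contrast, just to illustrate the difference (my own sketch, with invented names like looks_like_paperclip; it is not a standard construction):

```python
# When the environment model is known, we can define utility over model states;
# when it isn't, all we can define it over is the AI's input stream.

def model_based_utility(world_state: list) -> float:
    # the world really is a list of objects, so we can just count the paperclips
    return sum(1 for obj in world_state if obj == "paperclip")

def looks_like_paperclip(frame: str) -> bool:
    # stand-in classifier over raw inputs; a real one would be learned
    return frame == "paperclip-shaped blob"

def observation_based_utility(camera_frames: list) -> float:
    # we only know what the sensor feed looks like, so we reward readings that
    # look like paperclips; the AI is made to care about its inputs,
    # not about the paperclips themselves
    return sum(1 for frame in camera_frames if looks_like_paperclip(frame))

print(model_based_utility(["paperclip", "rock"]))                    # 1
print(observation_based_utility(["paperclip-shaped blob", "sky"]))   # 1
```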
Is this a bug or a feature?
It may be a lot easier to design a reduced impact AI if you start off with reduced scope. Have it care about the region it’s tasked with, and the boundaries of that region, and then don’t have it worry about the rest. (This is my reading of Stuart_Armstrong’s idea; the Master AI’s job is to write the utility function and boundary conditions for the Disciple AI, which will actually be given actuators and sensors.)
“Don’t worry about the rest” isn’t something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to convert that remainder into more computing power that it can use to better optimize the region that it does care about.
The problem is that when we expect an AI to build a model of the environment from scratch
Is this a wise approach? What does “scratch” mean?
“Don’t worry about the rest” isn’t something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to convert that remainder into more computing power that it can use to better optimize the region that it does care about.
That’s what the boundary conditions are for. A fully formalized version of “don’t trust as valid any computations run outside of your region” seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.
Is this a wise approach? What does “scratch” mean?
What I had in mind while writing this was Solomonoff induction. If the AI’s model of the universe could be any computable program, it is hard to detect even a paperclip (impossible in full generality due to Rice’s theorem). On LW, the phrase ‘ontological crisis’ is used to refer to the problem of translating a utility function described in terms of one model of the universe into something that can be used in a different, presumably more accurate, model of the universe. The transition from classical physics to quantum mechanics is an illustrative example; why should or shouldn’t our decisions under many worlds be approximately the same as they would be in a classical universe?
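A bare-bones sketch of that translation problem (mine; the bridge function below is exactly the part an ontological crisis leaves unspecified):

```python
# A utility function written against the old (classical) ontology can only be
# applied to a new ontology through some bridge map, and nothing in the old
# definition tells us which bridge is the right one.

def old_utility(classical_state: dict) -> float:
    # written when the world was modeled as definite objects in definite places
    return classical_state.get("paperclips", 0)

def bridge(new_state: dict) -> dict:
    # one arbitrary choice: weight each branch's paperclip count by its measure
    return {"paperclips": sum(p * cfg.get("paperclips", 0)
                              for cfg, p in new_state["branches"])}

def new_utility(new_state: dict) -> float:
    return old_utility(bridge(new_state))

# Two branches with different paperclip counts, weighted equally:
state = {"branches": [({"paperclips": 10}, 0.5), ({"paperclips": 0}, 0.5)]}
print(new_utility(state))   # 5.0 under this particular bridge, but why this bridge?
```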
As for whether this is a good idea: it seems much harder, if it is possible at all, to build an AI that never needs to navigate such transitions than to build one that can navigate them.
That’s what the boundary conditions are for. A fully formalized version of “don’t trust as valid any computations run outside of your region” seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.
This still seems very dangerous. If there is a boundary beyond which it has no incentive to preserve anything, I think that at least some things outside of that boundary get destroyed by default. Concretely, what if the AI creates self-replicating nanobots and has some system within its region to prevent them from replicating uncontrollably, but there is no such protection in place in the rest of the universe?