The most obvious path to coding reduced impact is to build a satisficer rather than a maximiser—but that proved unlikely to work.
As I commented there: I don’t think you’re using a useful definition for “satisficer,” and I’m troubled by your use of the word “proved.”
If I build a Clippy whose utility function is Num_Paperclips - Negentropy_Cost, then I expect it to increase the number of paperclips until the marginal benefit is lower than the marginal cost. And if I instead use F(Num_Paperclips) - G(Negentropy_Cost), where F is concave and G is convex, then it’s even less likely to go foom, because marginal benefit is penalized and marginal cost is overcounted. Is there a good reason to expect this won’t work?
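(Will comment on the rest of the article later.)

A minimal numeric sketch of the shape of this argument, purely for illustration: the choices F = sqrt, G = quadratic, and a negentropy cost that scales linearly with output are my assumptions, not anything specified in the comment above.

```python
import numpy as np

# Toy utility with concave benefit and convex cost:
#   U(n) = F(n) - G(cost(n)),  F = sqrt (concave), G = square (convex).
def utility(n_clips, cost_per_clip=0.01):
    negentropy_cost = cost_per_clip * n_clips        # assumed: cost scales with output
    return np.sqrt(n_clips) - negentropy_cost ** 2   # F(n) - G(cost)

n = np.arange(0, 10_000)
best = n[np.argmax(utility(n))]
print(best, utility(best))
# Utility peaks at a finite paperclip count (~184 with these constants): past that
# point the convex cost term grows faster than the concave benefit term, so an
# agent maximizing this function stops on its own.
```

Whether that picture survives contact with reality turns on what Negentropy_Cost is actually measured over, which is exactly what the replies below press on.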
How do you plan to formalize negentropy spent on this goal? If you measure the total negentropy in the universe, then such a Clippy will indeed stop making paperclips at some point, but it will also take over the universe to prevent anyone else from using up negentropy. If you measure only the negentropy in some limited system that you intended Clippy to draw from, then that just gives Clippy an incentive to steal resources from somewhere else; or equivalently, to look for actions that have a side effect of causing other agents elsewhere to build paperclips out of resources that Clippy isn’t being billed for. Am I totally missing some simple third option?
How do you plan to formalize negentropy spent on this goal?
I’m going to guess this is an easier problem than conquering the universe.
If you measure the total negentropy in the universe
Could you? The universe is pretty big.
The approach I would try would depend on the modes the agent has to manipulate reality. If it’s only got one mode, then it seems like you could figure out how that cashes out for that mode. But a full agent will have a lot of modes and a way to move negentropy between those modes, and so putting together many modules may not work the way we want it to.
It does seem like this has similar issues to formalizing identity. We want to charge Clippy when it thinks and moves, but not when others think and move; but if Clippy can’t tell the difference between itself and others, then that’ll be really hard to do. (Clippy will probably try to shirk and get others to do its work, but that may be efficient behavior, and it should learn that it’s not effective if it’s not efficient.)
I’m going to guess this is an easier problem than conquering the universe.
Sure, I’m not asserting anything about how hard it would be to make an AI smart enough to conquer the universe, only about whether it would want to do so.
Could you? The universe is pretty big.
OK, actually measuring it would be tricky. AFAIK, designing an AI that cares about features of the environment that it’s not directly measuring is another open problem, but that’s not specific to satisficers, so I’ll skip it here.
The approach I would try would depend on the modes the agent has to manipulate reality.
Any action whatsoever by the AI will have effects on every particle in its future lightcone. Such effects may be chaotic enough that mere humans can’t optimize them, but that doesn’t make them small.
Is that the kind of thing you meant by a “mode”? If so, how does it help?
We want to charge Clippy when it thinks and moves, but not when others think and move; but if Clippy can’t tell the difference between itself and others, then that’ll be really hard to do.
Right, but we also don’t want to let Clippy off the hook just because there are other agents in the causal chain between it and the paperclips, if Clippy influenced their decisions or desires.
Clippy will probably try to shirk and get others to do its work, but that may be efficient behavior, and it should learn that it’s not effective if it’s not efficient.
I can’t tell whether you’re asserting that “the efficiency of getting others to do its work” is a factual question that a sufficiently smart AI will automatically answer correctly, or agreeing with me that it’s mostly a values question about what you put in the denominator when defining efficiency?
Would the AI be able to come to a conclusion within those constraints, or might it be snagged by the problem of including the negentropy cost of computing its negentropy cost?
AFAIK, designing an AI that cares about features of the environment that it’s not directly measuring is another open problem
Is this a bug or a feature?
It may be a lot easier to design a reduced impact AI if you start off with reduced scope. Have it care about the region it’s tasked with, and the boundaries of that region, and then don’t have it worry about the rest. (This is my reading of Stuart_Armstrong’s idea; the Master AI’s job is to write the utility function and boundary conditions for the Disciple AI, which will actually be given actuators and sensors.)
Right, but we also don’t want to let Clippy off the hook just because there are other agents in the causal chain between it and the paperclips, if Clippy influenced their decisions or desires.
If we let Clippy off the hook for the actions of others, I suspect Clippy will care a lot less about controlling others, and see them primarily as potential allies (I can get them to do work for cheap if I’m nice!) rather than potential liabilities (if I don’t flood Tommy’s room with deadly neurotoxin, he might spend a lot of his negentropy!). Clippy can also be much simpler: he doesn’t need to model everyone else and determine whether or not they’re involved in the paperclip-manufacturing causal chain.
I can’t tell whether you’re asserting that “the efficiency of getting others to do its work” is a factual question that a sufficiently smart AI will automatically answer correctly
I think it’s a factual question that a sufficiently clever AI will learn the correct answer to from experience, but I also agree with you that the denominator matters. I included it mostly to anticipate the question of how Clippy should interpret the existence and actions of other agents.
AFAIK, designing an AI that cares about features of the environment that it’s not directly measuring is another open problem
Actually, this isn’t entirely an open problem. If the environment is known or mostly known, we can easily define a model of the environment and define a utility function in terms of that model. The problem is that when we expect an AI to build a model of the environment from scratch, we don’t have the model ahead of time to use in the definition of our utility function. We do know what the AI’s measurements will look like, since we define what inputs it gets, so we can define a utility function in terms of those. That is where we run into the problem that we have no way of making it care about things it is not directly measuring.
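A toy sketch of the contrast being drawn here; the types and the pixel-average “detector” are my own illustrative stand-ins, not anything proposed in the thread.

```python
from dataclasses import dataclass

@dataclass
class WorldModel:
    """An environment model we happen to know ahead of time."""
    paperclips: int
    negentropy_used: float

def utility_over_model(world: WorldModel) -> float:
    # Easy case: the ontology is fixed in advance, so "paperclips" is a field
    # the utility function can point at directly.
    return world.paperclips - world.negentropy_used

def utility_over_observations(camera_pixels: list[float]) -> float:
    # Hard case: if the agent must build its own model from scratch, all we can
    # anchor the utility function to up front is its input stream, which rewards
    # inputs that merely look paperclip-ish rather than paperclips in the world.
    return sum(camera_pixels) / len(camera_pixels)   # crude stand-in "detector"
```

The second function is perfectly well defined; it just isn’t about the thing we care about, which is the problem being described.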
Is this a bug or a feature?
It may be a lot easier to design a reduced impact AI if you start off with reduced scope. Have it care about the region it’s tasked with, and the boundaries of that region, and then don’t have it worry about the rest. (This is my reading of Stuart_Armstrong’s idea; the Master AI’s job is to write the utility function and boundary conditions for the Disciple AI, which will actually be given actuators and sensors.)
“Don’t worry about the rest” isn’t something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to replace that remainder with more computing power that it can use to better optimize the region that it does care about.
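A toy way to see the indifference being pointed at; the one-dimensional “world” and the hard-coded region are my own construction, just for concreteness.

```python
# World = a list of cells; this utility function only ever reads cells 0..9.
REGION = range(0, 10)

def region_utility(world: list[int]) -> int:
    return sum(world[i] for i in REGION)

intact   = [1] * 10 + [1] * 90   # everything outside the region left alone
stripped = [1] * 10 + [0] * 90   # everything outside the region dismantled

assert region_utility(intact) == region_utility(stripped)
# The function is exactly indifferent between these two worlds, so nothing in it
# pushes back against converting everything past the boundary into computing power.
```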
The problem is that when we expect an AI to build a model of the environment from scratch
Is this a wise approach? What does “scratch” mean?
“Don’t worry about the rest” isn’t something we want an AI to do. If its utility function makes no explicit reference to the rest of the universe, it has no incentive not to replace that remainder with more computing power that it can use to better optimize the region that it does care about.
That’s what the boundary conditions are for. A fully formalized version of “don’t trust as valid any computations run outside of your region” seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.
Is this a wise approach? What does “scratch” mean?
What I had in mind while writing this was Solomonoff induction. If the AI’s model of the universe could be any computable program, it is hard to detect even a paperclip (impossible in full generality due to Rice’s theorem). On LW, the phrase ‘ontological crisis’ is used to refer to the problem of translating a utility function described in terms of one model of the universe into something that can be used in a different, presumably more accurate, model of the universe. The transition from classical physics to quantum mechanics is an illustrative example; why should or shouldn’t our decisions under many worlds be approximately the same as they would be in a classical universe?
As for whether this is a good idea: it seems much harder, if it is possible at all, to build an AI that doesn’t need to navigate such transitions than to build one that can.
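A cartoon of such a transition, with dictionaries standing in for the two world-models; this framing is mine and is far cruder than anything Solomonoff-induction-shaped.

```python
def utility(world_model: dict) -> float:
    # Written against the old ontology, where "paperclip" is a primitive.
    return float(world_model["paperclip_count"])

old_model = {"paperclip_count": 3}        # classical-style inventory of objects
new_model = {"amplitudes": [0.6, 0.8]}    # more accurate model, different ontology

print(utility(old_model))   # 3.0
# utility(new_model) would raise KeyError: nothing in the new ontology is labelled
# "paperclip_count", so the utility function has to be translated, not just reused.
```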
That’s what the boundary conditions are for. A fully formalized version of “don’t trust as valid any computations run outside of your region” seems like the easiest way to disincentivize the AI from trying to run computations in the rest of the universe.
This still seems very dangerous. If there is a boundary beyond which it has no incentive to preserve anything, I think that at least some things outside of that boundary get destroyed by default. Concretely, what if the AI creates self-replicating nanobots and has some system within its region to prevent them from replicating uncontrollably, but there is no such protection in place in the rest of the universe?
You probably want to do something to escape those underscores.
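Thanks, I noticed and fixed that.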
I’m troubled by your use of the word “proved.”
That’s a standard sense of the word ‘proved’, which is usually identifiable by its lack of a direct object. It just means that something turned out that way, or the evidence points that way.
It just means that something turned out that way, or the evidence points that way.
My complaint is twofold: first, I don’t think the evidence points that way, and second, I would prefer them saying the evidence pointed that way to them using a stronger phrase.
I would prefer them saying the evidence pointed that way to them using a stronger phrase.
But that’s not what that means; it’s not very strong. If I say, “My search proved fruitful”, then I’m not saying anything particularly strong, just that I found something. Saying “that proved unlikely to work” just means “based on [something], I’ve observed that it’s unlikely to work”. That [something] can be a search, some research, an experiment, or anything of that sort.
Note that this sense of “proved” does not even need to imply a particular conclusion—“The experiment proved inconclusive”.
This is more similar to the use of “proof” in baking or alcohol than the use of “proof” in geometry or logic.