Unfortunately its goal unpacking mechanism leads it to the CERTAIN conclusion that it must use a flamethrower to do this. The predicted consequence, however, is that the picked strawberries will be just smears of charcoal when they are delivered to the kitchen. Here is the thing: the AI has background knowledge about everything in the world, including strawberries, and it also hears the protests from the people in the kitchen when it says it is going to use the flamethrower. There is massive evidence, coming from all that external information, that the plan is just wrong, regardless of how certain its planning mechanism said it was.
So, in order for the flamethrower to be the right approach, the goal needs to be something like “separate the strawberries from the plants and place them in the kitchen,” but that won’t quite work: why is it better to use a flamethrower than to pick them normally, cut them off, and so on? One of the benefits of the Maverick Nanny or the Smiley Tiling Berserker as examples is that they obviously are trying to maximize the stated goal. I’m not sure you’re going to get the right intuitions about an agent that’s surprisingly clever if you’re working off an example that doesn’t look surprisingly clever.
So, the Gardener AI gets that task, comes up with a plan, and says “Alright! Warming up the flamethrower!” The chef says “No, don’t! I should have been more specific!”
Here is where the assumptions come into play. If we assume that the Gardener AI executes tasks, then even though the Gardener AI understands that the chef has made a terrible mistake, and that’s terrible for the chef, that doesn’t stop the Gardener AI from having a job to do, and doing it. If we assume that the Gardener AI is designed to figure out what the chef wants, and then do what they want, then knowing that the chef has made a terrible mistake is interesting information to the Gardener AI. In order to say that the plan is “wrong,” we need to have a metric by which we determine wrongness. If it’s the task-completion-nature, then the flamethrower plan might not be task-completion-wrong!
Even without feedback from the chef, we can just use other info the AI plausibly has. In the strawberry example, the AI might know that kitchens are where cooking happens, and that when strawberries are used in cooking, the desired state is generally “fresh,” not “burned,” and the temperature involved in cooking them is mild, and so on and so on. And so if asked to speculate about the chef’s motives, the AI might guess that the chef wants strawberries in order to use them in food, and thus the chef would be most satisfied with fresh and unburnt strawberries.
But whether or not the AI takes its speculations about the chef’s motives into account when planning is a feature of the AI, and by default, it is not included. If it is included, it’s nontrivial to do it correctly—this is the “if you care about your programmer’s mental states, and those mental states physically exist and can be edited directly, why not just edit them directly?” problem.
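To make the “feature of the AI, not included by default” point concrete, here is a minimal toy sketch; every name and value in it is an illustrative assumption, not anyone’s actual design. It just ranks the same two candidate plans under a task-completion-only objective and under an objective that also scores the inferred intent of the chef.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    completes_task: bool           # strawberries end up in the kitchen
    matches_inferred_intent: bool  # strawberries are still usable for cooking

# Two toy candidate plans for the strawberry-picking task.
CANDIDATES = [
    Plan("flamethrower", completes_task=True, matches_inferred_intent=False),
    Plan("pick by hand", completes_task=True, matches_inferred_intent=True),
]

def task_only_score(plan: Plan) -> int:
    # The "executes tasks" assumption: only task completion counts.
    return int(plan.completes_task)

def intent_aware_score(plan: Plan) -> int:
    # The "figure out what the chef wants" assumption: the inferred intent is
    # an explicit term in the objective, so charred strawberries lose points.
    return int(plan.completes_task) + int(plan.matches_inferred_intent)

if __name__ == "__main__":
    # Under task-only scoring the two plans tie, so nothing rules out the flamethrower.
    print(max(CANDIDATES, key=task_only_score).name)     # -> "flamethrower"
    # Under intent-aware scoring, hand-picking wins outright.
    print(max(CANDIDATES, key=intent_aware_score).name)  # -> "pick by hand"
```

The point of the sketch is only that the second scoring function has to be put there by somebody; it is not something the first one turns into on its own.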
About the first part of what you say: veeeeerryy tricky.

I agree that I didn’t spend much time coming up with the strawberry-picking-by-flamethrower example. So, yes, it is not very accurate (I only really wanted a quick and dirty example that was different).
But but but. Is the argument going to depend on my picking a better example, one in which I can write down the “twisted rationale” that the AI deploys to come up with its plan? Surely the only important thing is that the AI does, somehow, go through a twisted rationale; the particular details of that twisted rationale are not supposed to matter.
(Imagine that I gave Muehlhauser a list of the ways in which the logical reasoning behind the dopamine drip plan is so ludicrous that even the simplest AI planner of today would never make THAT mistake. He would just tell me that I was missing the point, because this is supposed to be an IN PRINCIPLE argument in which the dopamine drip plan stands for some twisted rationale that is non-trivial to get around. From that point of view, the actual example is less important than the principle.)
Now to the second part.
The problem I have with everything you wrote after
Here is where the assumptions come into play....
is that you have started to go back to talking about the particulars of the AI’s planning mechanism once again, losing sight of the core of the argument I gave in the paper, which is one level above that.
However, you also say some things about the AI’s planning mechanism that I consider wrong, so now I am tempted to reply on both levels. Ah well, at the risk of confusing things I will reply on both levels, trying to separate them as much as possible.
Level One (Regarding the design of the AI’s planning/goal/motivation engine).
You say:
In order to say that the plan is “wrong,” we need to have a metric by which we determine wrongness. If it’s the task-completion-nature, then the flamethrower plan might not be task-completion-wrong!
One thing I have said many many times now is that there is no problem at all finding a metric for “wrongness” of the plan, because there is a background-knowledge context that is screaming “Inconsistent with everything I know about the terms mentioned in the goal statement!!!!”, and there is also a group of humans screaming “We believe that this is inconsistent with our understanding of the goal statement!!!”
I don’t need to do anything else to find a metric for wrongness, and since the very first draft of the paper that concept has been crystal clear. I don’t need to invoke anything else: no appeal to magic, no appeal to telepathy on the part of the AI, no appeal to fiendishly difficult programming inside the AI, no appeal to the idea that the programmers have to nail down every conceivable way that their intentions might be misread. All I have to do is appeal to easily available context, and my work is done. The wrongness metric has been signed, sealed and delivered all this time.
You hint that the need for “task completion” might be so important to the AI that this could override all other evidence that the plan is wrong. No way. That comes under the heading of a joker that you pulled out of your sleeve :-), in much the same way that Yudkowsky and others have tried to pull the “efficiency” joker out of their sleeves, from nowhere, and imply that this joker could for some reason trump everything else. If there is a slew of evidence coming from context that the plan will lead to consequences that are inconsistent with everything known about the concepts mentioned in the goal statement, then the plan is “wrong,” and tiny considerations, such as the fact that task completion would be successful, are just insignificant.
You go on to suggest that the AI planning mechanism might not take the chef’s motives into account by default, and that it would be nontrivial to do so correctly. All of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff, and all that is required is a sanity check that says “Does the plan seem to be generally consistent with the largest-context understanding of the world, as it relates to the concepts in the goal statement?” and we’re done. All wrapped up.
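For what it is worth, the sanity check being described can be sketched in a few lines, under the (large) assumption that the predicted consequences of a plan and the contextual expectations attached to the goal concepts are both available as explicit data. Every name below is a hypothetical placeholder; the only point is the control flow, in which background knowledge and the protests of the humans each get a veto.

```python
# Toy illustration only: the "wrongness metric from context" as a veto gate.
# BACKGROUND_EXPECTATIONS stands in for what the AI's background knowledge
# associates with the goal concept "strawberries delivered to a kitchen";
# the humans' protest is modelled as a simple flag.

BACKGROUND_EXPECTATIONS = {"edible", "intact", "fresh"}

PLANS = {
    # plan name -> (predicted properties of the delivered strawberries, do the humans object?)
    "pick by hand": ({"edible", "intact", "fresh"}, False),
    "flamethrower": ({"charred"}, True),
}

def consistent_with_context(predicted: set) -> bool:
    # "Does the plan seem generally consistent with the largest-context
    # understanding of the concepts in the goal statement?"
    return BACKGROUND_EXPECTATIONS.issubset(predicted)

for name, (predicted, humans_object) in PLANS.items():
    if not consistent_with_context(predicted):
        verdict = "defer: inconsistent with background knowledge"
    elif humans_object:
        verdict = "defer: the people in the kitchen are objecting"
    else:
        verdict = "go ahead"
    print(f"{name}: {verdict}")
```

Whether filling in the placeholders (how consequences are predicted, how contextual expectations are represented) is as easy as the paragraph above suggests is, of course, exactly the point in dispute.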
Level Two (The DLI)
None of the details of what I just said really need to be said, because the DLI is not about trying to get the motivation engine programmed so well that it covers all bases. It is about what happens inside the AI when it considers context, and THEN asks itself questions about its own design.
And here, I have to say that I am not getting substantial discussion about what I actually argued in the paper. The passage of mine that you were addressing, above, was supposed to be a clarification prompted by someone else’s lack of focus on the DLI. But it didn’t work.
The DLI is about the fact that the AI has all that evidence that its plans are leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. And yet the AI is designed to go ahead anyway. If it DOES go ahead it is obeying the DLI. But at the same time it knows that it is fallible and that this fallibility is what is leading to actions that are grossly inconsistent with the larger meaning of the concepts in the goal statement. That conflict is important, and yet no one wants to go there and talk about it.
I have to say that I am not getting substantial discussion about what I actually argued in the paper.
The first reason seems to be clarity. I didn’t get what your primary point was until recently, even after carefully reading the paper. (Going back to the section on the DLI: context, goals, and values aren’t mentioned until the sixth paragraph, and even then only implicitly!)
The second reason seems to be that there’s not much to discuss, with regards to the disagreement. Consider this portion of the parent comment:
You go on to suggest that the AI planning mechanism might not take the chef’s motives into account by default, and that it would be nontrivial to do so correctly. All of that is irrelevant in the light of the fact that this is a superintelligence, and taking context into account is the bread and butter of a superintelligence. It can easily do that stuff
I think my division between cleverness and wisdom at the end of this long comment clarifies this issue. Taking context into account is not necessarily the bread and butter of a clever system; many fiendishly clever systems just manipulate mathematical objects without paying any attention to context, and those satisfy human goals only because the correct mathematical objects have been carefully selected for them to manipulate. But I agree with you that taking context into account is the bread and butter of a wise system. There’s no way for a wise system to manipulate conceptual objects without paying attention to context, because context is a huge part of concepts.
It seems like everyone involved agrees that a human-aligned superwisdom is safe, even if it’s also superclever: as Ged muses about Ogion in A Wizard of Earthsea, “What good is power when you’re too wise to use it?”
Which brings us to:
That conflict is important, and yet no one wants to go there and talk about it.
I restate the conflict this way: an AI that misinterprets what its creators meant for it to do is not superwise. Once we’ve defined wisdom appropriately, I think everyone involved would agree with that, and would agree that talking about a superwise AI that misinterprets what its creators meant for it to do is incoherent.
But… I don’t see why that’s a conflict, or important. The point of MIRI is to figure out how to develop human-aligned superwisdom before someone develops supercleverness without superwisdom, or superwisdom without human-alignment.
The main conflicts seem to be that MIRI is quick to point out that specific designs aren’t superwise, and that MIRI argues that AI designs in general aren’t superwise by default. But I don’t see how stating that there is inherent wisdom in AI by virtue of it being a superintelligence is a meaningful response to their assumption that there is no inherent wisdom in AI except for whatever wisdom has been deliberately designed. That’s why they care so much about deliberately designing wisdom!