Tools come up with plans to maximize some utility measure P, but they don’t actually have any external criteria of optimality.
What’s the distinction between “external” optimality criteria and the kind that describes the way Tool AIs choose their output among all possible outputs? (A possible response is that Tool AIs are not themselves running a consequentialist algorithm, which would make it harder to stipulate the nature of their optimization power.)
Well, my understanding is that when a Tool AI makes a list of the best plans according to P, and an Oracle AI chooses an output maximizing U, the Oracle cares about something other than “giving the right answer to this question”—it cares about “answering questions” in general, or whatever, something that gives it a motive to manipulate things outside of the realm of the particular question under consideration.
The “external” distinction is that the Oracle potentially gets utility from something persistent and external to the question. Basically, it’s an explicit utility maximizer, and that causes problems. This is just my understanding of the arguments, though; I’m not sure whether the distinction is coherent in the final working!
Edit: And in fact, a Tool isn’t trying to produce output that maximizes P! It doesn’t care about that. It just cares about correctly reporting the plans that give the highest values for P.
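To make that contrast concrete, here is a minimal sketch of how I am reading it. It is in Python, and every name in it (P, U, candidate_plans, predict_if_implemented, predict_world_given_output) is a placeholder I am making up for illustration, not anything from the original posts: the Tool just reports the plans that rank highest under P, while the Oracle chooses whichever output maximizes U over the resulting world.

```python
# Hedged sketch of the Tool / Oracle contrast as I understand it.
# All names here are hypothetical placeholders.

def tool_report(candidate_plans, P, predict_if_implemented, k=5):
    """Tool-style: rank plans by how well they would score *if implemented*,
    then simply report the top k. The act of reporting is not itself
    something the Tool is optimizing over."""
    scored = [(P(predict_if_implemented(plan)), plan) for plan in candidate_plans]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [plan for _, plan in scored[:k]]

def oracle_answer(candidate_outputs, U, predict_world_given_output):
    """Oracle-style: choose whichever output leads to the highest-utility
    world state, so the output's side effects (on the operators, on future
    questions, and so on) enter into the choice."""
    return max(candidate_outputs,
               key=lambda out: U(predict_world_given_output(out)))
```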
It just cares about correctly reporting the plans that give the highest values for P.
This is what I meant by “not running a consequentialist algorithm”: what matters here is the way in which P depends on a plan.
If P says something about how human operators would respond to observing the plan, that introduces a consequentialist aspect into the AI’s optimization criteria: it starts to matter what the consequences of producing a plan are, and the plan’s value depends on the effect produced by choosing it. On the other hand, if P doesn’t say things like that, it might be the case that the value of a plan is not being evaluated consequentialistically, but that might make it more difficult to specify what constitutes a good plan, since a plan’s (expected) consequences give a natural (basis for a) metric of its quality.
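A hedged sketch of the two readings of P, with every function below a hypothetical placeholder: the first scores a plan only via what would happen if it were carried out, while the second scores it via how the operators react to being shown it, which is the consequentialist dependence described above.

```python
# Both scoring functions are made-up placeholders illustrating the two ways
# P might depend on a plan; neither is anyone's actual proposal.

def P_via_hypothetical_implementation(plan, simulate_implementation, evaluate_state):
    """Non-consequentialist reading: the plan is scored on the state that
    would result if it were carried out, ignoring any effect that merely
    producing or displaying the plan has on the world."""
    return evaluate_state(simulate_implementation(plan))

def P_via_operator_reaction(plan, predict_reaction_to_plan, evaluate_state):
    """Consequentialist reading: the plan is scored on what the world looks
    like after the operators observe it, so its value depends on the effect
    of choosing to output it."""
    return evaluate_state(predict_reaction_to_plan(plan))
```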
Hm. This is an intriguing point. I thought that by “maximize the actual outcome according to its own criteria of optimality” you meant U, which is my understanding of what an Oracle would do, but instead you meant it would produce plans so as to maximize P, rather than producing plans that would maximize P if implemented. Is that about right?
I guess you’d have to produce some list of plans such that each would produce a high value for P if selected (which includes an expectation that it would be successfully implemented), given that it appears on the list and all the other plans do as well… you wouldn’t necessarily have to worry about other influences the plan list might have, would you?
Perhaps if we had a more concrete example:
Suppose we ask the AI to advise us on building a sturdy bridge over some river (valuing both sturdiness and bridgeness, and probably other things like speed of building, etc.). Stuart_Armstrong’s version would select a list of plans such that, given that the operators will view that list, if they select one of the plans then the AI predicts that they will successfully build a sturdy bridge (or that a sturdy bridge will otherwise come into being). I admit I find the subject a little confusing, but does that sound about right?
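If it helps, here is how I would caricature that version in code. This is only a sketch of my own reading of it, and every name in it (candidate_lists, predict_outcome_given, score, threshold) is something I am inventing for illustration, not Stuart_Armstrong’s actual proposal.

```python
# Rough sketch of the bridge example as I understand it: a plan list is
# acceptable if, conditional on the operators being shown that exact list
# and then selecting any plan from it, the predicted outcome scores well on
# sturdiness, bridgeness, speed of building, and so on.
# All names here are hypothetical placeholders.

def list_is_acceptable(plan_list, predict_outcome_given, score, threshold):
    # predict_outcome_given(plan, plan_list) should model what happens if
    # the operators view plan_list and then choose `plan` from it.
    return all(score(predict_outcome_given(plan, plan_list)) >= threshold
               for plan in plan_list)

def select_plan_list(candidate_lists, predict_outcome_given, score, threshold):
    # Return the first candidate list that is acceptable in the above sense.
    # Note that this check deliberately ignores any *other* influence that
    # showing the list might have, which is the worry raised above.
    for plan_list in candidate_lists:
        if list_is_acceptable(plan_list, predict_outcome_given, score, threshold):
            return plan_list
    return None
```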