If it can understand “I have had little effect on the world”, it can understand “I am doing good for humanity”. A “safe” utility function would be no easier and less desirable than a Friendly one.
No easier? There’s a lot of hidden content in “effect on the world”, but presumably not all of Fun Theory, the entire definition of “person”, etc. (or shorter descriptions that unfold into these things). An Oracle AI that worked for humans would probably work just as well for Babyeaters or Superhappies (in terms of not automatically destroying things they value; obviously, it’d make alien assumptions about cognitive style, concepts, etc.).
I agree with that much, but the question is whether there’s enough hidden content to force development of a general theory of “learning what the programmers actually meant” that would be sufficient unto full-scale FAI, or sufficient given 20% more work.
Does moving a few ounces of matter from one location to another count as a significant “effect on the world”?
Does it matter to you whether that matter is taken from 1) a vital component of the detonator on a bomb in a densely populated area or 2) the frontal lobe of your brain?
If it does matter to you, how do you propose to explain the difference to an AI?
Does moving a few ounces of matter from one location to another count as a significant “effect on the world”?
In general, yes; you can and should make it far more conservative here than your actual preferences would require, and give it a principle implying that both your (1) and (2) are Very Bad.
But, the waste heat from its computation will move at least a few ounces of air.
Maybe you can get around this by having it not worry (so to speak) about effects other than through I/O, but this is unsafe if it can use channels you didn’t think of to deliberately influence the world. Certainly other problems, too – but (it seems to me) problems that have to be solved anyway to implement CEV, which is sort of a special case of Oracle AI.
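To make that loophole concrete, here is a minimal Python sketch (my own illustration, not anything proposed in the thread) of an impact check that only counts effects routed through declared I/O channels; the Effect type, the channel names, and the threshold are all invented for the example. Anything the designers did not think to declare simply never enters the sum.

```python
from dataclasses import dataclass

@dataclass
class Effect:
    channel: str      # e.g. "text_out", "waste_heat", "rf_emission"
    magnitude: float  # some scalar measure of how much the world gets moved

def io_only_impact(effects, declared_channels):
    """Count only the effects routed through channels the designers declared."""
    return sum(e.magnitude for e in effects if e.channel in declared_channels)

def is_permitted(effects, declared_channels, threshold=1.0):
    # Effects on undeclared channels never enter the sum, so a deliberate
    # influence through an unforeseen side channel passes the check --
    # exactly the unsafety described above.
    return io_only_impact(effects, declared_channels) <= threshold

# A large effect smuggled through an unmodeled channel is invisible to the check.
plan = [Effect("text_out", 0.2), Effect("rf_emission", 50.0)]
print(is_permitted(plan, declared_channels={"text_out"}))  # True
```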
But, the waste heat from its computation will move at least a few ounces of air.
Quite so. The waste heat, of course, has very little thermodynamically significant direct impact on the rest of the world—but by the same token, removing someone’s frontal lobe or not has a smaller, more indirect impact on the world than preventing the bomb from detonating or not.
Now, suppose the AI’s grasp of causal structure is sufficient that it will indeed only take actions that truly have minimal impact relative to nonaction; in that case it will be unable to communicate with humans in any way expected to significantly change their future behavior, making it a singularly useless oracle.
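A toy numerical sketch of that failure mode (again mine, with a made-up penalty weight and made-up divergence numbers): if the oracle scores actions by usefulness minus a penalty for how far the resulting future diverges from the do-nothing baseline, silence dominates any answer the asker would actually act on.

```python
# Score actions as usefulness minus a penalty for divergence from "do nothing".
def score(usefulness, divergence_from_noop, lam=10.0):
    return usefulness - lam * divergence_from_noop

actions = {
    # (usefulness to the asker, how far the future diverges from nonaction)
    "stay silent":            (0.0, 0.0),
    "give an evasive answer": (0.1, 0.3),
    "give a useful answer":   (1.0, 5.0),  # the asker acts on it, so the future shifts
}

best = max(actions, key=lambda a: score(*actions[a]))
print(best)  # "stay silent": a strict impact-vs-nonaction criterion prefers saying nothing
```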
My intuition here is that the insights required for any specification of which causal results of action are acceptable are roughly equivalent to what is needed to specify something like CEV (i.e., essentially what Warrigal said above), in that both require that the AI be able to figure out what people actually want, not what they say they want. If you’ve done it right, you don’t need additional safeguards such as preventing significant effects; if you’ve done it wrong, you’re probably screwed anyway.