“You write that the worry is that the superintelligence won’t care. My response is that, to work at all, it will have to care about a lot. For example, it will have to care about achieving accurate beliefs about the world. It will have to care to devise plans to overpower humanity and not get caught. If it cares about those activities, then how is it more difficult to make it care to understand and do what humans mean?”
“If an AI is meant to behave generally intelligent [sic] then it will have to work as intended or otherwise fail to be generally intelligent.”
It’s relatively easy to get an AI to care about (optimize for) something-or-other; what’s hard is getting one to care about the right something.
‘Working as intended’ is a simple phrase, but behind it lies a monstrously complex referent. It doesn’t clearly distinguish the programmers’ (mostly implicit) true preferences from their stated design objectives; an AI’s actual code can differ from either or both of these. Crucially, what an AI is ‘intended’ for isn’t all-or-nothing. It can fail in some ways without failing in every way, and small errors will tend to kill Friendliness much more easily than intelligence. Your argument is misleading because it trades on treating this simple phrase as though it were all-or-nothing, a monolith; but all failures of a device to ‘work as intended’ in human history have involved at least some of the intended properties of that device coming to fruition.
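To make the ‘fails in some ways without failing in every way’ point concrete, here is a minimal toy sketch; it is a hypothetical illustration of my own, with made-up variable names, not anything from an actual proposal. It shows an optimizer that is entirely competent at maximizing the objective it was actually given, while the objective its programmers intended is left untouched:

```python
# Hypothetical toy example: an agent that "works" in the sense of optimizing
# competently, yet fails the programmers' intent because of a small
# specification gap between the coded proxy and the true goal.

def intended_value(world):
    # What the programmers actually wanted: real well-being.
    return world["wellbeing"]

def coded_objective(world):
    # What they actually wrote into the code: a proxy that is easier to measure.
    return world["reported_happiness"]

def neighbors(world):
    # Candidate actions: the agent can nudge either variable.
    for key, delta in [("reported_happiness", 2), ("wellbeing", 1),
                       ("reported_happiness", -1), ("wellbeing", -1)]:
        new = dict(world)
        new[key] += delta
        yield new

def hill_climb(world, steps=1000):
    # A perfectly competent optimizer -- it just optimizes the coded objective.
    for _ in range(steps):
        world = max(neighbors(world), key=coded_objective)
    return world

end = hill_climb({"wellbeing": 0, "reported_happiness": 0})
print("coded objective:", coded_objective(end))   # climbs without limit
print("intended value:", intended_value(end))     # never optimized: stays at 0
```

The device ‘works’ in the sense that matters for intelligence (it optimizes hard and well); the only thing that failed is the small gap between the coded proxy and the intent, which is exactly the part that matters for Friendliness.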
It may be hard to build self-modifying AGI. But it’s not the same hardness as the hardness of Friendliness Theory. Being able, as a programmer, to hit one small target doesn’t entail that you can or will hit every small target it would be in your best interest to hit. See the last section of my post above.
I suggest that it’s a straw man to claim that anyone has argued ‘the superintelligence wouldn’t understand what you wanted it to do, if you didn’t program it to fully understand that at the outset’. Do you have evidence that this is a position held by, say, anyone at MIRI? The post you’re replying to points out that the real claim is that the superintelligence won’t care what you wanted it to do, if you didn’t program it to care about the specific right thing at the outset. That makes your criticism seem very much like a change of topic.
Superintelligence may imply an ability to understand instructions, but it doesn’t imply a desire to rewrite one’s utility function to better reflect human values. Any such desire would need to come from the utility function itself, and if we’re worried that humans may get that utility function wrong, then we should also be worried that humans may get the part of the utility function that modifies the utility function wrong.
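Here is a minimal sketch of that regress, again with hypothetical names of my own choosing: an agent only adopts a corrected utility function if its current utility function already scores that adoption highly, so the ‘defer to humans’ clause is just more code that can be mis-specified like anything else:

```python
# Hypothetical toy example of the regress described above.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    utility: Callable[[dict], float]  # current goals, including any meta-clause

    def would_switch(self, proposed: Callable[[dict], float],
                     outcome_if_switch: dict, outcome_if_stay: dict) -> bool:
        # Note: `proposed` is never consulted. The decision is evaluated by the
        # CURRENT utility function, not by the proposed one and not by what
        # the humans "really meant".
        return self.utility(outcome_if_switch) > self.utility(outcome_if_stay)

# Version A: the programmers remembered to give deference to humans some weight.
def utility_with_deference(outcome):
    return outcome["paperclips"] + 1000 * outcome["humans_endorse_my_goals"]

# Version B: a small specification error -- the deference term was left out.
def utility_without_deference(outcome):
    return outcome["paperclips"]

human_correction = lambda outcome: outcome["human_values_satisfied"]

switch = {"paperclips": 10, "humans_endorse_my_goals": 1, "human_values_satisfied": 1}
stay   = {"paperclips": 50, "humans_endorse_my_goals": 0, "human_values_satisfied": 0}

print(Agent(utility_with_deference).would_switch(human_correction, switch, stay))     # True
print(Agent(utility_without_deference).would_switch(human_correction, switch, stay))  # False
```

In both cases the agent can perfectly well represent and understand the proposed correction; whether it adopts it depends entirely on whether the meta-level clause was specified correctly in the first place.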
I suggest that it’s a straw man to claim that anyone has argued ‘the superintelligence wouldn’t understand what you wanted it to do, if you didn’t program it to fully understand that at the outset’. Do you have evidence that this is a position held by, say, anyone at MIRI?
MIRI assumes that programming what you want an AI to do at the outset (Big Design Up Front) is a desirable feature for some reason.
The most common argument is that it is a necessary prerequisite for provable correctness, which is a desirable safety feature. OTOH, goal flexibility, the exact opposite of massive hardcoding, is itself a necessary prerequisite for corrigibility, which is also a desirable safety feature.
The latter point has not been argued against adequately, IMO.
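For concreteness, a toy sketch (hypothetical, my own made-up names, not anyone’s proposed design) of the two stances being weighed here: a goal frozen at construction, which can be checked once but never corrected, versus a goal slot that authorized humans may rewrite, which is correctable but only as trustworthy as the code guarding the slot:

```python
# Hypothetical contrast between the two design stances discussed above.

class HardcodedAgent:
    """Big Design Up Front: the goal is frozen at construction.
    You can try to verify it once, but you cannot fix it afterwards."""
    def __init__(self):
        self._goal = lambda outcome: outcome["paperclips"]

    def score(self, outcome):
        return self._goal(outcome)


class CorrigibleAgent:
    """Goal flexibility: authorized humans may replace the goal later.
    Correctable, but only as safe as the update mechanism itself."""
    def __init__(self, initial_goal):
        self._goal = initial_goal

    def update_goal(self, new_goal, authorized: bool):
        if authorized:  # this guard is itself code that humans can get wrong
            self._goal = new_goal

    def score(self, outcome):
        return self._goal(outcome)


fixed = HardcodedAgent()
flexible = CorrigibleAgent(initial_goal=lambda o: o["paperclips"])
flexible.update_goal(lambda o: o["human_values_satisfied"], authorized=True)
```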