The key question here is how difficult the objective O is to achieve. If O is “drive a car from point A to point B”, then we agree that it is feasible to have AI systems that “strongly increase the chance of O occurring” (which is precisely what we mean by “goal-directedness”) without being dangerous. But if O is something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it), then it seems that any system that does reliably achieve O has to “find new and strange routes to O” almost tautologically.
Once we build AI systems that find such new routes for achieving an objective, we’re in dangerous territory, no matter whether they are explicit utility maximizers, self-modifying, etc. The dangerous part is coming up with new routes that achieve the objective, since most of these routes will contain steps that look like “acquire resources” or “manipulate humans”.
This seems pretty wrong. Many humans are trying to achieve goals that no one currently knows how to achieve, and they are mostly doing that in “expected” ways, and I expect AIs would do the same. If O is “solve an unsolved math problem”, the expected way to do that is to think about math, not to try to take over the world. If O is “cure a disease”, the expected way to do that is to do medical research, not to “acquire resources”. In fact, it seems hard to think of an objective where “do normal work in the existing paradigm” is not a promising approach.
Two responses:
For “something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it)”, I didn’t have in mind things like “cure a disease”. Humanity might currently not have a cure for a particular disease, but we’ve found many cures before. This seems like the kind of problem that might be solved even without AGI (e.g. AlphaFold already seems helpful, though I don’t know much about the exact process). Instead, think along the lines of “build working nanotech, and do it within 6 months” or “wake up these cryonics patients”, etc. These are things humanity might do at some point, but they’re clearly outside the scope of what we can currently do within a short timeframe. If you tell a human “build nanotech within 6 months”, they don’t solve it the expected way; they just fail. Admittedly, our post is pretty unclear about where to draw the boundary, in part because it seems hard to tell where it is exactly. I would guess it’s below nanotech or cryonics (and lots of other examples), though.
It shouldn’t be surprising that humans mostly do things that aren’t completely unexpected from the perspective of other humans. We all roughly share a cognitive architecture and a set of values. Plans of the form “take over the world so I can revive this cryonics patient” just sound crazy to us; after all, what’s the point of reviving someone if doing so kills most other humans? If we could instill exactly the right sense of which plans are crazy into an AI, that would seem like major progress in alignment! Until then, I don’t think we can draw conclusions from humans to AI that easily.