OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about “thinking about politics” or “breaking laws” etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?
I don’t really understand how thinking about politics is a failure mode. For breaking laws, it depends a lot on the nature of the law-breaking: law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable, for basically the same reasons.
(I didn’t find the “...something has gone extremely wrong in a way that feels preventable” framing helpful, because it seems trivially true. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion, tame it, and make it put its mouth around your head, and you end up dead because you don’t know what you are doing, that’s totally preventable too, because if you were an elite circus trainer you would have done it correctly.)
I’m not really sure if or how this is a reductio. I don’t think it’s a trivial statement that this failure is preventable, unless you mean preventable by not running the AI at all. Indeed, that’s really all I want to say: that this failure seems preventable, and that intuition doesn’t seem empirically contingent, so it seems plausible to me that the solubility of the alignment problem also isn’t empirically contingent.
Thinking about politics may not be a failure mode; my question was whether it feels “extreme and somewhat strange” (sorry for not clarifying). Like, suppose for some reason “doesn’t think about politics” were on your list of desiderata for the extremely powerful AI you are building. So thinking about politics would, in that case, be a failure mode. Would it be an extreme and somewhat strange one?
I’d be interested to hear more about the law-breaking stuff—what is it about some laws that makes AI breaking them unsurprising/normal/hard-to-avoid, whereas for others AI breaking them is perverse/strange/avoidable?
I wasn’t constructing a reductio, just explaining why the phrase didn’t help me understand your view/intuition. When I hear that phrase, it seems to me to apply equally to the grenade case, the lion-bites-head-off case, the AI-is-egregiously-misaligned case, etc. All of those cases feel the same to me.
(I do notice a difference between these cases and the bridge case. With the bridge, there’s some sense in which no bridge you could have built would have been strong enough to hold up under a sufficiently heavy load. By contrast, with AI, lions, and rocket-armchairs, there’s at least some possible way to handle it well besides “just don’t do it in the first place.” Is this the distinction you are talking about?)
Is your claim just that the solubility of the alignment problem is not empirically contingent, i.e. there is no possible world (no set of laws of physics and initial conditions) such that someone like us builds some sort of super-smart AI, and it becomes egregiously misaligned, and there was no way for them to have built the AI without it becoming egregiously misaligned?