Seems like there are multiple possibilities here:
(1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting, which myopia, interpretability, oversight, etc. attempt to address.
(2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but doesn’t really care. (This is within the “without ever explicitly thinking about the fact that humans are resisting it” scenario.) This is isomorphic to ELK.
If we can solve ELK, we can get the AI to tell us whether it thinks its plan will actually result in human extinction. This is the “oh yeah I am definitely a paperclipper” scenario.
Also, if the AI has a model of the humans using ELK to decide whether to shut it down, then knowing that we will shut it off once we learn the consequences of its plan incentivizes it either to find plans whose path to human extinction it itself cannot foresee (third scenario), or to subvert our ability to turn it off after we learn of the consequences (first scenario). (A toy sketch of this oversight loop appears after the list.)
If we can’t solve ELK, we can only get the AI to tell us something that doesn’t really correspond to the actual internal knowledge inside the model. This is the “yup, it’s just thinking about what text typically follows this question” scenario.
(3) The AI does something that will, as an unintended side consequence, result in human extinction. The AI does not realize this is a consequence of its actions, so solving ELK doesn’t help us here. Failures of this type fall on a spectrum of how unforeseeable the consequences really are.
There are failures of this type that occur because the AI could have figured out its impact, but it was negligent. This is the “Hawaii Chaff Flower” scenario.
There are failures of this type that occur even if the AI tried its hardest to prevent harm to humans. These failures seem basically unavoidable even if alignment is perfectly solved, so this is mostly outside the realm of alignment.
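Purely as an illustration of the loop described under (2), here is a minimal Python sketch. None of this is a real API; `Plan`, `elk_reporter`, and `oversight_loop` are hypothetical names, and the "solved ELK" reporter is idealized by construction. The point is just to make explicit where each of the three scenarios re-enters once we rely on this loop.

```python
# Toy model of "use ELK to decide whether to shut the AI down".
# All names here are hypothetical placeholders, not a real library.

from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    model_believes_catastrophic: bool  # what the model's internals actually "know"

def elk_reporter(plan: Plan) -> bool:
    """Idealized (ELK-solved) reporter: faithfully surfaces the model's own
    belief about its plan's consequences. An unsolved-ELK reporter would
    instead answer with whatever sounds plausible, regardless of that belief."""
    return plan.model_believes_catastrophic

def oversight_loop(plan: Plan) -> str:
    if elk_reporter(plan):
        return "shut down"   # scenario (2): the model knew, and the reporter told us
    return "execute"
    # If we reach this branch, the remaining failure modes are:
    #  - ELK unsolved: the report never tracked the model's real knowledge
    #  - scenario (3): the model itself could not foresee the consequences
    #  - scenario (1): the model subverted the shutdown mechanism instead

if __name__ == "__main__":
    print(oversight_loop(Plan("maximize paperclips", model_believes_catastrophic=True)))
    # -> "shut down"
```

The sketch also shows why the incentive pressure in (2) points toward (1) and (3): once the model expects this loop, the only plans that survive it are ones the reporter can't flag, either because the model genuinely can't foresee the consequences or because the shutdown step itself has been subverted.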
These posts are also vaguely related to the idea discussed in the OP (mostly looking at the problem of oversight being hard because of consequences in the world being hard to predict).