I agree that your paper strengthens the case for the instrumental convergence thesis (ICT) (and is also, in general, very cool!). One possible objection to the ICT, as traditionally formulated, has been that it’s too vague: there are lots of different ways you could define a subset of possible minds, and then a measure over that subset, and not all of these ways actually imply that “most” minds in the subset have dangerous properties. Your paper definitely makes the ICT crisper, more clearly true, and more closely/concretely linked to AI development practices.
I still think, though, that the ICT only gets us a relatively small portion of the way to believing that extinction-level alignment failures are likely. A couple of thoughts:
It may be useful to distinguish between “power-seeking behavior” and omnicide (or equivalently harmful behavior). We do want AI systems to pursue power-seeking behaviors, to some extent. Making sure not to lock yourself in the bathroom, for example, qualifies as a power-seeking behavior—it’s akin to avoiding “State 2” in your diagram—but it is something that we’d want any good house-cleaning robot to do. It’s only a particular subset of power-seeking behavior that we badly want to avoid (e.g., killing people so they can’t shut you off).
This being said, I imagine that, if we represented the physical universe as an MDP, defined a reward function over its states, and used a discount rate low enough that the agent is far-sighted (i.e., a discount factor close to 1), then the optimal policy for most reward functions probably would involve omnicide. So the result probably does port over to this special case. Still, I think that keeping in mind the distinction between omnicide and “power-seeking behavior” (in the context of some particular MDP) does reduce the ominousness of the result to some degree.
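To make the “avoiding the option-destroying state” version of this concrete, here’s a toy sketch (my own construction, hypothetical, not anything from your paper): a four-state deterministic MDP where one state is absorbing, in the spirit of the locked-bathroom example. If you sample reward functions uniformly at random and solve for far-sighted optimal policies, most of the sampled reward functions lead the agent to steer clear of the absorbing state, even though nothing about “avoid that state” was written into any of the rewards.

```python
# A toy sketch (assumed setup, not from the paper): 4-state deterministic MDP.
# State 2 is an absorbing "locked in the bathroom" state that destroys options.
# We sample random reward functions over states, solve each by value iteration
# with a far-sighted agent, and count how often the optimal first move from the
# start state avoids the absorbing state.
import numpy as np

N_STATES = 4
# Deterministic transitions: from each state, the set of states you can move to.
ACTIONS = {
    0: [1, 2, 3],  # 0 = hallway (start): can reach every room, including the trap
    1: [0, 1],     # 1 = kitchen
    2: [2],        # 2 = locked bathroom: absorbing, no way back out
    3: [0, 3],     # 3 = living room
}
GAMMA = 0.95       # discount factor near 1, i.e. a far-sighted agent


def optimal_values(reward, n_iters=500):
    """Value iteration for state-based rewards r(s'), received on arrival at s'."""
    v = np.zeros(N_STATES)
    for _ in range(n_iters):
        v = np.array([max(reward[s2] + GAMMA * v[s2] for s2 in ACTIONS[s])
                      for s in range(N_STATES)])
    return v


rng = np.random.default_rng(0)
n_trials = 500
avoids_trap = 0
for _ in range(n_trials):
    reward = rng.uniform(0.0, 1.0, size=N_STATES)  # a random reward function
    v = optimal_values(reward)
    # Greedy (optimal) first move from the hallway. Because the MDP and policy
    # are deterministic, avoiding state 2 on the first move means never entering it.
    best_next = max(ACTIONS[0], key=lambda s2: reward[s2] + GAMMA * v[s2])
    avoids_trap += (best_next != 2)

print(f"fraction of random reward functions whose optimal policy "
      f"avoids the absorbing state: {avoids_trap / n_trials:.2f}")
```

The point of the toy version is just that “keep your options open” falls out of far-sightedly optimizing most reward functions, which is the benign-looking end of the phenomenon; nothing in the math itself distinguishes that from the omnicidal end.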
Ultimately, for most real-world tasks, I think it’s unlikely that people will develop RL systems using hand-coded reward functions (and then deploy them). I buy the framing in (e.g.) the DeepMind “scalable agent alignment” paper, Rohin’s “narrow value learning” sequence, and elsewhere: that, over time, the RL development process will necessarily look less and less like “pick a reward function and then let an RL algorithm run until you get a policy that optimizes the reward function sufficiently well.” There’s seemingly just not that much you can do with hand-coded reward functions. I think these more sophisticated training processes will probably be pretty strongly attracted toward non-omnicidal policies. At a higher level, engineers will also be attracted toward using training processes that produce benign/useful policies. They should have at least some ability to notice or foresee issues with classes of training processes before any of them are used to produce systems that are willing and able to commit omnicide. In other words, I think it’s reasonable to be optimistic that we’ll do much better than random when producing the policies of advanced AI systems.
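For concreteness, here’s a minimal sketch of one thing “not hand-coding the reward” can look like: fitting a reward model to pairwise preference comparisons (a Bradley-Terry-style setup, in the spirit of reward modeling). Everything in it (the linear model, the feature dimension, the synthetic “human” labels) is an illustrative assumption rather than anyone’s actual pipeline.

```python
# A minimal, assumed sketch of reward learning from pairwise preferences,
# as a contrast with writing a reward function down by hand. A stand-in
# "human" prefers whichever trajectory scores higher under latent weights;
# we fit a linear reward model to those comparisons (Bradley-Terry / logistic).
import numpy as np

rng = np.random.default_rng(1)
D = 8                                    # trajectory feature dimension (assumed)
true_w = rng.normal(size=D)              # stand-in for the human's latent preferences


def synthetic_preference(feat_a, feat_b):
    """Pretend-human label: 1.0 if trajectory a is preferred over b, else 0.0."""
    return 1.0 if true_w @ feat_a > true_w @ feat_b else 0.0


# Collect synthetic comparison data: pairs of trajectory feature vectors + labels.
pairs = [(rng.normal(size=D), rng.normal(size=D)) for _ in range(500)]
labels = [synthetic_preference(a, b) for a, b in pairs]

# Fit the reward model r(x) = w @ x by gradient ascent on the Bradley-Terry
# log-likelihood: P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(D)
lr = 0.1
for _ in range(200):
    grad = np.zeros(D)
    for (a, b), y in zip(pairs, labels):
        p = 1.0 / (1.0 + np.exp(-(w @ a - w @ b)))
        grad += (y - p) * (a - b)
    w += lr * grad / len(pairs)

# The learned w (not a hand-coded formula) is what would then drive policy
# optimization. Check how well it recovered the latent preference direction.
cos = (w @ true_w) / (np.linalg.norm(w) * np.linalg.norm(true_w))
print(f"cosine similarity between learned and latent reward weights: {cos:.2f}")
```

Nothing about this guarantees the learned reward is safe, of course; the point is just that the object being optimized is already the output of a training process with humans in the loop, not a fixed hand-written function.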
I do still think that the ICT is true, though, and I do still think that it matters: it’s (basically) necessary for establishing a high level of misalignment risk. I just don’t think it’s sufficient on its own (and I’m skeptical of certain other premises that would be sufficient to establish a high level of risk).