One thing that seems worth mentioning is that, based on my understanding of Alignment Theory, if some smarter version of ChaosGPT did kill all humans, it wouldn’t be because of the instructions it was given, but for the same reason any unaligned AI would kill all humans—that is, because it’s unaligned. It’s hard for me to imagine a scenario where an unaligned agent like ChaosGPT would be more likely to kill everyone than any given unaligned AI; the whole deal with the Outer Alignment Problem is that we don’t yet know how to get agents to do the things we want them to do, regardless of whether those things are benevolent or destructive or anything in between.
Still, I agree that this sets a horrible precedent and that this sort of thing should be prosecuted in the future, if only because, if we do solve Alignment at some point, an agent like ChaosGPT could be dangerous for different (and obvious) reasons, unrelated to being unaligned.
I think most alignment people would use “aligned” as I am here. ChaosGPT is aligned to the intent of the person who set it in motion (at least initially), even though it’s not aligned to the values of humanity at large. That would be bad outer alignment and bad coordination in the way I’m using those terms.
And it will destroy humanity (if it gets smart enough to) for a very different reason than an unaligned AGI would. That’s its goal, while for an unaligned AGI it would be a subgoal or a side effect.
It’s increasingly incorrect to say that we have no idea how to get an AGI to do what we want. We have no idea how to do that in closed-form code instructions, but the limited success of RLHF and other training indicates that we have at least some ability to steer the behavior of deep networks. I think it’s still fair to say that we don’t have methods we can be confident of, or that are stable over time and learning. I’m nominating this approach of giving explicit goals in language as our new best shot.