I think most alignment people would use “aligned” as I am here. ChaosGPT is aligned to the intent of the person who set it in motion (at least initially), even though it’s not aligned to the values of humanity at large. That would be bad outer alignment and bad coordination in the way I’m using those terms.
And it will destroy humanity (if it gets smart enough to) for a very different reason than an unaligned AGI would. That’s its goal, while for an unaligned AGI it would be a subgoal or a side effect.
It’s increasingly incorrect to say that we have no idea how to get an AGI to do what we want. We have no idea how to do that in closed-form code instructions, but the limited success of RLHF and other training methods indicates that we have at least some ability to steer the behavior of deep networks. I think it’s still fair to say that we don’t have methods we can be confident of, or that remain stable over time and continued learning. I’m nominating this approach of giving explicit goals in language as our new best shot.