Case 4 does include the subset in which a model trained on a massive amount of human culture and memetics develops human-aligned goals that are better than anything specifically aimed at by the developer or instructed by the user. If I want my model to be helpful and nice to people, and the model solves this through RLAIF by vowing to help all beings achieve enlightenment and escape suffering as a self-set deeper goal, that’s probably actually desirable from my perspective, even if I am deceived at times.
That’s one possibility, yes. A model does understand humans pretty well when trained on all our data. But...
a) it doesn’t have to be trained that way. We should assume some models will be, and some will be trained in other ways, such as on simulations and synthetic data.
b) if a bad actor, say a terrorist seeking to harm the world, RLHFs the model into being actively evil, the model will go along with that. Understanding human ethics does not prevent this.
Point (b) is fully general to all cases: you can equally train a perfectly corrigible model to refuse instructions instead. (Though there’s progress being made on making such retraining more effort-intensive.)
Yes, I agree, Ann. Perhaps I didn’t make my point clear enough. I believe that we are currently in a gravely offense-dominant situation as a society. We are at great risk from technologies such as biological weapons. As AI gets more powerful and our technology advances, it gets easier and easier for a single bad actor to cause great harm, unless we take preventative measures ahead of time.
Similarly, once AI is powerful enough to enable recursive self-improvement cheaply and easily, then a single bad actor can throw caution to the wind and turn the accelerator up to max. Even if the big labs act cautiously, unless they do something to prevent the rest of the world from developing the same technology, eventually it will spread widely.
Thus, the concerns I’m expressing are about how to deal with points of failure from a security point of view. This is a very different concern from worrying about whether the median case will go well.
I have been following the progress on adding resistance to harm-enabling fine-tuning. I am glad someone is working on it, but it seems very far from useful yet. I don’t think it will be sufficient to prevent the sort of harms I’m worried about, for a variety of reasons, though it is perhaps a useful contribution to a ‘Swiss cheese’ defense. If ideas like this succeed and are widely adopted, they might at least slow down bad actors and raise the cost of doing harm. But slightly slowing bad actors and raising their costs is not very reassuring when we are talking about devastating civilization-level harms.