I’m not sure how serious this suggestion is, but note that:
It involves first training a model to be evil, running it, and hoping that you are good enough at jailbreaking to make it actually good rather than merely pretending to be good, and then somehow keeping that result stable.
The opposite of something really bad is not necessarily good. E.g., the opposite of a paperclip maximiser is… I guess a paperclip minimiser? That seems approximately as bad.