Good point, and I was conflicted about whether to put my thoughts on this at the end of the post. My best theory is that increased persuasion ability looks something like "totalitarian government agents building solid scaffolding on open-source models to DM people on Facebook". We will see persuasive agents get better, but not know why or how. As stated in the introduction, persuasion detection is dangerous, but it is one of the few capabilities that could also be used defensively (e.g. detecting persuasion in an incoming email → displaying a warning in the UI and offering to rephrase).
In conclusion, I definitely agree that we should consider closed-sourcing any improvements upon the above baseline and only showing them to safety orgs instead. Some people at AISI I talked to while working on persuasion are probably interested in this.
I do think there is a lot of value in having a precise and objective measure of capabilities that are potentially of concern. I also agree that many such evals are unsafe to publish publicly and should be kept private.