I am very confused.
My first thought when reading this was ‘huh, no wonder they’re getting mixed results—they’re doing it wrong’.
My second thought when returning to this a day later: good—anything I do to contribute to the ability to understand and measure persuasion is literally directly contributing to dangerous capabilities.
Counterfactually, if we don’t create evals for this… are we not expected to notice that LLMs are becoming increasingly persuasive? More able to model and predict human psychology?
What is actually the ‘safety’ case for this research? What theory of change predicts this work will be net positive?
Good point, and I was conflicted about whether to put my thoughts on this at the end of the post. My best theory is that increased persuasion ability looks something like “totalitarian government agents doing solid scaffolding on open-source models to DM people on Facebook”. We will see persuasive agents getting better, but not know why or how. As stated in the introduction, persuasion detection is dangerous, but it is one of the few capabilities that could also be used defensively (e.g. detecting persuasion in an incoming email → displaying a warning in the UI and offering to rephrase).
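To make the defensive flow concrete, here is a minimal sketch of what it could look like. Everything in it is illustrative: `score_persuasion` is a hypothetical placeholder (a real system would call a persuasion-detection model or eval), and the threshold is an arbitrary assumption.

```python
# Sketch of the defensive use case described above: score an incoming email
# for persuasion attempts and, above a threshold, surface a UI warning plus
# an offer to rephrase. The scorer below is a stand-in, not a real detector.

from dataclasses import dataclass

PERSUASION_THRESHOLD = 0.7  # hypothetical cutoff for showing a warning


@dataclass
class EmailCheckResult:
    score: float          # estimated likelihood the email is manipulative
    show_warning: bool    # whether the UI should flag the message
    offer_rephrase: bool  # whether to offer a neutral rephrasing


def score_persuasion(text: str) -> float:
    """Placeholder: a real implementation would query a persuasion-detection
    model (e.g. an LLM-based classifier), not match surface cues."""
    cues = ["act now", "only you can", "don't tell anyone", "last chance"]
    hits = sum(cue in text.lower() for cue in cues)
    return min(1.0, hits / len(cues))


def check_incoming_email(text: str) -> EmailCheckResult:
    score = score_persuasion(text)
    flagged = score >= PERSUASION_THRESHOLD
    return EmailCheckResult(score=score, show_warning=flagged, offer_rephrase=flagged)


if __name__ == "__main__":
    result = check_incoming_email("Act now -- only you can fix this, last chance!")
    if result.show_warning:
        print(f"Possible persuasion attempt (score={result.score:.2f}). Rephrase?")
```

The point of the sketch is only the shape of the pipeline: detection runs on the incoming message, and the result drives a warning and an optional rephrase, rather than blocking anything outright.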
In conclusion, I definitely agree that we should consider closed-sourcing any improvements on the above baseline and instead only showing them to safety orgs. Some people at AISI whom I talked to while working on persuasion would probably be interested in this.
I do think there is a lot of value in having a precise and objective measure of capabilities that are potentially of concern. I also agree that many such evals are unsafe to publish publicly and should be kept private.