E.g., much more of the action is in deciding exactly who to influence and what to influence them to do.
Are you thinking specifically of exfiltration here?
Persuasion can be used for all sorts of things if you are considering both misuse and misalignment, so if you are considering a specific threat model, I expect my response will be “sure, but there are other threat models where the ‘who’ and ‘what’ can be done by humans”.
Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn’t find this in the paper, sorry if I missed it.)
I was considering a threat model in which the AI is acting mostly autonomously. This would include self-exfiltration, but also trying to steer the AI lab or the world in some particular direction.
I agree that misuse threat models, e.g. where the AI is used to massively reduce the cost of swinging votes by interacting with huge numbers of people in some capacity, are also plausible. (Here a human or some other process decides who the AI should talk to and which candidate it should try to persuade the person to vote for.)
Other than political views, I guess I don’t really see much concern here, but I’m uncertain. If the evals are mostly targeting political persuasion, it might be nicer to just do this directly? (Though this obviously has some downsides.)
I’m also currently skeptical of the applicability of these evals to political persuasion and similar, though my objection isn’t “much more of the action is in deciding exactly who to influence and what to influence”, and is more “I don’t really see a strong story for correspondence (such a story seems maybe important in the persuasion case) and it would maybe be better to target politics directly”.
I think much more of the risk will be in these autonomous cases and I guess I assumed the eval was mostly targeting these cases.
Fwiw I’m also skeptical of how much we can conclude from these evals, though I think they’re way above the bar for “worthwhile to report”.
Another threat model you could care about (within persuasion) is targeted recruitment for violent ideologies. With that one too it’s plausible you’d want a more targeted eval, though I think simplicity, generality, and low cost are also reasonable things to optimize for in evals.
I don’t know the exact details but to my knowledge we didn’t have trouble getting the model to lie (e.g. for web of lies).
Yeah, maybe I’m pretty off base in what the meta-level policy should be like. I don’t feel very strongly about how to manage this.
I’ve also now realized that some of the language was stronger than I intended, and I’ve edited the original comment. Sorry about that.