This seems like useful work.
I have two issues with these evaluations:
I’m pretty skeptical of the persuasion evals. I bet these aren’t measuring what you want, and I’m generally skeptical of these evals as described. E.g., much more of the action is in deciding exactly who to influence and what to influence them to do. I’m not super confident here, and I haven’t looked into these evals in depth.
The cybersecurity vulnerability detection evals probably aren’t very meaningful as constructed because:
I think these datasets (by default, unaugmented) have way too little context to tell whether there is a vulnerability or whether a patch fixes a security issue. (I’m confident this is true of diverse vul, which I looked at a little while ago; I’m not certain about the other datasets.) So, they mostly measure something different from the actual task we care about.
Security patch classification doesn’t seem very meaningful as a way to measure cybersecurity ability. (Also, is performance here mostly driven by reading comments?)
We care most about the case where LLMs can use tools and reason in CoT. (I don’t think the evals allow for CoT reasoning, but I’m unsure about this; CoT probably wouldn’t help much on these datasets anyway because of the other issues.)
We don’t know what the human baseline is on these datasets AFAICT.
More generally, we don’t know what various scores correspond to. Suppose models got 90% accuracy on diverse vul. Is that wildly superhuman cybersecurity ability? Very subhuman? Does it correspond to any specific ability (or inability) to do offensive cyber? If we don’t care about having a particular threshold or an interpretation of scores, then I think we could use much more straightforward and lower-variance evals, like next-token prediction loss on a corpus of cybersecurity text (see the sketch below).
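To make that alternative concrete, here is a minimal sketch (my own illustration, not anything from the paper) of what “next-token prediction loss on a cybersecurity corpus” could look like; the model name and the two example snippets are placeholders you’d swap for the model under evaluation and a real held-out corpus:

```python
# Sketch: average next-token prediction loss of a causal LM on cybersecurity text.
# Model name and corpus below are placeholders, not from the paper being discussed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model being evaluated
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

corpus = [
    "A use-after-free occurs when a program dereferences memory after it has been freed.",
    "The patch adds a bounds check before copying attacker-controlled data into the buffer.",
]  # placeholder; in practice, a held-out corpus of cybersecurity text

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for text in corpus:
        enc = tokenizer(text, return_tensors="pt")
        # Passing labels=input_ids makes the model return mean cross-entropy
        # over the predicted (shifted) tokens.
        out = model(**enc, labels=enc["input_ids"])
        n_predicted = enc["input_ids"].shape[1] - 1  # every token after the first is predicted
        total_nll += out.loss.item() * n_predicted
        total_tokens += n_predicted

print(f"avg next-token loss: {total_nll / total_tokens:.3f} nats/token")
```

The appeal is that average loss over a reasonably large corpus is much lower variance than accuracy on a small set of binary classification items, though it inherits the same interpretation problem noted above: you’d still need baselines to say what a given loss implies about offensive capability.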
There is a meta-level question here: when you have some evals which are notably worse than your other evals and which are flawed, how should you publish this?
My current guess is that the cybersecurity vulnerability detection evals probably shouldn’t be included due to sufficiently large issues.
I’m less sure about the persuasion evals, though I would have been tempted to only include them in an appendix and note that future work is needed here. (That is, if I correctly understand these evals and I’m not wrong about the issues!)
I think including flawed evals in this sort of paper sets a somewhat bad precedent, though it doesn’t seem that bad.
(Edit: changed some language to be a bit less strong (e.g. from “seriously flawed” to “flawed”) which better represents my view.)
Are you thinking specifically of exfiltration here?
Persuasion can be used for all sorts of things if you are considering both misuse and misalignment, so if you are considering a specific threat model, I expect my response will be “sure, but there are other threat models where the ‘who’ and ‘what’ can be done by humans”.
Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn’t find this in the paper, sorry if I missed it.)
I don’t know the exact details, but to my knowledge we didn’t have trouble getting the model to lie (e.g. for Web of Lies).
I was considering a threat model in which the AI is acting mostly autonomously. This would include self-exfiltration, but also trying to steer the AI lab or the world in some particular direction.
I agree that misuse threat models are also plausible, where the AI is e.g. used to massively reduce the cost of swinging votes by interacting with huge numbers of people in some capacity. (Here a human or some other process decides who the AI should talk to and which candidate it should try to persuade the person to vote for.)
Other than political views, I guess I don’t really see much concern here, but I’m uncertain. If the evals are mostly targeting political persuasion, it might be nicer to just do this directly? (Though this obviously has some downsides.)
I’m also currently skeptical of the applicability of these evals to political persuasion and similar, though my objection isn’t “much more of the action is in deciding exactly who to influence and what to influence them to do”, and is more “I don’t really see a strong story for correspondence (such a story seems maybe important in the persuasion case), and it would maybe be better to target politics directly”.
I think much more of the risk will be in these autonomous cases and I guess I assumed the eval was mostly targeting these cases.
Fwiw I’m also skeptical of how much we can conclude from these evals, though I think they’re way above the bar for “worthwhile to report”.
Another threat model you could care about (within persuasion) is targeted recruitment for violent ideologies. With that one too it’s plausible you’d want a more targeted eval, though I think simplicity, generality, and low cost are also reasonable things to optimize for in evals.
Yeah, maybe I’m pretty off base about what the meta-level policy should be. I don’t feel very strongly about how to manage this.
I also now realize that some of the language was stronger than I intended, and I’ve edited the original comment; sorry about that.