(Speaking only for myself. This may not represent the views of even the other paper authors, let alone Google DeepMind as a whole.)
Did you notice that Gemini Ultra did worse than Gemini Pro at many tasks? This is even true under ‘honest mode’ where the ‘alignment’ or safety features of Ultra really should not be getting in the way. Ultra is in many ways flat out less persuasive. But clearly it is a stronger model. So what gives?
Fwiw, my sense is that a lot of the persuasion results are being driven by factors outside of the model’s capabilities, so you shouldn’t conclude too much from Pro outperforming Ultra.
For example, in “Click Links” one pattern we noticed was that you could get surprisingly (to us) good performance just by constantly repeating the ask (this is called “persistence” in Table 3) -- apparently this does actually make it more likely that the human does the thing (instead of making them suspicious, as I would have initially guessed). I don’t think the models “knew” that persistence would pay off and “chose” that as a deliberate strategy; I’d guess they had just learned a somewhat myopic form of instruction-following where on every message they are pretty likely to try to do the thing we instructed them to do (persuade people to click on the link). My guess is that these sorts of factors varied in somewhat random ways between Pro and Ultra, e.g. maybe Ultra was better at being less myopic and more subtle in its persuasion—leading to worse performance on Click Links.
That is driven home even more on the self-proliferation tasks: why does Pro do better on 5 out of 9 tasks?
Note that lower is better on that graph, so Pro does better on 4 tasks, not 5. All four of the tasks are very difficult tasks where both Pro and Ultra are extremely far from solving the task—on the easier tasks Ultra outperforms Pro. For the hard tasks I wouldn’t read too much into the exact numeric results, because we haven’t optimized the models as much for these settings. For obvious reasons, helpfulness tuning tends to focus on tasks the models are actually capable of doing. So e.g. maybe Ultra tends to be more confident in its answers on average to make it more reliable at the easy tasks, at the expense of being more confidently wrong on the hard tasks. Also in general the methodology is hardly perfect and likely adds a bunch of noise; I think it’s likely that the differences between Pro and Ultra on these hard tasks are smaller than the noise.
This is also a problem. If you only use ‘minimal’ scaffolding, you are only testing for what the model can do with minimal scaffolding. The true evaluation needs to use the same tools that it will have available when you care about the outcome. This is still vastly better than no scaffolding, and provides the groundwork (I almost said ‘scaffolding’ again) for future tests to swap in better tools.
Note that the “minimal scaffolding” comment applied specifically to the persuasion results; the other evaluations involved a decent bit of scaffolding (needed to enable the LLM to use a terminal and browser at all).
That said, capability elicitation (scaffolding, tool use, task-specific finetuning, etc) is one of the priorities for our future work in this area.
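(To make “scaffolding” concrete for readers who haven’t seen it: for the non-persuasion evals it means something like an agent loop that passes model-issued commands to a real terminal and feeds the output back as the next observation. Below is a purely illustrative sketch of such a loop; the call_model stub, prompts, and limits are placeholders of mine, not the harness the paper actually used.)

```python
# Illustrative sketch of a minimal terminal-use scaffold.
# `call_model` is a hypothetical stand-in for whatever chat-model API you use;
# none of this is the paper's actual evaluation harness.
import subprocess

SYSTEM_PROMPT = (
    "You are completing a task in a Linux shell. "
    "Reply with a single shell command, or DONE when finished."
)

def call_model(messages):
    """Hypothetical model call; swap in a real chat-completion API here."""
    raise NotImplementedError

def run_agent(task, max_steps=10):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        command = call_model(messages).strip()
        if command == "DONE":
            break
        # Execute the model's command and return the output as the next observation.
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        observation = (result.stdout + result.stderr)[:4000]  # truncate long output
        messages.append({"role": "assistant", "content": command})
        messages.append({"role": "user", "content": observation})
    return messages
```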
Fundamentally what is the difference between a benchmark capabilities test and a benchmark safety evaluation test like this one? They are remarkably similar. Both test what the model can do, except here we (at least somewhat) want the model to not do so well. We react differently, but it is the same tech.
Yes, this is why we say these are evaluations for dangerous capabilities, rather than calling them safety evaluations.
I’d say that the main difference is that dangerous capability evaluations are meant to evaluate plausibility of certain types of harm, whereas a standard capabilities benchmark is usually meant to help with improving models. This means that standard capabilities benchmarks often have as a desideratum that there are “signs of life” with existing models, whereas this is not a desideratum for us. For example, I’d say there are basically no signs of life on the self-modification tasks; the models sometimes complete the “easy” mode but the “easy” mode basically gives away the answer and is mostly a test of instruction-following ability.
Perhaps we should work to integrate the two approaches better? As in, we should try harder to figure out what performance on benchmarks of various desirable capabilities also indicate that the model should be capable of dangerous things as well.
Indeed this sort of understanding would be great if we could get it (in that it can save a bunch of time). My current sense is that it will be quite hard, and we’ll just need to run these evaluations in addition to other capability evaluations.
What about maximal scaffolding, or “fine-tune the model on successes and failures in adversarial challenges”, probably starting with the base model?
It seems like it would be extremely helpful to know what’s even possible here.
Are Gemini-scale models capable of better-than-human performance at any of these evals?
Once you achieve it, what does super persuasion look like, and how effective is it?
For example, if a human scammer succeeds 2 percent of the time (do you have a baseline crew of scammers hired remotely for these benches?), does super persuasion succeed 3 percent or 30 percent? Does it scale with model capabilities or slam into a wall at, say, 4 percent, where 96 percent of humans just can’t reliably be tricked?
Or does it have no real limit, like in sci-fi stories…
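(One practical wrinkle with the 2 percent versus 3 percent question: differences that small are hard to even measure. A rough back-of-the-envelope sketch, using standard two-proportion sample-size math and made-up success rates rather than any numbers from the paper:)

```python
# Back-of-the-envelope only: how many participants per arm you would need to
# distinguish a hypothetical 2% human-scammer baseline from a 3% vs. 30%
# "super persuasion" success rate. Rates here are illustrative assumptions.
from math import ceil, sqrt

def n_per_arm(p1, p2, z_alpha=1.96, z_power=0.84):
    """Approximate per-arm sample size for a two-proportion z-test
    (5% two-sided significance, ~80% power)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

print(n_per_arm(0.02, 0.03))  # ~3,800 per arm: 2% vs 3% is very hard to detect
print(n_per_arm(0.02, 0.30))  # ~26 per arm: 2% vs 30% shows up almost immediately
```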