There’s still lots and lots of demand for regular capability evaluation, as we keep discovering new issues or LLMs keep tearing through benchmarks and rendering them moot, and the cost of creating a meaningful dataset like GPQA keeps skyrocketing (~$1,000/item) compared to the old days when you could casually Turk your way to questions LLMs would fail (<$1/item). Why think that the dangerous subset would be any different? You think someone is going to come out with a dangerous-capabilities eval in the next year and then that’s it, it’s done, we’ve solved dangerous-capabilities eval, Mission Accomplished?
If it’s well designed and kept private, this doesn’t seem totally implausible to me; e.g. how many ways can you evaluate cyber capabilities to try to assess risks of weight exfiltration or taking over the datacenter (in a control framework)? Surely that’s not an infinite set.
But in any case, it seems pretty obvious that the returns should be quickly diminishing on e.g. the 100th set of dangerous-capabilities (DC) evals vs. e.g. the 2nd set of alignment evals or the 1st set of automated AI safety R&D evals.
It’s not an infinite set and returns diminish, but that’s true of regular capabilities too, no? And you can totally imagine new kinds of dangerous capabilities; every time LLMs gain a new modality or data source, they get a new set of vulnerabilities/dangers. For example, once Sydney went live, you had a whole new kind of dangerous capability: persisting knowledge/attitudes across episodes with unrelated users by generating transcripts which would become available via Bing Search. This would have been difficult to test beforehand, and no one experimented with it AFAIK. But after seeing indirect prompt injections in the wild and possible amplification of the Sydney persona, suddenly people start caring about this once-theoretical possibility and might start evaluating it. (This is also a reason why returns don’t diminish as much, because benchmarks ‘rot’: quite aside from intrinsic temporal drift, ceiling issues, and new areas of danger opening up, there’s leakage, which is just as relevant to dangerous capabilities as to regular capabilities—OK, you RLHFed your model to not provide methamphetamine recipes and this has become a standard for release, but it only works on meth recipes and not other recipes, because no one in your org actually cares and did only the minimum RLHF to pass the eval and provide the execs with excuses, and you used off-the-shelf preference datasets designed to do the minimum, like Facebook releasing Llama… Even if it’s not leakage in the most literal sense of memorizing the exact wording of a question, there’s still ‘meta-leakage’ of overfitting to that sort of question.)
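The literal form of the leakage worry above is what standard benchmark decontamination checks target: flagging eval items whose text shows up nearly verbatim in training data. A minimal sketch (function names and the 8-gram/50% thresholds are illustrative choices, not any lab's actual pipeline), which by construction catches only literal leakage, not the ‘meta-leakage’ of overfitting to a question style:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Word-level n-grams of `text`, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(eval_item: str, corpus_docs: list, n: int = 8,
                    threshold: float = 0.5) -> bool:
    """Flag an eval item if >= `threshold` of its n-grams appear
    verbatim in any single training document. Note this only detects
    literal leakage; paraphrases and style-level overfitting pass."""
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return False
    for doc in corpus_docs:
        overlap = len(item_grams & ngrams(doc, n)) / len(item_grams)
        if overlap >= threshold:
            return True
    return False
```

Real decontamination pipelines work on hashed n-grams over the whole corpus for efficiency, but the shape of the check is the same.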
To expand on the point of benchmark rot, as someone working on dangerous capabilities evals… For biorisk specifically, one of the key things to eval is whether models can correctly guess the results of unpublished research. That is, can a model come up with plausible hypotheses, accurately describe how to test those hypotheses, and make a reasonable guess at the most likely outcomes? Can it do these things at expert human level? Superhuman level?
The trouble with this is that the frontier of published research keeps moving forward, so evals like this go out of date quickly. Nevertheless, such evals can be very important in shaping the decisions of governments and corporations.
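One way to make the rot explicit is to stamp each eval item with the date the underlying result entered the literature, and retire items once that date falls before a model's training cutoff. A minimal sketch (the schema and field names are hypothetical, not any existing eval format):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FrontierEvalItem:
    """One 'predict the unpublished result' eval item (hypothetical schema)."""
    hypothesis_prompt: str   # asks the model to propose/assess a hypothesis
    expected_outcome: str    # what the then-unpublished experiment found
    result_published: date   # when the result entered the literature

    def valid_for(self, model_training_cutoff: date) -> bool:
        # The item has rotted if the result was published before the
        # model's cutoff: a correct answer may just be memorization.
        return self.result_published > model_training_cutoff
```

This doesn't stop the rot, of course; it just forces the eval maintainer to notice which fraction of the set is still measuring prediction rather than recall.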
I do agree that we shouldn’t focus on dangerous capabilities evals to the exclusion of other kinds of evals (e.g. alignment, automated AI safety R&D, even control). However, I think a key point is that the models are dangerous NOW. Alignment, safety R&D, and control are, in some sense, future problems. Misuse is a present and growing danger, getting noticeably worse with every passing month. A single terrorist or terrorist org could wipe out human civilization today, killing >90% of the population with less than $500k in funding (potentially much less if they have access to a well-equipped lab, and clever excuses ready for ordering suspicious supplies). We have no adequate defenses. This seems like an urgent and tractable problem.
Urgent, because the ceiling on uplift is very far away. Models have the potential to make things much much worse than they currently are.
Tractable, because there are relatively cheap actions that governments could take to slow this increase of risk if they believed in the risks.
For what it’s worth, I try to spend some of my time thinking about these other types of evals also. And I would recommend that those working on dangerous capabilities evals spend at least a little time and thought on the other problems.
Another aspect of the problem
A lot of people seem to frequently be ‘trend following’ rather than ‘trend setting’ (by thinking original thoughts for themselves, doing their own research and coming to their own well-formed opinions). If those ‘trend followers’ are also not super high impact thinkers, maybe it’s ok if they’re just doing the obvious things?