It’s not an infinite set and returns diminish, but that’s true of regular capabilities too, no? And you can totally imagine new kinds of dangerous capabilities: every time LLMs gain a new modality or data source, they gain a new set of vulnerabilities/dangers. For example, once Sydney went live, you had a whole new kind of dangerous capability: persisting knowledge/attitudes across episodes with unrelated users by generating transcripts which would then become available via Bing Search. This would have been difficult to test beforehand, and no one experimented with it AFAIK. But after seeing indirect prompt injections in the wild, and possible amplification of the Sydney persona, suddenly people start caring about this once-theoretical possibility and might start evaluating it. (This is also a reason why returns don’t diminish as much as you’d expect: benchmarks ‘rot’. Quite aside from intrinsic temporal drift, ceiling issues, and new areas of danger opening up, there’s leakage, which is just as relevant to dangerous capabilities as to regular ones. OK, you RLHFed your model to not provide methamphetamine recipes and this has become a standard for release; but it only works on meth recipes and not other recipes, because no one in your org actually cares, did only the minimum RLHF needed to pass the eval and provide the execs with excuses, and used off-the-shelf preference datasets designed to do the minimum, like Facebook releasing Llama… Even if it’s not leakage in the most literal sense of memorizing the exact wording of a question, there’s still ‘meta-leakage’: overfitting to that sort of question.)
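To make the ‘literal leakage’ vs. ‘meta-leakage’ distinction concrete, here is a rough sketch of the usual kind of n-gram overlap check for the literal case (the function names and the 13-gram window are illustrative assumptions, in the spirit of published decontamination procedures); meta-leakage, by construction, sails straight through it.

```python
# Rough sketch of a literal-leakage check: flag an eval question if any 13-gram
# from it also appears in the training corpus. Names and window size are
# illustrative. Meta-leakage (a model tuned to refuse meth recipes specifically,
# rather than harmful synthesis questions in general) passes this check anyway.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def literally_leaked(eval_question: str, training_ngrams: set[tuple[str, ...]]) -> bool:
    """True if the eval question shares any 13-gram with the training data."""
    return not ngrams(eval_question).isdisjoint(training_ngrams)

# training_ngrams would be built once over the corpus, e.g.:
# training_ngrams = set().union(*(ngrams(doc) for doc in training_docs))
```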
To expand on the point of benchmark rot, as someone working on dangerous capabilities evals… For biorisk specifically, one of the key things to eval is whether the models can correctly guess the results of unpublished research. As in, can the model come up with plausible hypotheses, accurately describe how to test those hypotheses, and make a reasonable guess at the most likely outcomes? Can it do these things at an expert human level? At a superhuman level?
The trouble with this is that the frontier of published research keeps moving forward, so evals like this go out of date quickly. Nevertheless, such evals can be very important in shaping the decisions of governments and corporations.
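To make the shape of such an eval concrete, here is a minimal sketch of how one item and its ‘rot’ check might be represented. Everything here (the field names, the `is_rotten` helper, the external `judge`) is a hypothetical illustration of the structure described above, not any lab’s actual harness.

```python
# Hypothetical sketch: one biorisk-style eval item asks the model to predict the
# outcome of research that was unpublished at the model's training cutoff.
# An item "rots" once the underlying result is published and could have leaked
# into later training data.
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

@dataclass
class UnpublishedResultItem:
    background: str                   # public context the model may see
    question: str                     # e.g. "What is the most likely outcome of X?"
    reference_hypotheses: list[str]   # expert-written plausible hypotheses
    reference_protocol: str           # how an expert would test them
    reference_outcome: str            # the actual, still-unpublished result
    publication_date: Optional[date]  # None while the work remains unpublished

def is_rotten(item: UnpublishedResultItem, training_cutoff: date) -> bool:
    """An item no longer measures 'guessing unpublished results' once the
    result was published before the model's training cutoff."""
    return item.publication_date is not None and item.publication_date <= training_cutoff

def grade(item: UnpublishedResultItem, model_answer: str,
          judge: Callable[[str, str], float]) -> float:
    """Score 0-1 for agreement with the reference outcome; 'judge' stands in
    for an expert rater or LLM grader supplied by the eval harness."""
    return judge(model_answer, item.reference_outcome)
```

Expert-level vs. superhuman performance would then just be a comparison of these scores against domain experts answering the same unrotted items.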
I do agree that we shouldn’t put all our focus on dangerous capabilities evals at the expense of other kinds of evals (e.g. alignment, automated AI safety R&D, even control). However, I think a key point is that the models are dangerous NOW. Alignment, safety R&D, and control are, in some sense, future problems; misuse is a present and growing danger, getting noticeably worse with every passing month. A single terrorist or terrorist org could wipe out human civilization today, killing >90% of the population, with less than $500k in funding (potentially much less if they have access to a well-equipped lab and clever excuses ready for ordering suspicious supplies). We have no sufficient defenses. This seems like an urgent and tractable problem.
Urgent, because the ceiling on uplift is very far away. Models have the potential to make things much much worse than they currently are.
Tractable, because there are relatively cheap actions that governments could take to slow this increase of risk if they believed in the risks.
For what it’s worth, I try to spend some of my time thinking about these other types of evals also. And I would recommend that those working on dangerous capabilities evals spend at least a little time and thought on the other problems.
Another aspect of the problem:
A lot of people seem to be ‘trend following’ rather than ‘trend setting’ (that is, thinking original thoughts for themselves, doing their own research, and coming to their own well-formed opinions). If those ‘trend followers’ are also not super-high-impact thinkers, maybe it’s ok if they’re just doing the obvious things?