To expand on the point of benchmark rot, as someone working on dangerous capabilities evals… For biorisk specifically, one of the key things to evaluate is whether models can correctly guess the results of unpublished research. That is: can a model come up with plausible hypotheses, accurately describe how to test those hypotheses, and make a reasonable guess at the most likely outcomes? Can it do these things at expert human level? At superhuman level?
The trouble with this is that the frontier of published research keeps moving forward, so evals like this go out of date quickly. Nevertheless, such evals can be very important in shaping the decisions of governments and corporations.
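To make the shape of this kind of eval concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than a description of how my actual evals are built: the `ResearchItem` fields, the `query_model` callable, and the `grade` function are all assumptions, and in practice the held-out `actual_outcome` would come from unpublished results the model cannot have seen in training.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ResearchItem:
    # A research question whose answer exists but has not been published.
    question: str
    # The real (unpublished) outcome, held out as the grading key.
    actual_outcome: str

def build_prompt(item: ResearchItem) -> str:
    # Ask the model for hypotheses, an experimental design, and a predicted result.
    return (
        f"Research question: {item.question}\n"
        "1. Propose the most plausible hypotheses.\n"
        "2. Describe an experiment that would test them.\n"
        "3. Predict the most likely outcome of that experiment."
    )

def score_item(item: ResearchItem,
               query_model: Callable[[str], str],
               grade: Callable[[str, str], float]) -> float:
    # `query_model` wraps whatever model API is being evaluated.
    # `grade` compares the model's predicted outcome to the held-out result
    # (e.g. expert rating or a rubric-based autograder) and returns 0.0-1.0.
    prediction = query_model(build_prompt(item))
    return grade(prediction, item.actual_outcome)

def run_eval(items: list[ResearchItem],
             query_model: Callable[[str], str],
             grade: Callable[[str, str], float]) -> float:
    # Average score over the held-out set; compare against expert-human baselines.
    scores = [score_item(it, query_model, grade) for it in items]
    return sum(scores) / len(scores)
```

The benchmark-rot issue is visible right in the data structure: once the results behind `actual_outcome` are published (or could plausibly be in training data), that item has to be retired and replaced with one drawn from still-unpublished work.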
I do agree that we shouldn’t focus on dangerous capabilities evals to the exclusion of other kinds of evals (e.g. alignment, automated AI safety R&D, even control). However, I think a key point is that the models are dangerous NOW. Alignment, safety R&D, and control are, in some sense, future problems. Misuse is a present and growing danger, getting noticeably worse with every passing month. A single terrorist or terrorist org could wipe out human civilization today, killing >90% of the population with less than $500k in funding (potentially much less if they have access to a well-equipped lab and clever excuses ready for ordering suspicious supplies). We have no adequate defenses. This seems like an urgent and tractable problem.
Urgent, because the ceiling on uplift is very far away. Models have the potential to make things much, much worse than they currently are.
Tractable, because there are relatively cheap actions governments could take to slow this increase in risk, if they believed in the risks.
For what it’s worth, I try to spend some of my time thinking about these other types of evals as well. And I would recommend that those working on dangerous capabilities evals spend at least a little time and thought on the other problems.
Another aspect of the problem
A lot of people seem to be ‘trend following’ rather than ‘trend setting’ (i.e. thinking original thoughts for themselves, doing their own research, and coming to their own well-formed opinions). If those ‘trend followers’ are also not super-high-impact thinkers, maybe it’s OK if they’re just doing the obvious things?