Google and Anthropic are doing good risk assessment for dangerous capabilities.
My guess is that the current level of evaluations at Google and Anthropic isn’t terrible, but could be massively improved.
Elicitation isn’t well specified and there isn’t an established science of it, let alone reliable projections about the state of elicitation in a few years.
We have only a tiny number of distinct tasks in these evaluation suites and no readily available mechanism for noticing that a model is very good at specific subcategories or tasks in ways that are concerning (see the sketch below for one rough picture of what such a mechanism could look like).
I had some specific minor-to-moderate issues with the Google DC evals suite. Edit: I discuss my issues with the evals suite here.
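To make the "mechanism for noticing" point concrete, here is a minimal sketch, in Python, of aggregating per-task scores by subcategory and flagging any subcategory where average performance crosses a threshold. The data, field names, and threshold are all hypothetical illustrations, not anything from the actual Google or Anthropic suites.

```python
from collections import defaultdict

# Hypothetical per-task results: each record carries a subcategory label
# and a score in [0, 1]. Real suites would have many more tasks and richer metadata.
results = [
    {"subcategory": "vuln_discovery", "score": 0.9},
    {"subcategory": "vuln_discovery", "score": 0.8},
    {"subcategory": "phishing", "score": 0.2},
]

ALERT_THRESHOLD = 0.7  # hypothetical cutoff for "concerningly good"

def flag_concerning_subcategories(results, threshold=ALERT_THRESHOLD):
    """Return subcategories whose mean score exceeds the threshold."""
    scores = defaultdict(list)
    for record in results:
        scores[record["subcategory"]].append(record["score"])
    return {
        subcat: sum(vals) / len(vals)
        for subcat, vals in scores.items()
        if sum(vals) / len(vals) > threshold
    }

print(flag_concerning_subcategories(results))
# e.g. {'vuln_discovery': 0.85}
```

Even something this simple only helps if the suite has enough tasks per subcategory for the averages to mean anything, which is part of the problem with having so few tasks.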