(I work on the Alignment Stress-Testing team at Anthropic and have been involved in the RSP update and implementation process.)
Re not believing Anthropic’s statement:
“we believe the risk of substantial under-elicitation is low”
To be more precise: there was significant under-elicitation, but the distance to the thresholds was large enough that the risk of crossing them, even with better elicitation, was low.