Standard xrisk arguments generally don't extrapolate down to systems that don't solve tasks requiring instrumental goals. I think it's reasonable to say that common LLMs don't exhibit many instrumental goals, but they also can't do long-horizon, goal-directed problem solving.
Prosaic risk work like biorisk evals often goes further and asks: if we assume the AI systems aren't themselves very capable at the task, can we still elicit dangerous behaviors from them 'in the loop'? These are legitimate and interesting questions, but they are a different thing.
As a near-limiting case, imagine an engine that flips a coin before the game begins. On heads, it plays the game as Stockfish. On tails, it plays the game as Worstfish. What is this engine's ELO?
I'm short of time to go into details, but this should help illustrate why one should be careful about treating ELO as a well-defined space rather than as a local approximation that's empirically useful for computationally-limited players.
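A quick sketch of the problem, under stated assumptions: against any opponent strong enough to essentially always beat Worstfish and weak enough to essentially always lose to Stockfish, the coin-flip engine scores about 0.5, and under the standard Elo expectation formula a 0.5 score implies a rating equal to that opponent's, whatever it happens to be. The opponent ratings below are made up for illustration, not measured.

```python
from math import log10

def expected_score(r_self: float, r_opp: float) -> float:
    """Standard Elo logistic expectation for r_self vs. r_opp."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_self) / 400))

def implied_rating(score: float, r_opp: float) -> float:
    """Invert the Elo formula: the single rating that would predict `score` vs. r_opp."""
    return r_opp - 400 * log10(1.0 / score - 1.0)

# Hypothetical opponents rated well above Worstfish and well below Stockfish.
opponents = (1200, 1800, 2400)

# The coin-flip engine: ~always wins as Stockfish (heads), ~always loses as Worstfish (tails).
observed = 0.5 * 1.0 + 0.5 * 0.0

for r_opp in opponents:
    print(f"vs {r_opp}: observed {observed:.2f} -> implied rating {implied_rating(observed, r_opp):.0f}")

# Any single rating contradicts the other matchups, e.g. taking the 1800 answer:
for r_opp in opponents:
    print(f"rating 1800 predicts {expected_score(1800, r_opp):.2f} vs {r_opp}, but we observe {observed:.2f}")
```

The first loop prints an implied rating equal to each opponent's rating (1200, 1800, 2400), and the second shows that committing to any one of those numbers mispredicts the other matchups badly; no single rating fits, which is the sense in which ELO is only a local approximation.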