Recommended readings for people interested in evals work?
Someone recently asked: “Suppose someone wants to get into evals work. Is there a good reading list to send to them?” I spent ~5 minutes putting this list together. I’d be interested to hear if people have additional suggestions or recommendations:
I would send them:
Model evaluations for extreme risks
Evaluating frontier models for dangerous capabilities
METR ARA paper
Recent AI Sandbagging paper
Anthropic’s challenges in evaluating AI systems
Apollo’s starter guide for evals
A paper I’m writing on semi-structured interviews as a good complement to formal evaluations (in-progress)
I would also encourage them to read more on the “macrostrategy” of evals. I suspect a lot of value will come from people who are able to understand the broader theory of change of evals and identify when we’re “rowing” in bad directions. Some examples here might be:
How evals might (or might not) prevent catastrophic risks from AI (a bit outdated but still relevant IMO).
Lots of the discussion around RSPs (e.g., RSPs are pauses done right, RSPs are risk management done wrong, OpenAI’s Preparedness Framework: Praise & Recommendations)
A paper I’m writing on emergency preparedness, which includes some thoughts on government’s “detection capabilities” (in-progress).
Six dimensions of operational adequacy (relevant for “what happens when the evals go off”)
Carefully bootstrapped alignment is organizationally hard (also relevant for “what happens when the evals go off”)
I’m obviously biased, but I would recommend my post on macrostrategy of evals: The case for more ambitious language model evals.
@Ryan Kidd @Lee Sharkey I suspect you’ll have useful recommendations here.