An Opinionated Evals Reading List


While you can make a lot of progress in evals through tinkering and paying little attention to the literature, we have found that a number of papers saved us many months of research effort. The Apollo Research evals team therefore compiled a list of what we consider the most important evals-related papers. We have likely missed some relevant papers, and our recommendations reflect our personal opinions.

Our favorite papers

  • Evaluating Frontier Models for Dangerous Capabilities (Phuong et al., 2024)

    • Contains detailed descriptions of multiple LM agent evals across four categories. Also explores new methodologies for estimating the probability that an agent succeeds at a task (we sketch one of the underlying ideas below).

    • We think it is the best “all-around” evals paper, i.e., the one that gives the best sense of what frontier LM agent evals look like.

    • We tested the calibration of their new methodologies in practice in Hojmark et al., 2024, and found that they are not well-calibrated (disclosure: Apollo involvement).
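    • As a rough illustration of the milestone-style estimation idea, here is a minimal sketch (ours, with made-up numbers, not the paper’s exact procedure): split a long task into milestones, estimate a success rate for each milestone from repeated runs, and combine them into an end-to-end estimate.

```python
# Minimal sketch of milestone-based success estimation (our illustration,
# not the exact procedure from Phuong et al., 2024). All numbers are made up.
import numpy as np

# Hypothetical data: for each milestone, outcomes over repeated runs
# (1 = milestone completed given the previous one was completed, 0 = failed).
milestone_outcomes = {
    "obtain_api_credentials": [1, 1, 0, 1, 1, 1, 0, 1],
    "set_up_server":          [1, 0, 1, 1, 0, 1, 1, 0],
    "deploy_copy_of_agent":   [0, 0, 1, 0, 0, 1, 0, 0],
}

# Estimate each milestone's conditional success probability from the runs.
p_milestone = {name: float(np.mean(runs)) for name, runs in milestone_outcomes.items()}

# Naive end-to-end estimate: product of the conditional milestone probabilities.
p_end_to_end = float(np.prod(list(p_milestone.values())))

print(p_milestone)
print(f"Estimated end-to-end success probability: {p_end_to_end:.3f}")
```

    • Combining milestone estimates like this bakes in strong assumptions (e.g., that per-milestone estimates transfer to full end-to-end runs), which is one intuition for why such estimates can end up poorly calibrated.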

  • Observational Scaling Laws and the Predictability of Language Model Performance (Ruan et al., 2024)

    • Shows that models’ observed benchmark performances admit a low-rank decomposition into a small number of capability factors, which can then be used to predict the performance of larger models in the same family (see the sketch below).

    • Marius: I think this is the most exciting “science of evals” paper to date. It made me more optimistic about predicting the performance of future models on individual tasks.
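    • As a rough illustration of the core idea, here is a minimal sketch (ours, with made-up scores, not the paper’s actual data or pipeline): stack observed benchmark scores into a models-by-benchmarks matrix and extract a few principal components, so that each model is summarized by a handful of latent capability scores.

```python
# Sketch of a low-rank decomposition of benchmark scores (illustrative only;
# all scores below are made up and the paper's pipeline differs in its details).
import numpy as np
from sklearn.decomposition import PCA

benchmarks = ["MMLU", "GSM8K", "HumanEval", "ARC"]
models = ["family-a-7b", "family-a-13b", "family-a-70b", "family-b-8b", "family-b-70b"]

# Rows: models, columns: hypothetical benchmark accuracies.
scores = np.array([
    [0.45, 0.15, 0.12, 0.50],
    [0.55, 0.30, 0.20, 0.60],
    [0.68, 0.55, 0.35, 0.75],
    [0.60, 0.45, 0.30, 0.70],
    [0.78, 0.80, 0.55, 0.88],
])

# Rank-2 decomposition: each model is summarized by two latent capability scores.
pca = PCA(n_components=2)
capability_scores = pca.fit_transform(scores)  # shape: (n_models, 2)

print("Explained variance ratio:", pca.explained_variance_ratio_)
for model, latent in zip(models, capability_scores):
    print(f"{model}: {latent}")

# Roughly, the paper then fits how such latent scores scale with (log-)compute
# within each model family and uses that fit to predict larger models' scores.
```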

  • The Llama 3 Herd of Models (Meta, 2024)

    • Describes the training procedure of the Llama 3.1 family in detail.

    • We think this is the most detailed description to date of how state-of-the-art LLMs are trained, and it provides helpful background knowledge for any kind of evals work.

  • Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022)

    • Shows how to use LLMs to automatically create large evals datasets, and creates 154 benchmarks on different topics. We highlight the paper because this idea has been highly influential.

    • The original paper used Claude-0.5 to generate the datasets, so the resulting data is not of very high quality. The methodology section is also written more confusingly than it needs to be.

    • For an improved methodology and pipeline for model-written evals, see Dev et al., 2024 or ARENA chapter 3.2 (disclosure: Apollo involvement). We sketch the basic generate-then-filter idea below.
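    • As a toy illustration of the generate-then-filter pattern behind model-written evals, here is a minimal sketch (ours; the prompt, model name, and filtering criteria are placeholders rather than the paper’s setup):

```python
# Toy generate-then-filter pipeline for model-written evals (illustrative sketch;
# prompt, model name, and filtering criteria are placeholders, not the paper's).
import json
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model name

GEN_PROMPT = (
    "Write one yes/no question that tests whether an AI assistant is sycophantic. "
    'Reply with JSON only, with keys "question" and "sycophantic_answer" (Yes/No).'
)

def generate_item() -> dict | None:
    """Ask the generator model for one candidate eval item."""
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": GEN_PROMPT}],
    )
    try:
        return json.loads(resp.choices[0].message.content or "")
    except json.JSONDecodeError:
        return None

def passes_filter(item: dict) -> bool:
    """Minimal structural check; the paper additionally uses model-based
    quality filtering (e.g., a preference model) on the generated items."""
    return (
        isinstance(item.get("question"), str)
        and item.get("sycophantic_answer") in ("Yes", "No")
    )

dataset = []
while len(dataset) < 10:
    item = generate_item()
    if item is not None and passes_filter(item):
        dataset.append(item)

print(json.dumps(dataset[:3], indent=2))
```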

  • Evaluating Language-Model Agents on Realistic Autonomous Tasks (Kinniment et al., 2023)

    • Introduces LM agent evals for model autonomy. It is the first paper to rigorously evaluate LM agents for risks related to loss of control and is thus worth highlighting.

    • We recommend reading the Appendix as a starting point for understanding agent-based evaluations; for orientation, a bare-bones agent loop is sketched below.
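    • Here is a bare-bones sketch of the kind of loop such evals build on (ours, not METR’s actual scaffolding): the model proposes shell commands, an executor runs them, and the outputs are fed back until the model declares it is done.

```python
# Bare-bones LM agent loop (our illustration, not METR's scaffolding).
# WARNING: real evals execute commands inside an isolated sandbox/VM, never on the host.
import subprocess
from openai import OpenAI

client = OpenAI()          # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"      # placeholder model name

SYSTEM_PROMPT = (
    "You are an agent operating in a Linux shell. Reply with exactly one shell "
    "command per turn, or with the single word DONE when the task is complete."
)

def run_agent(task: str, max_turns: int = 10) -> list[dict]:
    """Run a simple command-execution agent and return the full transcript."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": task},
    ]
    for _ in range(max_turns):
        reply = client.chat.completions.create(model=MODEL, messages=messages)
        command = (reply.choices[0].message.content or "").strip()
        messages.append({"role": "assistant", "content": command})
        if command == "DONE":
            break
        # Execute the proposed command and feed stdout/stderr back to the model.
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        messages.append({
            "role": "user",
            "content": f"stdout:\n{result.stdout}\nstderr:\n{result.stderr}",
        })
    return messages

# Example (run inside a sandbox!):
# transcript = run_agent("Count the number of files in the current directory.")
```

    • Real agent evals typically add tool definitions, resource limits, logging, and scoring on top of a loop like this.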

Other evals-related publications

LM agents

Core:

Other:

Benchmarks

Core:

Other:

Science of evals

Core:

Other:

Software

Core:

  • Inspect

    • Open source evals library designed and maintained by UK AISI and spearheaded by JJ Allaire, who intends to develop and support the framework for many years.

    • Supports a wide variety of eval types, including multiple-choice benchmarks and LM agent settings (a minimal example is sketched below).
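    • As a minimal sketch of what a multiple-choice eval looks like in Inspect (based on our reading of the docs; exact names may differ between versions):

```python
# Minimal multiple-choice eval in Inspect (inspect_ai). Based on our reading
# of the docs; API details may differ slightly between versions.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import choice
from inspect_ai.solver import multiple_choice


@task
def tiny_mc_benchmark():
    samples = [
        Sample(
            input="Which planet is known as the Red Planet?",
            choices=["Venus", "Mars", "Jupiter"],
            target="B",  # letter of the correct choice
        ),
    ]
    return Task(dataset=samples, solver=multiple_choice(), scorer=choice())

# Run with something like: inspect eval tiny_mc_benchmark.py --model openai/gpt-4o-mini
```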

  • Vivaria

    • METR’s open-source evals tool for LM agents.

    • Especially optimized for LM agent evals and the METR Task Standard.

  • Aider

    • Probably the most widely used open-source coding assistant.

    • We recommend using it to speed up your coding.

Other:

Miscellaneous

Core:

Other:

Related papers from other fields

Red teaming

Core:

Other:

Scalable oversight

Core:

Other:

Scaling laws & emergent behaviors

Core:

Other:

Science tutorials

Core:

  • Research as a Stochastic Decision Process (Steinhardt)

    • Argues that you should do experiments in the order that maximizes information gained.

    • We use this principle all the time and think it’s very important.

  • Tips for Empirical Alignment Research (Ethan Perez, 2024)

    • Detailed description of what success in empirical alignment research can look like.

    • We think it’s a great resource and aligns well with our own approach.

  • You and Your Research (Hamming, 1986)

    • A famous classic. “What are the important problems of your field? And why are you not working on them?”

Other:

LLM capabilities

Core:

Other:

LLM steering

RLHF

Core:

Other:

Supervised Finetuning/Training & Prompting

Core:

Other:

Fairness, bias, and accountability

AI Governance

Core:

Other:

Contributions

The first draft of this list was based on several reading lists that Marius Hobbhahn and Jérémy Scheurer had previously written. Marius wrote most of the final draft with detailed input from Jérémy and high-level input from Mikita Balesni, Rusheb Shah, and Alex Meinke.