List of commonly used benchmarks for LLMs

I am compiling a list of tasks and evaluations that are used to test LLMs. I intend to expand this list to include each benchmark's initial publication date, scope of questions, number of questions, direct links to the datasets, and question types (i.e., multiple choice, fill in the missing word, etc.), along with additional comments on perceived difficulty and other characteristics. The majority of automated test suites rely on multiple-choice prompts, as free-form answers to open questions are difficult to evaluate automatically.
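To illustrate why multiple-choice formats are easy to score automatically, here is a minimal sketch in Python. It is not taken from any particular benchmark harness; the function names (extract_choice, accuracy) and the rule that the answer letter must appear at the very start of the response are my own assumptions for the example.

```python
import re
from typing import List, Optional


def extract_choice(response: str, labels: str = "ABCD") -> Optional[str]:
    """Return the option letter if the response starts with one, else None.

    Assumption for this sketch: the model was told to answer with a single
    letter, so only a leading letter (optionally in parentheses) counts.
    """
    match = re.match(rf"^\(?([{labels}])\b", response.strip().upper())
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold label."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


if __name__ == "__main__":
    # Hypothetical model outputs and gold labels, for illustration only.
    outputs = ["B", "(c) Paris", "I think it is D"]
    labels = ["B", "C", "D"]
    print(accuracy(outputs, labels))
```

The third response is counted as wrong because the letter is buried in a sentence. Deciding whether such an answer should count is already a judgment call, and with fully free-form answers every response needs that kind of judgment, which is the difficulty mentioned above.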

AI2 Reasoning Challenge (ARC) 2018: https://allenai.org/data/arc

RACE: https://www.cs.cmu.edu/~glai1/data/race/

More tests are enumerated on page 27 of the paper “A Survey of Large Language Models”: https://arxiv.org/pdf/2303.18223.pdf

Additionally, the OpenAI team has used these human-oriented tests in their evaluation of GPT-4:

Uniform Bar Exam (MBE+MEE+MPT)

LSAT

SAT Evidence-Based Reading & Writing

SAT Math

Graduate Record Examination (GRE) Quantitative

Graduate Record Examination (GRE) Verbal

Graduate Record Examination (GRE) Writing

USABO Semifinal Exam 2020

USNCO Local Section Exam 2022

Medical Knowledge Self-Assessment Program

Codeforces Rating

AP Art History

AP Biology

AP Calculus BC

AP Chemistry

AP English Language and Composition

AP English Literature and Composition

AP Environmental Science

AP Macroeconomics

AP Microeconomics

AP Physics 2

AP Psychology

AP Statistics

AP US Government

AP US History

AP World History

AMC 10

AMC 12

Introductory Sommelier (theory knowledge)

Certified Sommelier (theory knowledge)

Advanced Sommelier (theory knowledge)

Leetcode (easy)

Leetcode (medium)

Leetcode (hard)