List of commonly used benchmarks for LLMs
I am compiling a list of tasks and evaluations used to test LLMs. I intend to expand this list to include each benchmark's initial publication date, scope of questions, number of questions, direct links to the datasets, and question types (e.g., multiple choice or fill in the missing word), along with comments on perceived difficulty and other characteristics. Most automated test suites rely on multiple-choice prompts, because free-form answers to open questions are difficult to evaluate automatically.
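To illustrate why multiple-choice prompts are easy to grade automatically, here is a minimal sketch of letter-based scoring. It assumes the model is prompted to begin its reply with a single answer letter; the function names `extract_choice` and `multiple_choice_accuracy` are hypothetical and not taken from any of the benchmarks listed here, whose official harnesses may score differently.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str, letters: str = "ABCD") -> Optional[str]:
    """Return the answer letter if the reply starts with one, e.g. "B" or "(C) ...".

    Heuristic only: a reply that begins with the English article "A" would
    also be read as choice A, and replies that bury the letter mid-sentence
    are treated as unanswered.
    """
    match = re.match(rf"\s*\(?([{letters}])\b", response)
    return match.group(1) if match else None


def multiple_choice_accuracy(responses: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of replies whose extracted letter equals the gold letter."""
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


# Hypothetical model replies and gold labels, for illustration only.
replies = ["B", "(C) because the passage says so", "I think D"]
gold_labels = ["B", "C", "D"]
accuracy = multiple_choice_accuracy(replies, gold_labels)
```

Grading reduces to comparing one extracted letter against the gold label. A free-form answer has no such fixed target: a correct paraphrase fails an exact string comparison, so it usually needs human or model-based judging.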
TruthfulQA: https://github.com/sylinrl/TruthfulQA
MMLU: https://github.com/hendrycks/test
HellaSwag: https://github.com/rowanz/hellaswag/tree/master/data
WinoGrande: https://github.com/allenai/winogrande
HumanEval: https://github.com/openai/human-eval
DROP: https://arxiv.org/abs/1903.00161
GSM8K: https://github.com/openai/grade-school-math
LogiQA: https://github.com/lgw863/LogiQA-dataset
CoQA: https://stanfordnlp.github.io/coqa/
LAMBADA: https://zenodo.org/record/2630551#.X4Xzn5NKjUI
ReClor: https://whyu.me/reclor/
BoolQ: https://arxiv.org/abs/1905.10044
PIQA: https://yonatanbisk.com/piqa/
SIQA: https://arxiv.org/abs/1904.09728
AI2 Reasoning Challenge (ARC) 2018: https://allenai.org/data/arc
RACE: https://www.cs.cmu.edu/~glai1/data/race/
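The fields planned in the introduction (publication date, scope, number of questions, dataset link, question type, comments) could be tracked with a small record type. This is only a sketch: the class name `BenchmarkEntry` and its field names are assumptions, and every field not yet collected is deliberately left empty rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkEntry:
    name: str
    url: str
    published: Optional[str] = None       # initial publication date, not yet collected
    scope: Optional[str] = None           # subject areas the questions cover
    num_questions: Optional[int] = None
    question_type: Optional[str] = None   # e.g. "multiple choice"
    comments: str = ""                    # perceived difficulty and other notes


# Two entries from the list above; only name and link are known so far.
entries = [
    BenchmarkEntry("TruthfulQA", "https://github.com/sylinrl/TruthfulQA"),
    BenchmarkEntry("MMLU", "https://github.com/hendrycks/test"),
]


def missing_fields(entry: BenchmarkEntry) -> List[str]:
    """Names of the fields that still need to be filled in for this entry."""
    return [field for field, value in vars(entry).items() if value is None]
```

Calling `missing_fields` on each entry shows which columns still need to be researched.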
More tests are enumerated on page 27 of the paper “A Survey of Large Language Models”: https://arxiv.org/pdf/2303.18223.pdf
Additionally, the OpenAI team used these human-oriented exams in its evaluation of GPT-4:
Uniform Bar Exam (MBE+MEE+MPT)
LSAT
SAT Evidence-Based Reading & Writing
SAT Math
Graduate Record Examination (GRE) Quantitative
Graduate Record Examination (GRE) Verbal
Graduate Record Examination (GRE) Writing
USABO Semifinal Exam 2020
USNCO Local Section Exam 2022
Medical Knowledge Self-Assessment Program
Codeforces Rating
AP Art History
AP Biology
AP Calculus BC
AP Chemistry
AP English Language and Composition
AP English Literature and Composition
AP Environmental Science
AP Macroeconomics
AP Microeconomics
AP Physics 2
AP Psychology
AP Statistics
AP US Government
AP US History
AP World History
AMC 10
AMC 12
Introductory Sommelier (theory knowledge)
Certified Sommelier (theory knowledge)
Advanced Sommelier (theory knowledge)
Leetcode (easy)
Leetcode (medium)
Leetcode (hard)