Minerva
Link post
Google Research’s new AI tackles natural language math problems and handily outperforms the SOTA[1]. It is a pre-trained PaLM[2] finetuned on a maths dataset (which uses LaTeX) composed of maths webpages and arXiv papers (38.5B tokens). Three model sizes were trained: 8B, 62B, and 540B parameters.
When generating answers, Minerva is given a fixed prompt of four example questions, each with a correct chain of reasoning and a final answer in a consistent format. The actual question is then appended. Minerva samples a chain of reasoning and a corresponding answer a number of times, and the most common answer is chosen. Minerva is graded only on the final answer.
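To make this concrete, here is a minimal sketch (my own illustration, not the paper's actual prompt or code) of four-shot prompting with repeated sampling and majority voting. `sample_model` stands in for a hypothetical function that returns one sampled completion, and the prompt shown is an assumed format rather than the exact one used for Minerva.

```python
from collections import Counter
import re

# Assumed four-shot prompt format: worked examples, each ending with a
# final answer written in a consistent, easy-to-parse way.
FEW_SHOT_PROMPT = """Problem: What is 2 + 2?
Solution: Adding the two numbers gives 2 + 2 = 4.
Final Answer: The final answer is 4.

(... three more worked examples in the same format ...)

"""

def extract_final_answer(completion: str) -> str | None:
    """Pull out the text after 'Final Answer:' so sampled answers can be compared."""
    match = re.search(r"Final Answer: The final answer is (.+?)\.", completion)
    return match.group(1).strip() if match else None

def answer_question(question: str, sample_model, k: int = 64) -> str | None:
    """Sample k chains of reasoning and return the most common final answer."""
    prompt = FEW_SHOT_PROMPT + f"Problem: {question}\nSolution:"
    answers = []
    for _ in range(k):
        completion = sample_model(prompt)          # one sampled chain of reasoning
        answer = extract_final_answer(completion)  # only the final answer is graded
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```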
This voting algorithm is called maj1@k. It saturates faster than pass@k (generate k answers; if any one of them is right, the question is graded correct) but doesn’t perform as well for large k. This is quite reasonable: majority voting will keep choosing the most common answer, with the estimate’s error shrinking as k grows, whereas pass@k simply gives the model more tries for large k.
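As a rough illustration of the difference (again my own sketch, not the paper's evaluation code), here is how the two metrics would grade a single question given k sampled answers:

```python
from collections import Counter

def maj1_at_k(sampled_answers: list[str], reference: str) -> bool:
    """maj1@k: correct only if the single most common sampled answer matches the reference."""
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return most_common_answer == reference

def pass_at_k(sampled_answers: list[str], reference: str) -> bool:
    """pass@k: correct if any of the k sampled answers matches the reference."""
    return reference in sampled_answers

# Five samples where the right answer ("42") appears once but is not the majority:
# pass@k gives credit, maj1@k does not. As k grows, pass@k keeps picking up such cases.
samples = ["41", "41", "42", "40", "41"]
print(maj1_at_k(samples, "42"))  # False
print(pass_at_k(samples, "42"))  # True
```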
Datasets
The datasets used are:
MATH: High school math competition-level problems.
MMLU-STEM: A subset of the Massive Multitask Language Understanding benchmark focused on STEM, covering topics such as engineering, chemistry, math, and physics at high school and college level.
GSM8k: Grade school level math problems involving basic arithmetic operations that should all be solvable by a talented middle school student.
The datasets contain questions of varying difficulty. Predictably, the model performed worse on harder questions, and the false positive rate increased roughly linearly with question difficulty on MATH.
Results
Now time for a surprise quiz! For the purposes of this quiz, assume we’re talking about the most accurate Minerva model (540B parameters, using maj1@k sampling with k=64 for MATH and k=16 for MMLU), and we’ll be averaging over results on subtopics[3]. Note that the previous SOTA is OpenAI’s davinci-002, which obtained absolute (averaged) scores of about 20% and 49%.