Minerva
Link post
Google Research’s new AI tackles natural language math problems and handily outperforms the SOTA[1]. It is a pre-trained PaLM[2] finetuned on a maths dataset (which uses LaTeX) composed of maths webpages and arXiv papers (38.5B tokens). Three model sizes were trained: 8B, 62B, and 540B parameters.
When generating answers, Minerva is given a fixed prompt of four example questions, each with a correct chain of reasoning and a final answer in a consistent format. The actual question is then appended. Minerva samples a chain of reasoning and a corresponding answer a number of times, and the most common answer is chosen. Minerva is graded only on the final answer.
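To make this concrete, here is a minimal sketch (my own illustration, not the paper's actual prompt or code) of four-shot prompting with repeated sampling and majority voting. `sample_model` stands in for a hypothetical function that returns one sampled completion, and the prompt shown is an assumed format rather than the exact one used for Minerva.

```python
from collections import Counter
import re

# Assumed four-shot prompt format: worked examples, each ending with a
# final answer written in a consistent, easy-to-parse way.
FEW_SHOT_PROMPT = """Problem: What is 2 + 2?
Solution: Adding the two numbers gives 2 + 2 = 4.
Final Answer: The final answer is 4.

(... three more worked examples in the same format ...)

"""

def extract_final_answer(completion: str) -> str | None:
    """Pull out the text after 'Final Answer:' so sampled answers can be compared."""
    match = re.search(r"Final Answer: The final answer is (.+?)\.", completion)
    return match.group(1).strip() if match else None

def answer_question(question: str, sample_model, k: int = 64) -> str | None:
    """Sample k chains of reasoning and return the most common final answer."""
    prompt = FEW_SHOT_PROMPT + f"Problem: {question}\nSolution:"
    answers = []
    for _ in range(k):
        completion = sample_model(prompt)          # one sampled chain of reasoning
        answer = extract_final_answer(completion)  # only the final answer is graded
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```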
This voting algorithm is called maj1@k. It saturates faster than pass@k (generate k answers; if any one of them is right, the question is graded correct) but doesn’t perform as well for large k. This is quite reasonable: majority voting will keep choosing the most common answer, with the estimate’s error shrinking as k grows, whereas pass@k simply gives the model more tries for large k.
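As a rough illustration of the difference (again my own sketch, not the paper's evaluation code), here is how the two metrics would grade a single question given k sampled answers:

```python
from collections import Counter

def maj1_at_k(sampled_answers: list[str], reference: str) -> bool:
    """maj1@k: correct only if the single most common sampled answer matches the reference."""
    most_common_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return most_common_answer == reference

def pass_at_k(sampled_answers: list[str], reference: str) -> bool:
    """pass@k: correct if any of the k sampled answers matches the reference."""
    return reference in sampled_answers

# Five samples where the right answer ("42") appears once but is not the majority:
# pass@k gives credit, maj1@k does not. As k grows, pass@k keeps picking up such cases.
samples = ["41", "41", "42", "40", "41"]
print(maj1_at_k(samples, "42"))  # False
print(pass_at_k(samples, "42"))  # True
```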
Datasets
The datasets used are:
MATH: High school math competition-level problems.
MMLU-STEM: A subset of the Massive Multitask Language Understanding benchmark focused on STEM, covering topics such as engineering, chemistry, math, and physics at high school and college level.
GSM8k: Grade school level math problems involving basic arithmetic operations that should all be solvable by a talented middle school student.
The datasets contain questions of varying difficulty. Predictably, the model performed worse on harder questions, and the false positive rate increased roughly linearly with question difficulty on MATH.
Results
Now time for a surprise quiz! For the purposes of this quiz, assume we’re talking about the most accurate Minerva model (540B parameters, using maj1@k sampling with k=64 for MATH and k=16 for MMLU), and we’ll be averaging over results on subtopics[3]. Note that the previous SOTA is OpenAI’s davinci-002, which obtained absolute (averaged) scores of about 20% and 49%.