I have an idea for testing this approach, before getting authors to write tens of thousands of pages of annotated dungeon runs.
It’s hard to generate explanations of prose, but easy for a computer to generate explanations of particular subsets of math. For example, WolframAlpha can explain its reasoning for finding the derivative of a polynomial (click “step by step solution”, then “show all steps”):
[Image: WolframAlpha step-by-step derivative example]
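To make “programmatically generated explanation” concrete, here’s a minimal sketch of the polynomial case in Python; the function name and the wording of the steps are invented for illustration, not taken from any particular tool:

```python
def polynomial_derivative_steps(coeffs):
    """Explain d/dx of c0 + c1*x + c2*x^2 + ... term by term via the power rule.

    `coeffs` is [c0, c1, c2, ...]; returns (explanation text, derivative coefficients).
    """
    steps, deriv = [], []
    for power, c in enumerate(coeffs):
        if power == 0:
            steps.append(f"The derivative of the constant term {c} is 0.")
        else:
            steps.append(
                f"Power rule on {c}*x^{power}: multiply by the exponent and lower it by one, "
                f"giving {power}*{c}*x^{power - 1} = {power * c}*x^{power - 1}."
            )
            deriv.append(power * c)
    return "\n".join(steps), deriv

# d/dx (3 + 2x + 5x^2): prints the three steps; the derivative coefficients are [2, 10], i.e. 2 + 10x.
explanation, derivative = polynomial_derivative_steps([3, 2, 5])
print(explanation)
```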
There’s a wide variety of math problems which we can programmatically solve, and can therefore programmatically generate explanations for:
Arithmetic, like step-by-step long division
Derivatives over a large set of operations (but not integrals; those are harder)
Subsets of logic
Subsets of integer programming
Some varieties of logic puzzles, like “knights and knaves” and “Alice, Beth, and Cara live in houses 1, 2, and 3, and have favorite colors Red, Green, and Blue non-respectively; here are some clues to figure out which is which”.
Simple algebra, like multiplying polynomials
(Actually, most of these are probably too hard to learn. Should focus on the really simple ones, like long division; a rough generation sketch follows below.)
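For that simplest case, here’s a rough sketch of what a long-division generator might look like; the example format and field names (problem / explanation / answer) are my own invention:

```python
import random

def long_division_steps(dividend, divisor):
    """Return (step-by-step explanation, quotient, remainder) for dividend divided by divisor."""
    steps, quotient_digits, remainder = [], [], 0
    for digit in str(dividend):
        current = remainder * 10 + int(digit)   # bring down the next digit
        q = current // divisor                  # how many times the divisor fits
        remainder = current - q * divisor
        quotient_digits.append(str(q))
        steps.append(f"Bring down {digit} to get {current}; {divisor} goes in {q} time(s); "
                     f"{current} - {q}*{divisor} = {remainder}.")
    return "\n".join(steps), int("".join(quotient_digits)), remainder

def make_example(rng):
    """One training example: a problem, its explanation, and its final answer."""
    dividend, divisor = rng.randrange(10_000, 1_000_000), rng.randrange(2, 100)
    explanation, quotient, remainder = long_division_steps(dividend, divisor)
    return {"problem": f"What is {dividend} divided by {divisor}?",
            "explanation": explanation,
            "answer": f"{quotient} remainder {remainder}"}

# Generate as many examples as we like, deterministically.
rng = random.Random(0)
dataset = [make_example(rng) for _ in range(100_000)]
```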
The idea is to:
Programmatically generate a large quantity of a small variety of math problems with explanations; then
Train one transformer on just the problem and final answer; and
Train another transformer on the problem, explanation, and final answer. (A sketch of the two training formats follows below.)
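Concretely, the two training sets could just be two serializations of the same generated examples; a minimal sketch, reusing the hypothetical problem/explanation/answer fields from the generator above:

```python
def format_without_explanation(ex):
    # The first transformer sees only the problem and the final answer.
    return f"Problem: {ex['problem']}\nAnswer: {ex['answer']}"

def format_with_explanation(ex):
    # The second transformer sees the explanation between the problem and the
    # answer, so it is trained to "show its work" before committing to an answer.
    return (f"Problem: {ex['problem']}\n"
            f"Explanation:\n{ex['explanation']}\n"
            f"Answer: {ex['answer']}")
```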
This is a very different domain than English prose, so it won’t tell you anything definitive about that more important domain. But it’s easier to do, and it shouldn’t carry any risk of advancing AI capabilities, since the training set is (by definition) something we can already solve more accurately by other means.
I imagine you could learn a few things about how the explanations influence the AI:
You can see whether the explanation helps teach the AI, by checking whether the second transformer outperforms the first.
You can see whether the AI actually “uses” the explanation by looking at the pattern of mistakes. If the AI frequently bungles the explanation while still writing down the correct final answer, it must be generating the explanation and answer separately. This would be a bad sign for “visible thought” alignment. (See the sketch after this list.)
You can see whether the AI naturally “hides” mistakes in its reasoning. I wouldn’t be surprised to frequently see a chain of reasoning “A → B → C → D → E → F” where A, B, E, and F are right and C and D are wrong, since the beginning and end of a proof are often the easiest parts to check; students do this sometimes, for example.
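For the “does it use the explanation?” question, the analysis could be as simple as a contingency table over the model’s samples. A sketch, where the two checker functions are hypothetical (say, re-running the long-division steps and exact-matching the final answer):

```python
from collections import Counter

def mistake_pattern(samples, explanation_ok, answer_ok):
    """Tally (explanation correct?, answer correct?) across generated samples.

    `samples` holds the model's parsed outputs plus the ground truth;
    `explanation_ok` and `answer_ok` are hypothetical checkers, e.g. re-running
    the long-division steps and exact-matching the final answer.
    """
    counts = Counter()
    for s in samples:
        counts[(explanation_ok(s), answer_ok(s))] += 1
    return counts

# A large count at (False, True), i.e. wrong steps but a correct final answer,
# would be the bad sign described above: the model writes the answer without
# relying on its visible reasoning.
```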
Relevant: From OpenAI’s “Training Verifiers to Solve Math Word Problems”: “We also note that it is important to allow the model to generate the full natural language solution before outputting a final answer. If we instead finetune a 6B model to directly output the final answer without any intermediate steps, performance drops drastically from 20.6% to 5.2%.” Also the “exploration” linked in the post, as well as my own little exploration restricted to modulo operations on many-digit numbers (via step-by-step long division!), on which LMs do very poorly without generating intermediate steps. (But see also Hendrycks et al.: “We also experiment with using step-by-step solutions. We find that having models generate their own step-by-step solutions before producing an answer actually degrades accuracy. We qualitatively assess these generated solutions and find that while many steps remain illogical, they are often related to the question. Finally, we show that step-by-step solutions can still provide benefits today. We find that providing partial ground truth step-by-step solutions can improve performance, and that providing models with step-by-step solutions at training time also increases accuracy.”)