Yeah, by ‘scaffolding’ I’m imagining something significantly more than this. Like, feedback that is conditional on the responses given, at minimum.
Something like:
“Looks like you generated only one hypothesis. Before you continue, try generating multiple hypotheses that could explain this.”
“Looks like you just found evidence that disproves hypothesis 1. Can you now disprove hypothesis 2?”
“Looks like you’ve disproven all the hypotheses you’ve come up with so far. Time to brainstorm more!”
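Concretely, the conditional logic I have in mind might look something like the minimal sketch below (purely illustrative; every name in it is made up, and real scaffolding would have to parse the hypotheses and the disproving evidence out of the model’s free-text responses):

```python
# Illustrative sketch of response-conditional feedback for a scaffold.
# All names here are hypothetical; real scaffolding would need to parse
# hypotheses and evidence out of the model's free-text responses.
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    text: str
    disproven: bool = False


@dataclass
class ScaffoldState:
    hypotheses: list = field(default_factory=list)


def conditional_feedback(state, newly_disproven):
    """Return a feedback message keyed to the model's latest response, or None."""
    if len(state.hypotheses) <= 1:
        return ("Looks like you generated only one hypothesis. Before you continue, "
                "try generating multiple hypotheses that could explain this.")
    if all(h.disproven for h in state.hypotheses):
        return ("Looks like you've disproven all the hypotheses you've come up with "
                "so far. Time to brainstorm more!")
    if newly_disproven:
        # Steer the model toward the next hypothesis that is still standing.
        next_idx = next(i for i, h in enumerate(state.hypotheses) if not h.disproven)
        return (f"Looks like you just found evidence that disproves hypothesis "
                f"{newly_disproven[0] + 1}. Can you now disprove hypothesis {next_idx + 1}?")
    return None  # No intervention needed this turn.


if __name__ == "__main__":
    state = ScaffoldState(hypotheses=[
        Hypothesis("explanation 1", disproven=True),
        Hypothesis("explanation 2"),
        Hypothesis("explanation 3"),
    ])
    print(conditional_feedback(state, newly_disproven=[0]))
```

The point is just that the feedback is keyed to the current state of the model’s hypothesis set rather than being a fixed, response-independent prompt.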
Perhaps include some text in the first prompt like:
T. C. Chamberlin’s “Method of Multiple Working Hypotheses”: An encapsulation for modern students
L. Bruce Railsback
Department of Geology, University of Georgia, Athens, Georgia 30602-2501 USA
Introduction
Scientific study designed to increase our knowledge of natural phenomena can follow at least three different intellectual methods. These can be called the method of the ruling theory, the method of the working hypothesis, and the method of multiple working hypotheses. The first two are the most popular but they can, and often do, lead to ineffective research that overlooks relevant data. Instead, the method of multiple working hypotheses offers a more effective way of organizing one’s research.
Ruling Theories and Working Hypotheses
Our desire to reach an interpretation or explanation commonly leads us to a tentative interpretation that is based on relatively hasty examination of a single example or case. Our tentative explanation, as such, is not a threat to objectivity, but if we then begin to trust it without further testing, we can be blinded to other possibilities that we ignored at first glance. Our premature explanation can become a tentative theory and then a ruling theory, and our research becomes focused on proving that ruling theory. The result is a blindness to evidence that disproves the ruling theory or supports an alternate explanation. Only if the original tentative hypothesis was by chance correct does our research lead to any meaningful contribution to knowledge.
Seemingly less insidious is the working hypothesis. The working hypothesis, we are told, is a hypothesis to be tested, not in order to prove the hypothesis, but as a stimulus for study and fact-finding. Nonetheless, the single working hypothesis can imperceptibly degenerate into a ruling theory, and our desire to prove the working hypothesis, despite evidence to the contrary, can become as strong as the desire to prove the ruling theory.
Multiple Working Hypotheses
The method of multiple working hypotheses involves the development, prior to our research, of several hypotheses that might explain the phenomenon we want to study. Many of these hypotheses will be contradictory, so that some, if not all, will prove to be false. However, the development of multiple hypotheses prior to the research lets us avoid the trap of the ruling hypothesis and thus makes it more likely that our research will lead to meaningful results. We open-mindedly envision all the possible explanations of the phenomenon to be studied, including the possibility that none of the explanations are correct (“none of the above”) and the possibility that some new explanation may emerge.
The method of multiple working hypotheses has several other beneficial effects on one’s research. Careful study often shows that a phenomenon is the result of several causes, not just one, and the method of multiple working hypotheses obviously makes it more likely that we will see the interaction of the several causes. The method also promotes much greater thoroughness than research directed toward one hypothesis, leading to lines of inquiry that we might otherwise overlook, and thus to evidence and insights that single-minded research might never have encountered. Thirdly, the method makes us much more likely to see the imperfections in our knowledge and thus to avoid the pitfall of accepting weak or flawed evidence for one hypothesis when another provides a more elegant solution.
Possible Drawbacks of the Method
The method of multiple working hypotheses does have drawbacks. One is that it is impossible to express multiple hypotheses simultaneously, and thus there is a natural tendency to let one take primacy. Keeping a written, not mental, list of our multiple hypotheses is often a necessary solution to that problem.
Another problem is that an open mind may develop hypotheses that are so difficult to test that evaluating them is nearly impossible. An example might be where three of our hypotheses are testable by conventional field work, but a fourth requires drilling of a deep borehole beyond our economic resources. This fourth hypothesis need not paralyze our research, but it should provide a reminder that none of the first three need be true.
A third possible problem is that of vacillation or indecision as we balance the evidence for various hypotheses. Such vacillation may be bad for the researcher, but such vacillation is preferable to the premature rush to a false conclusion.
An Example
The field discovery of a breccia provides an excellent example of the application of the method of multiple working hypotheses. A breccia may form in many ways: by deposition as talus, by collapse after dissolution of underlying evaporites or other soluble rocks, by faulting, by bolide impact, or by other means. Each of the possibilities can be supported by various field evidence, for which we could look if we were evaluating all these hypotheses. However, if we chose just one hypothesis, we might ignore other evidence more clearly supportive of a different hypothesis. For example, if we hypothesized that our breccia was the result of cataclasis during faulting, we might find that the breccia occurred along a fault. We would then accept our single hypothesis and quit looking for additional information. However, if we were using multiple working hypotheses and looked for evidence supporting or disproving all our hypotheses, we might also notice that the breccia was localized in a circular pattern along just one part of the fault. Further examination might show that it was accompanied by shatter cones. Armed with this additional information, we would be more inclined to an interpretation involving an impact that was by chance coincident with a fault. By looking for evidence supportive of a variety of hypotheses, we would have avoided an incorrect interpretation based on coincidence.
Summary
In using the method of multiple working hypotheses, we try to open-mindedly envision and list all the possible hypotheses that could account for the phenomenon to be studied. This induces greater care in ascertaining the facts and greater discrimination and caution in drawing conclusions. Although our human tendencies lead us toward the method of the ruling theory, the method of multiple working hypotheses offers the best chance of open-minded research that avoids false conclusions.
Got it.

Something I’m wrestling with on this project is the balance between testing the models’ ability to do science (which I want to do) and finding ways to make them better at doing science (which I basically don’t want to do and especially don’t want to publish). Doing a lot of iteration on improving scaffolding feels to me like it starts to tip over into the latter (whereas doing bog-standard few-shotting or fine-tuning doesn’t).
To be clear, I don’t have strong reason to expect that we’d find approaches that are significant boosts to what’s already out there. But it could happen, and I’m trying to be cautious about that, in the interest of not further accelerating capabilities improvements.
For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive AI results on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on it.
Examples:
The exact scaffolding used by Sakana AI did not advance AGI capabilities nearly as much as the common knowledge it created that LLMs can somewhat do end-to-end science.
No amount of scaffolding that the ARC-AGI or FrontierMath teams could build would have as much impact on AGI capabilities as the benchmarks themselves. Those benchmark results basically validated that the direction OpenAI is taking is broadly correct, and I suspect many people who weren’t fully sold on test-time compute will now change strategies as a result.
Hard benchmarks of meaningful tasks serve as excellent metrics to measure progress, which is great for capabilities research. Of course, they are also very useful for making decisions that need to be informed by an accurate tracking or forecasting of capabilities.
Whether making hard, meaningful benchmarks such as FrontierMath, ARC-AGI, and LLM science evals is net negative or positive is unclear to me (a load-bearing question is whether the big AGI labs already have internal benchmarks as good as these that they can use instead). I do think, however, that you’d have to be extraordinarily good at designing scaffolding (and fine-tuning and the like), and even then spend far too much effort on it, for the scaffolding itself to do significant harm relative to the benchmark it was designed for.
For benchmarks that measure markers of AI progress, I strongly suspect that publishing the benchmark and/or positive AI results on it pushes capabilities much more than publishing simple scaffolding + fine-tuning solutions that do well on it.
You may be right. That said, I’m pretty skeptical of fully general arguments against testing what LLMs are capable of; without understanding what their capabilities are, we can’t know what safety measures are needed or whether those measures are succeeding.
For what it’s worth, though, I have no particular plans to publish an official benchmark or eval, although if a member of my team is excited to work on that I’ll support it.