METR (formerly ARC Evals) included results on base models in their recent work “Measuring the impact of post-training enhancements” (where “post-training enhancements” means elicitation). They found that GPT-4-base performed poorly with their scaffold and prompting.
I believe the prompting they used included a large number of few-shot examples (perhaps 10?), so it should be a vaguely reasonable setup for base models. (Though I do expect that elicitation which is more specialized to base models would work better.)
I predict that base models will consistently do worse on tasks that labs care about (software engineering, agency, math) than models which have gone through post-training, particularly models which have gone through post-training aimed just at improving capabilities and at improving the extent to which the model follows instructions (instruction tuning).
My overall sense is that there is plausibly a lot of low-hanging fruit in elicitation, but I’m pretty skeptical that base models are a very promising direction.
Thank you! I’d forgotten about that.