This work has been done in the context of SaferAI’s work on risk assessment. Equal contribution by Eli and Joel. I’m sharing this writeup in the form of a Google Doc and reproducing the summary below.
Disclaimer: this writeup is context for upcoming experiments, not complete work. As such it contains a lot of (not always well-justified) guess-work and untidy conceptual choices. We are publishing now despite this to get feedback.
If you are interested in this work — perhaps as a future collaborator or funder, or because this work could provide helpful input into e.g. risk assessments or RSPs — please get in touch with us at joel@qallys.com and/or simeon@safer-ai.org.
Summary
A recent report documented how the performance of AI models can be improved after training, via post-training enhancements (PTEs) such as external tools, scaffolding, and fine-tuning. The gain from a PTE is measured in compute-equivalent gains (CEG): the multiplier on training compute that a base model alone would need in order to match the performance of the model combined with the PTE.
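As a toy illustration of the CEG definition (our own sketch, not taken from the report): if benchmark score is assumed to follow a power law in training compute, score = a · C^b, then the compute needed to reach a given score can be backed out by inverting the fit, and the CEG of a PTE is the ratio of the two implied compute values. All parameter values below are made up.

```python
def ceg_from_power_law(score_base, score_with_pte, a, b):
    """Compute-equivalent gain under an assumed power-law scaling fit.

    Assumes score = a * C**b, so the compute needed to reach a given
    score is C = (score / a) ** (1 / b).  CEG is the compute implied by
    the PTE-enhanced score divided by the compute implied by the base
    score.  Purely illustrative; real scaling fits are noisier and
    benchmark-specific.
    """
    c_base = (score_base / a) ** (1 / b)
    c_equiv = (score_with_pte / a) ** (1 / b)
    return c_equiv / c_base

# Made-up numbers: a PTE lifts a benchmark score from 0.50 to 0.60
# under a hypothetical fit score = 0.05 * C**0.1.
print(round(ceg_from_power_law(0.50, 0.60, 0.05, 0.1), 1))  # ~6.2x
```

Note how a small score gain translates into a large compute multiplier when the scaling exponent is shallow; this sensitivity is one reason CEG estimates are so uncertain.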
We are interested in understanding the contribution that PTEs make to AI system capabilities over time.
This question in turn is motivated by SaferAI’s work on quantitative risk assessments of frontier models. In particular, any risk assessment of open-sourcing models, or of having closed-source models stolen or leaked, should take PTEs into account: a system built on top of a given base model will gain capabilities over time as new PTEs are added to it.
We extend a recent analysis of PTEs in order to understand the trend in CEG over time, arriving at very rough estimates for the rate of improvement of PTEs. Our primary takeaways are that current data is insufficient and experiments are needed to better forecast the effects of PTEs, as described below.
There are serious limitations in our preliminary analysis, including: problems with the CEG metric, many uninformed parameter estimates, and reliance on an ill-defined “average task”.
High-priority future work includes running experiments to get more evidence on important uncertainties for our forecasts of capability gains due to PTEs. In particular, we think it will be important to understand how well different PTEs combine, as well as to directly study performance on benchmarks relevant to dangerous capabilities rather than relying on the CEG and average task abstractions.
In this write-up, we will:
Yes, it’s a great topic. The aspect that seems to be missing from “AI capabilities can be significantly improved without expensive retraining”, https://arxiv.org/abs/2312.07413, is that post-training is particularly fertile ground for rapid-turnaround self-modification and recursive self-improvement, since post-training tends to be lightweight and usually does not involve the delay of training a novel large model.
Some recent capability work in that direction includes, for example:
“Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation”, https://arxiv.org/abs/2310.02304
“Language Agents as Optimizable Graphs”, https://arxiv.org/abs/2402.16823
People who are specifically concerned with rapid foom risks might want to focus on this aspect of the situation. These self-improvement methods currently saturate in a reasonably safe zone, but they are getting stronger both due to novel research and due to improvements in the underlying LLMs they tend to rely upon.
An important and neglected topic!
Also, a challengingly complicated topic. Been thinking a lot about this myself recently, in looking at possible interactions between general models and use-case-specific software or models. For instance, in biology, if an LLM agent can query an API for a tool like AlphaFold, or search records produced by such a tool, and then add the results of the query to its context before answering a user’s question… The result is a much more powerful and easy-to-use system than either the LLM or the narrow tool alone.
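The LLM-plus-narrow-tool pattern described above can be sketched in a few lines. Everything here is hypothetical: `llm` stands in for any prompt-to-text model call and `structure_db` for a narrow tool such as an AlphaFold-derived structure lookup; neither is a real API.

```python
def answer_with_tool(question, llm, structure_db):
    """Minimal sketch of the tool-augmented-agent pattern.

    `llm` is any callable taking a prompt string and returning text;
    `structure_db` stands in for a narrow tool (e.g. a structure
    database built with a tool like AlphaFold).  Both interfaces are
    hypothetical placeholders, not real APIs.
    """
    # 1. Ask the general model what to look up.
    query = llm(f"What lookup would help answer: {question}")
    # 2. Call the narrow tool with that query.
    tool_result = structure_db(query)
    # 3. Add the tool result to context and answer the user.
    return llm(f"Context from tool: {tool_result}\nQuestion: {question}")
```

The point of the sketch is that the capability gain comes from composition: neither callable changes, but routing the narrow tool's output back into the general model's context yields answers neither component could produce alone.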