Quick take: on the margin, a lot more research effort should probably go into producing benchmarks/datasets/evaluation methods for safety (rather than into directly doing object-level safety research).
Some past examples I find valuable: for unlearning, WMDP and Eight Methods to Evaluate Robust Unlearning in LLMs; for mech interp, various proxies for SAE performance (e.g. from Scaling and evaluating sparse autoencoders), as well as benchmarks like FIND: A Function Description Benchmark for Evaluating Interpretability Methods. Prizes and RFPs seem like a potentially scalable way to produce more of these (e.g. https://www.mlsafety.org/safebench), and I think they could be particularly useful on short timelines.
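For concreteness, here is a minimal sketch (mine, not code from the cited papers) of the kind of cheap proxy metrics used to evaluate SAEs, such as reconstruction error, average L0 (number of active features), and fraction of variance unexplained; the `encode`/`decode` callables are hypothetical stand-ins for a trained SAE's encoder and decoder.

```python
# Minimal sketch of cheap SAE evaluation proxies; `encode`/`decode` are
# hypothetical stand-ins for a trained sparse autoencoder's two halves.
import torch

def sae_proxy_metrics(acts: torch.Tensor, encode, decode) -> dict:
    """acts: [batch, d_model] activations sampled from the base model."""
    feats = encode(acts)                          # [batch, n_features], mostly zeros
    recon = decode(feats)                         # [batch, d_model]
    mse = (recon - acts).pow(2).mean()            # reconstruction error
    l0 = (feats != 0).float().sum(dim=-1).mean()  # avg number of active features
    var = (acts - acts.mean(dim=0)).pow(2).mean() # activation variance
    return {"mse": mse.item(), "l0": l0.item(), "fvu": (mse / var).item()}
```

None of these proxies captures 'interpretability' on its own, which is part of why having benchmarks like FIND on top of them seems valuable.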
This prioritization seems right even differentially vs. doing the full stack of AI safety work, because I expect a lot of such work could soon be done by automated safety researchers, given the right benchmarks/datasets/evaluation methods. Good benchmarks would also make it much easier to evaluate the work of automated safety researchers and could reduce the need for human feedback, which could be a significant bottleneck.
Better proxies could also make it easier to productively deploy more inference compute—e.g. from Large Language Monkeys: Scaling Inference Compute with Repeated Sampling: ‘When solving math word problems from GSM8K and MATH, coverage with Llama-3 models grows to over 95% with 10,000 samples. However, common methods to pick correct solutions from a sample collection, such as majority voting or reward models, plateau beyond several hundred samples and fail to fully scale with the sample budget.’ Similar findings in other references, e.g. in Trading Off Compute in Training and Inference.
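To make the connection concrete, here is a minimal sketch of repeated sampling with best-of-N selection, assuming hypothetical `generate` (a sampler), `proxy_score` (an imperfect verifier/reward model), and `is_correct` (an oracle grader) functions: coverage keeps improving with more samples, but realized accuracy is capped by how well the proxy picks out a correct sample, which is exactly where better evaluation methods would help.

```python
# Minimal sketch of repeated sampling + best-of-N selection; `generate`,
# `proxy_score`, and `is_correct` are hypothetical stand-ins (a sampler, an
# imperfect verifier/reward model, and an oracle grader, respectively).
def best_of_n(problem, n, generate, proxy_score, is_correct):
    samples = [generate(problem) for _ in range(n)]
    covered = any(is_correct(problem, s) for s in samples)        # did any sample solve it?
    picked = max(samples, key=lambda s: proxy_score(problem, s))  # what we can actually select
    return covered, is_correct(problem, picked)                   # coverage vs. realized accuracy
```

The gap between the two returned values is roughly the gap the paper describes: better proxies shrink it, making additional samples (and hence additional inference compute) worth buying.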
Of course, there’s also the option of using automated/augmented safety research to produce such benchmarks, but it seems potentially particularly useful to have them ahead of time.
This also seems like an area where we can't piggyback off of work from capabilities researchers, unlike applying automated researchers to safety more generally, where we probably can to a significant extent.
I’m torn between generally being really fucking into improving feedback loops (and thinking they are a good way to make it easier to make progress on confusing questions), and being sad that so few people are actually just trying to directly tackle the hard bits of the alignment challenge.
Some quick thoughts:
automating research that ‘tackles the hard bits’ seems likely to be harder (and to happen chronologically later), and to be more bottlenecked by good human researcher feedback (e.g. from agent foundations researchers), which does suggest it might be valuable to recruit more agent foundations researchers today
but my impression is that recruiting/forming agent foundations researchers is itself much harder and has much worse feedback loops
I expect that, if the feedback loops are good enough, automated safety researchers could make enough prosaic safety progress (feeding into agendas like control and/or conditioning predictive models, plus extensions such as steering using model internals, automated reviewing, etc.) to allow using automated agent foundations researchers with more confidence and less human feedback
the hard bits might also become much less hard, or prove to be irrelevant, if empirical feedback on at least some of them could be obtained by practicing on much more powerful models
Overall, my impression is that focusing on the feedback loops is probably significantly more scalable today, and the results would probably be enough to allow mostly deferring the hard bits to automated research.
(Edit: but also, I think if funders and field builders ‘went hard’, it might be much less necessary to choose.)
Potentially also relevant: https://www.lesswrong.com/posts/yxdHp2cZeQbZGREEN/improving-model-written-evals-for-ai-safety-benchmarking.