feed lots of good papers in [situational awareness / out-of-context reasoning / …] into GPT-4's context window,
ask it to generate 100 follow-up research ideas,
ask it to develop specific experiments to run for each of those ideas,
feed those experiments to GPT-4 copies equipped with a coding environment,
write the results up into a nice little article and send it to a human (rough code sketch below).
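To make the recipe above concrete, here is a minimal sketch of what that loop could look like, assuming the OpenAI Python client (openai >= 1.0); the prompts, the model name, and the run_in_sandbox stub are illustrative assumptions, not a working agent setup:

```python
# Minimal sketch of the pipeline above, assuming the OpenAI Python client (openai >= 1.0).
# Prompts, the model name, and run_in_sandbox are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # assumption; any sufficiently capable chat model


def ask(prompt: str) -> str:
    """Single chat-completion call."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def run_in_sandbox(experiment: str) -> str:
    # Placeholder: in the recipe this is a GPT-4 copy equipped with a coding
    # environment (i.e. a tool-using agent loop), which is not sketched here.
    return f"[results of running: {experiment[:80]}...]"


def run_pipeline(papers: list[str], n_ideas: int = 100) -> str:
    corpus = "\n\n".join(papers)

    # Steps 1-2: feed the papers into the context window, ask for follow-up ideas.
    ideas = ask(
        "Here are papers on situational awareness / out-of-context reasoning:\n\n"
        f"{corpus}\n\nGenerate {n_ideas} follow-up research ideas, one per line."
    ).splitlines()

    # Step 3: develop a specific experiment for each idea.
    experiments = [
        ask(f"Design a concrete, runnable experiment for this idea:\n{idea}")
        for idea in ideas
    ]

    # Step 4: hand each experiment to a copy with a coding environment (stubbed above).
    results = [run_in_sandbox(exp) for exp in experiments]

    # Step 5: write the results up into an article for a human reader.
    return ask(
        "Write a short article summarising these experimental results:\n\n"
        + "\n\n".join(results)
    )
```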
It's obvious, but perhaps worth reminding ourselves, that this is a recipe for automating/speeding up AI research in general, so it would be a neutral-at-best update for AI safety if it worked.
It does seem that for automation to have a disproportionately large impact on AI alignment, it would have to be specific to the research methods used in alignment. That may not necessarily mean automating the foundational and conceptual research you mention, but I do think it necessarily looks different from your suggested pipeline.
Two examples might be: a philosophically able LLM that can help you de-confuse your conceptual/foundational ideas; automating mech-interp (e.g. existing work on discovering and interpreting features) in a way that does not generalise well to other AI research directions.
At least some parts of safety research are probably differentially accelerated by automation, though (vs. capabilities research), for reasons I discuss in the appendix of this presentation (in summary: a lot of prosaic alignment research has [differentially] short horizons, in both ‘human time’ and ‘GPU time’): https://docs.google.com/presentation/d/1bFfQc8688Fo6k-9lYs6-QwtJNCPOS8W2UH5gs8S6p0o/edit?usp=drive_link.
Large parts of interpretability are also probably differentially automatable (as is already starting to happen, e.g. https://www.lesswrong.com/posts/AhG3RJ6F5KvmKmAkd/open-source-automated-interpretability-for-sparse; https://multimodal-interpretability.csail.mit.edu/maia/), both for task horizon reasons (especially if combined with something like SAEs, which would help by e.g. leading to sparser, more easily identifiable circuits / steering vectors, etc.) and for (more basic) token cheapness reasons: https://x.com/BogdanIonutCir2/status/1819861008568971325.
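Concretely, the explain-then-score loop behind automated-interpretability work like that linked above can be sketched as below; this is a rough stand-in assuming the OpenAI Python client, with hypothetical prompts, model name, and toy scoring, not the pipeline of any of the linked projects:

```python
# Minimal sketch of an automated feature-interpretation loop (explain, then score),
# assuming the OpenAI Python client (openai >= 1.0). The feature data, prompts and
# model name are illustrative assumptions, not any specific project's pipeline.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4"  # assumption


def explain_feature(top_examples: list[str]) -> str:
    """Ask an LLM for a short hypothesis about what an SAE feature represents,
    given text snippets on which the feature activates most strongly."""
    prompt = (
        "These text snippets all strongly activate the same sparse-autoencoder feature:\n\n"
        + "\n".join(f"- {s}" for s in top_examples)
        + "\n\nIn one sentence, what does this feature most plausibly represent?"
    )
    response = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content


def score_explanation(explanation: str, held_out: list[tuple[str, bool]]) -> float:
    """Crude check: ask the LLM to predict, from the explanation alone, whether the
    feature fires on held-out snippets, and report accuracy against the true labels."""
    correct = 0
    for snippet, fired in held_out:
        prompt = (
            f"A feature is described as: {explanation}\n"
            f"Would it activate on this text? Answer YES or NO.\n\n{snippet}"
        )
        response = client.chat.completions.create(
            model=MODEL, messages=[{"role": "user", "content": prompt}]
        )
        predicted = response.choices[0].message.content.strip().upper().startswith("YES")
        correct += predicted == fired
    return correct / len(held_out)
```

The point of scoring against held-out activations is that explanation quality itself becomes something the loop can measure and iterate on without a human in it, which is a big part of why this slice of interpretability looks cheaply automatable.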