At least some parts of automated safety research are probably differentially accelerated (vs. capabilities), though, for reasons I discuss in the appendix of this presentation (in summary: a lot of prosaic alignment research has [differentially] short task horizons, both in ‘human time’ and in ‘GPU time’): https://docs.google.com/presentation/d/1bFfQc8688Fo6k-9lYs6-QwtJNCPOS8W2UH5gs8S6p0o/edit?usp=drive_link.
Large parts of interpretability are also probably differentially automatable (as is already starting to happen, e.g. https://www.lesswrong.com/posts/AhG3RJ6F5KvmKmAkd/open-source-automated-interpretability-for-sparse; https://multimodal-interpretability.csail.mit.edu/maia/), both for task-horizon reasons (especially if combined with something like SAEs, which could help by e.g. yielding sparser, more easily identifiable circuits / steering vectors, etc.) and for (more basic) token-cheapness reasons: https://x.com/BogdanIonutCir2/status/1819861008568971325.
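To make the short-horizon point concrete, here's a minimal sketch of the kind of automated-interpretability loop the linked work does (an LLM proposing and checking labels for SAE latents). All names below (`query_llm`, `Feature`, etc.) are illustrative placeholders I'm assuming for the sketch, not that codebase's actual API; the point is just that each step is a short, cheap, parallelizable LLM call.

```python
# Hypothetical sketch of automated interpretability over SAE features.
# Placeholders only; not the API of the linked open-source pipeline.
from dataclasses import dataclass
from typing import List


@dataclass
class Feature:
    """One SAE latent plus the text snippets that activate it most strongly."""
    index: int
    top_activating_snippets: List[str]


def query_llm(prompt: str) -> str:
    """Placeholder for a call to any LLM API; swap in a real client here."""
    raise NotImplementedError("wire up an actual model call")


def explain_feature(feature: Feature) -> str:
    """Ask the model for a one-line hypothesis about what the latent encodes.

    Each call is a short, independent task over a handful of snippets --
    which is why this kind of work parallelizes and automates cheaply.
    """
    examples = "\n".join(f"- {s}" for s in feature.top_activating_snippets)
    prompt = (
        "These text snippets strongly activate one latent of a sparse autoencoder:\n"
        f"{examples}\n"
        "In one sentence, what concept does this latent most plausibly represent?"
    )
    return query_llm(prompt)


def check_explanation(explanation: str, held_out: List[str]) -> str:
    """Crude stand-in for scoring: ask whether the proposed concept applies
    to held-out snippets, as a proxy for predicting activations."""
    snippets = "\n".join(f"- {s}" for s in held_out)
    prompt = (
        f"Concept: {explanation}\n"
        "For each snippet below, answer yes/no: does the concept apply?\n"
        f"{snippets}"
    )
    return query_llm(prompt)
```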